dami82 / easyPubMed

easyPubMed package for R - dev version

Number of records retrieved with batch_pubmed_download() don't match the site #4

Open JFormoso opened 4 years ago

JFormoso commented 4 years ago

Hi! I can't figure out what I am doing wrong. PubMed shows 66 records and the df resulting from this code returns 27. I've tried altering the syntax of the string and I always get the same result. If anyone can point me in the right direction... Thanks!

busqueda <- '((inference) AND (verbal ability)) AND (comprehension)'

output <- batch_pubmed_download(pubmed_query_string = busqueda, dest_file_prefix = "NUBL18", encoding = "ASCII")

archivo <- output[[1]]

base <- table_articles_byAuth(pubmed_data = archivo, included_authors = "first", max_chars = -1, encoding = "ASCII")

dami82 commented 4 years ago

Hello,

Thanks for reporting this issue. I could reproduce your results, and there is no error in the code. The count difference you observed between the PubMed website and easyPubMed is a known issue. In May 2020, NCBI rolled out a new version of PubMed (web). The newly released PubMed web service is built on top of a new database, while programmatic access is only supported for the legacy (old) PubMed. This is the main reason for the query count differences. Unfortunately, I cannot do anything about this until NCBI releases a new API. This has been announced but is not available yet (nor is it known when it will be released), so I anticipate that this problem may persist for at least a few more months...

Best regards.

dami82 commented 4 years ago

Hello,

I just updated my easyPubMed source on GitHub. I added a new function, fetch_pubmed_data_by_PMID(), that supports retrieving PubMed data using a list of PMIDs (for example, you can use a PMID list exported from the PubMed website). Note: this can be used as a workaround to retrieve PubMed data matching the results of a web PubMed query. Also note that MeSH term codes are now automatically extracted and returned as part of the output. A vignette is attached. Let me know what you think.
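The workaround described above could be prepared roughly as in the sketch below. It assumes the PMID list was exported from the PubMed website as a plain-text file with one PMID per line. The helpers read_pmid_export() and split_into_batches(), the batch size, and the placeholder PMIDs are hypothetical and not part of easyPubMed. The signature of fetch_pubmed_data_by_PMID() is not shown in this thread, so the final call is left commented out as an assumption to be checked against the vignette.

```r
# Hypothetical helper: read PMIDs exported from the PubMed website,
# assuming a plain-text file with one PMID per line.
read_pmid_export <- function(path) {
  pmids <- trimws(readLines(path, warn = FALSE))
  # Keep only purely numeric entries and drop duplicates
  unique(pmids[grepl("^[0-9]+$", pmids)])
}

# Hypothetical helper: split a PMID vector into batches of a given size
split_into_batches <- function(pmids, batch_size = 100) {
  if (length(pmids) == 0) return(list())
  groups <- ceiling(seq_along(pmids) / batch_size)
  unname(split(pmids, groups))
}

# Demonstration with a temporary file standing in for a real export
# (the PMIDs below are placeholders for illustration only)
demo_file <- tempfile(fileext = ".txt")
writeLines(c("11111111", "22222222", "", "22222222", "not-a-pmid"), demo_file)

pmids <- read_pmid_export(demo_file)
batches <- split_into_batches(pmids, batch_size = 1)

# Assumed usage; the call signature is a guess, check the vignette first:
# records <- lapply(batches, function(b) easyPubMed::fetch_pubmed_data_by_PMID(b))
```

Cleaning and de-duplicating the list before fetching makes it easier to compare the number of requested PMIDs with the number of records that come back.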

Best regards, Damiano

On Fri, 26 Jun 2020 at 17:56, JFormoso notifications@github.com wrote:

Hi! I can't figure out what I am doing wrong. PubMed shows 66 records and the df resulting from this code returns 27. I've tried altering the syntax of the string and I always get the same result. If anyone can point me in the right direction... Thanks!

busqueda <- '((inference) AND (verbal ability)) AND (comprehension)'

output <- batch_pubmed_download(pubmed_query_string = busqueda, dest_file_prefix = "NUBL18", encoding = "ASCII")

archivo <- output[[1]]

base <- table_articles_byAuth(pubmed_data = archivo, included_authors = "first", max_chars = -1, encoding = "ASCII")


SchmidtPaul commented 3 years ago

The count difference you observed between the PubMed website and easyPubMed is a current known issue. [...] I cannot do anything about this until NCBI releases a new API [...] I anticipate that this problem may persist for at least a few more months.

Hi there, I am assuming there is nothing new on this topic, right? Either way, I just want to let you know that I am eager for this as well and that the package is great. For completeness, here is a reproducible example:

PubMed website

image

easyPubMed

easyPubMed::get_pubmed_ids("patience")$Count
#> [1] "1903"

Created on 2021-01-12 by the reprex package (v0.3.0.9001)

pintodossantos commented 2 years ago

Will the new version with fetch_pubmed_data_by_PMID() be pushed to CRAN?

Epi-Emma commented 9 months ago

Hello, I'm just checking on the status of this issue with the release of 3.03. I am doing a very large query, which returns 19,785 records on PubMed and 19,637 records from my easyPubMed fetch using the same syntax (publications from 2020+ -- syntax excluded due to length, but happy to share). The N's are very similar, but I'm missing just over 100 records from the easyPubMed fetch. Do you know why there might be differences in the N's? Thank you!

dami82 commented 9 months ago

Hi Epi-Emma, thanks for testing the latest version of easyPubMed. If you share the query string, I'll look into this. Also, if you download the records using the approach outlined below, you may get additional info. Note: this assumes you want to write the XML records to a local folder.

q <- 'your query string'
x <- epm_query(query_string = q)
x <- epm_fetch(x, write_to_file = TRUE)
x

Once you print/show the object x, you'll get info about the number of expected records and the number of retrieved records. This way, you can see if there is a mismatch. Likely, the expected number will match what you see on the website, while the fetched_record number may differ. Is this the case? Note: this won't print all of x's contents to the console, don't worry.
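If the two numbers do differ, one way to narrow things down is to compare the PMIDs exported from the PubMed website with the PMIDs that were actually retrieved. The sketch below uses base R only. The helper find_missing_pmids() and both input vectors are hypothetical placeholders; extracting the PMIDs from the fetched object is left out because it is not shown in this thread.

```r
# Hypothetical helper: report which PMIDs appear in one set but not the other
find_missing_pmids <- function(website_pmids, fetched_pmids) {
  website_pmids <- unique(as.character(website_pmids))
  fetched_pmids <- unique(as.character(fetched_pmids))
  list(
    n_website = length(website_pmids),
    n_fetched = length(fetched_pmids),
    missing_from_fetch = setdiff(website_pmids, fetched_pmids),
    only_in_fetch = setdiff(fetched_pmids, website_pmids)
  )
}

# Placeholder PMIDs for illustration only
website_pmids <- c("101", "102", "103", "104")
fetched_pmids <- c("101", "103", "103", "105")

result <- find_missing_pmids(website_pmids, fetched_pmids)
print(result$missing_from_fetch)  # PMIDs shown on the website but not fetched
print(result$only_in_fetch)       # PMIDs fetched but absent from the website export
```

Looking at a handful of the missing PMIDs on the website (publication date, record status) may hint at why they were not returned by the programmatic query.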