This should possibly be related to the blacklist system. A "hot" query of this type would be part of the pipeline and done at the earliest stages, as part of the metadata get.
@pgaudet I think we need to touch base on this as "low-hanging fruit"--there is a bit more here, as we need to draw in external APIs, etc.
@pgaudet "there is a file of all retracted PMIDs that is available for download"
This would be easier than using the API over all PMIDs
@cmungall, what is the link to the file with the retracted PMIDs that is available for download? I did not know one existed.
When I looked into the NCBI API that would return the list of retracted PMIDs, it would only return a maximum of 10K records. There are ~20k retracted publications. Retrieving all of the retracted publications requires downloading an application (NCBI's EDirect tools).
Not to derail this conversation, and it's fine to work out where potential resources are, but I'd still like to touch base with @pgaudet on this before proceeding.
@mugitty - just paginate til you get them all?
What do you mean downloading an application?
@cmungall, the URL to retrieve the retracted publications is: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=0
Note, the response indicates that there are 19582 retracted publications.
The URL can be modified to return more than the default 20 records by specifying the RetMax parameter: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=0&RetMax=9999
When the URL is updated to retrieve the 10000th record and greater: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=10000
the system responds with the error message: ... Search Backend failed: Exception: 'retstart' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/ ...
This is not just NCBI: in the absence of large data stores and web servers, this model of supporting high-volume pagination over large datasets via API is unsustainable.
There is a way to retrieve all the retracted publications, but it is not via API. At a minimum, this list of retracted publications has to be updated with every GO release, which will add overhead to the release process (tagging @kltm). The list of retracted publications can be stored as YAML, JSON, etc. The GO rule validator would have to parse it and keep it in memory as a hash set (negligible impact) to cross-check against the references.
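As a minimal sketch of that cross-check (file names here are placeholders, and the list is assumed to hold one PMID per line in the same PMID:nnnn CURIE form that appears in GAF column 6):

```
# Report annotation lines whose reference column (GAF column 6) cites a
# retracted PMID. The retracted list is loaded into an awk hash, so the
# per-line cost is a single lookup.
awk -F'\t' '
  NR == FNR { retracted[$1]; next }   # first file: retracted PMID list
  /^!/ { next }                       # skip GAF header/comment lines
  {
    n = split($6, refs, "|")          # column 6 may hold several references
    for (i = 1; i <= n; i++)
      if (refs[i] in retracted) { print "retracted reference: " $0; break }
  }
' retracted-publications.txt annotations.gaf
```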
> Note, the response indicates that there are 19582 retracted publications.
... and only 10000 can be downloaded at a time.
Here is a truly ugly hack.
1. Search for "Retracted publication"[pt] to get the list of 19,582 items.
2. Choose the "save" option from the bar just under the header, and choose format: PMID (all you need if your goal is to get a list to be checked against the list of references you have relied on). That gets you the first 10,000, starting from the oldest.
3. On the same PubMed results page, click the little up-arrow in the small box next to the "sort by: publication date" box near the top of the page, and save again. That gets you the first 10,000, starting from the newest.
4. Concatenate and uniquify the two lists; the result should be what you need.

Truly, truly ugly, but it has the effect of getting a local copy of all retracted PMIDs that local code can check against a local list of PMIDs used for annotation, without hammering anyone's API.
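The final merge step (step 4), assuming the two saved lists are plain one-PMID-per-line files (names are placeholders):

```
# Merge the oldest-first and newest-first exports and drop duplicates.
cat retracted-oldest-10000.txt retracted-newest-10000.txt | sort -u > retracted-all.txt
wc -l retracted-all.txt   # should be close to the ~19,582 total reported by PubMed
```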
A disturbing side note: the list goes back to 1951. The 20 most recent retractions appear to have happened since mid-August 2023, about 4 months. The oldest 20 took 27 years to accumulate, from 1951 to 1978. This also suggests that, at the current rate of retraction, we will soon need a way of getting more than 2 x 10,000 items, so an improvement on this hack will be needed.
Talking to @pgaudet this morning, and following up on last week's conversation with @mugitty: while this is certainly something we want to do, it's no longer on the immediate TODO list for this specific project. We'll keep it open in this project, as we want to make sure we have eyes on it for how to proceed, but we also want to plan this out for stability and consistency, given how our pipeline currently works.
> Note, the response indicates that there are 19582 retracted publications.
> ... and only 10000 can be downloaded at a time.
> Here is a truly ugly hack [...] but it has the effect of getting a local copy of all retracted PMIDs that local code can check against a local list of PMIDs used for annotation, without hammering anyone's API.
We also have to keep in mind ToS of eutils, etc. Ideally, we would be able to grab an upstream file and simply filter against it. Second best is making that upstream file ourselves and maintaining it as best we can.
Check out SemOpenAlex: https://semopenalex.org/
You can query for retracted PMIDs: https://api.triplydb.com/s/RpkYEr-qN
@pgaudet, https://semopenalex.org/ was back up today. I ran the following query for retracted PMIDs. This site only gives 7021 results, whereas we got over 20000 results on 20240321.
The command I used to retrieve the results is:

```
curl https://semopenalex.org/sparql --data query=PREFIX%20rdf%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX%20fabio%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fspar%2Ffabio%2F%3E%0APREFIX%20soa%3A%20%3Chttps%3A%2F%2Fsemopenalex.org%2Fontology%2F%3E%0ASELECT%20%2A%20WHERE%20%7B%0A%20%20%3Fpub%20fabio%3AhasPubMedId%20%3Fpmid%20.%0A%20%20%3Fpub%20soa%3AisRetracted%20true%20.%0A%7D -X POST > retracted.xml
```

or open a browser at https://yasgui.triply.cc/#, enter the following query (the URL-decoded form of the same request), and select page size 'All':

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX soa: <https://semopenalex.org/ontology/>
SELECT * WHERE {
  ?pub fabio:hasPubMedId ?pmid .
  ?pub soa:isRetracted true .
}
```
I have attached the output from the query. retracted.csv
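For a quick comparison of counts, something like the following should work, assuming the CSV has a header row and the PMID in the second column (I have not verified the exact layout):

```
# Count distinct PMIDs returned by SemOpenAlex, for comparison with the
# ~20k reported by PubMed / Europe PMC.
tail -n +2 retracted.csv | cut -d ',' -f 2 | sort -u | wc -l
```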
@pgaudet @mugitty I want to check in on this as we are now pulling in external files from external resources and needing to thread them into the system. If this is "low-hanging fruit", I want to make sure we're doing this in a robust and flexible way (e.g. file storage, update frequency).
@kltm, the plan is for @pgaudet to create a file with all the retracted publications and update it at 'certain intervals'. The annotation parsers can use the file to check for retracted publications. This will free the pipeline from depending on unreliable external resources.
I think I'm wanting to hammer out exact availability and frequency here. Naturally, once we have the file worked out, it will be made available statically in the pipeline (where ontobio will run). Generally speaking, when we start working with external resources on internal systems, we want to make sure that expectations and use are hammered out. (Typically, these kinds of things would be worked out in the "architecture" portion of project planning, but since this has become a bit of a "rolling" project, we haven't had a chance to do that this time. I just want to follow through on this part.)
~Talking to @pgaudet, things we need to work out:~
~- [ ] what is the source~
~- [ ] what format is the source~
~- [ ] where will we "cache" the source~
~- [ ] what is the update frequency of the source~
~- [ ] where does the source get used, specifically (i.e. as an optional CLI arg to ontobio)~
~- [ ] how does the source get used, specifically (i.e. read into memory for each GAF/GPAD process)~
~- [ ] does this affect runtime (per-line; doubtful, as it is likely a hash lookup or something, but something to be mindful of)~
Now see https://github.com/geneontology/project-management/issues/91 for discussion
Another possible source of the data is here: https://europepmc.org/betaSearch?query=%28PUB_TYPE%3A%22Retracted%20Publication%22%29&page=1
This has 21k publications, which seems to be about the same number as in PubMed. This is downloadable.
We can set up a notification.
I attach the list of 21k PMIDs.
@pgaudet, please add the file in version control so that the ontobio parser can refer to it.
Thanks
Talking to @cmungall, we'll want to try eUtils first (we want the authoritative source), and talk to them (or the community) about bypassing the restrictions.
Great! Who can contact eutils?
@pgaudet Let's unpack some of this on the call tomorrow.
The original issue with eutils was the 10k limitation, so this works as long as we can get around it.
I think it will be possible, either by using approaches like https://academia.stackexchange.com/questions/191088/how-can-i-get-around-the-10000-search-result-limit-in-pubmed, or by asking them.
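One common workaround (a sketch only, not verified against that thread) is to slice the search into publication-date ranges so each slice stays under the ~10k ESearch cap, then merge the ID lists; the date boundaries below are illustrative and would need to be chosen so no slice exceeds the cap:

```
BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed"
for range in "1900:2015" "2016:2020" "2021:3000"; do
  # %5B / %5D are the URL-encoded [ ] around the pt and dp field tags
  curl -s "${BASE}&term=Retracted+Publication%5Bpt%5D+AND+${range}%5Bdp%5D&RetMax=9999" \
    | grep -oE "<Id>[0-9]+</Id>" | grep -oE "[0-9]+"
done | sort -u > retracted-pmids.txt
```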
Is the action for @mugitty to look at that forum to find how this can be done?
I already looked into this and responded to this ticket on December 11, 2023: "... Search Backend failed: Exception: 'retstart' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/ ..."
It can be done. It is a question about who does it and how often.
From morning discussion:
retracted-publications.txt
@mugitty just a note: for format and "line per pub", the formal format we'll be targeting will be CURIEs, not internal IDs.
Yes
Okay, quick play here, I think I have something with: retracted-publications.txt.
I pulled this out with:
esearch -db pubmed -query "Retracted Publication [pt]" | efetch -format pubmed > /tmp/pubmed-retracted.xml
cat /tmp/pubmed-retracted.xml | grep -oh ">[0-9]*<\/PMID>" | sort | uniq | cut -d '>' -f 2 | cut -d '<' -f 1 | sed 's/^/PMID:/'
If I were to do this again, I might try a different command, which would make the retractions more clear (I'm not 100% sure above, which is why I haven't committed it to the repo yet).
esearch -db pubmed -query "Retracted Publication [pt]" | efetch -format pubmed -mode asn.1 > /tmp/pubmed-retracted.txt
However, I seem to have hit some kind of query limit; best to try again later.
Okay, I'm not liking my file here. I think reprocessing Pascale's above is a good choice for now.
cat europepmc_id.txt | cut -d ',' -f 1 | sort | uniq > retracted-publications-2.txt
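A couple of quick sanity checks on the regenerated list (file name as produced above):

```
# Expect roughly 21k lines and no duplicate entries.
wc -l retracted-publications-2.txt
sort retracted-publications-2.txt | uniq -d | head   # should print nothing
```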
@kltm, do you want to add it to metadata somewhere for now, so that I can work off of it?
Thanks
From a conversation with @mugitty , I wanted to clarify the current state.
@mugitty The europe-pmc-retracted.txt file is here:
~https://github.com/geneontology/go-site/blob/master/docs/europe-pmc-retracted.txt~
Moved to a better location: https://github.com/geneontology/go-site/blob/master/metadata/retracted-publications.txt
Updated today. I made a note in my calendar to update it monthly.
Thanks, Pascale
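For anyone wiring this into local tooling, the file can be fetched directly via GitHub's raw URL (the standard raw counterpart of the blob link above):

```
curl -sL https://raw.githubusercontent.com/geneontology/go-site/master/metadata/retracted-publications.txt \
  -o retracted-publications.txt
wc -l retracted-publications.txt
```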
@pgaudet Maybe next week we can work out my third point here? https://github.com/geneontology/go-site/issues/676#issuecomment-2033029334
@kltm sure
But for now -
I did not add the metadata to the repo as I'm a little concerned that the europepmc file and the data I extracted with eutils are quite different.
@mugitty and I figured that if the europepmc file has most of the content then this is better than having no check at all.
If you have the complete file from eutils I can compare them; we suspect there are some synchronization issues?
@kltm, @pgaudet, currently the retracted publications file is in docs. Do you want to move it into metadata? I understand the contents and format will change.
@pgaudet The file is at https://github.com/geneontology/go-site/issues/676#issuecomment-2027818161
If you're putting this in, the filename should be generic, like "retracted-publications.txt" or the like. As well, please add a note to the README.md in that directory.
@mugitty will check status of this one.
Working on snapshot:
Check for, and filter, annotations made to retracted publications
UniProtKB P55211 CASP9 enables GO:0008233 PMID:20663920 IDA F Caspase-9 CASP9|MCH6 protein taxon:9606 20120309 MGI UniProtKB:P55211
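A sketch of the filter step, as a counterpart to the check sketched earlier in the thread (file names are placeholders; the retracted list is assumed to use PMID:nnnn CURIEs, and column 6 is split on '|'):

```
# Write a filtered GAF that excludes annotations citing a retracted PMID.
awk -F'\t' '
  NR == FNR { retracted[$1]; next }   # first file: retracted PMID list
  /^!/ { print; next }                # keep GAF header/comment lines
  {
    keep = 1
    n = split($6, refs, "|")
    for (i = 1; i <= n; i++) if (refs[i] in retracted) { keep = 0; break }
    if (keep) print
  }
' retracted-publications.txt annotations.gaf > annotations-filtered.gaf
```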
From @vanaukenk on December 19, 2016 14:32
Hi, Following on from a help desk ticket: http://jira.geneontology.org/browse/GO-1431
Can we explore adding a QC check for annotations to retracted publications?
A possible approach might be:
PubMed indexes retracted publications in the PublicationTypeList element. Here's an example:

```xml
<PublicationTypeList>
  <PublicationType UI="D016428">Journal Article</PublicationType>
  <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
  <PublicationType UI="D013486">Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
  <PublicationType UI="D016441">Retracted Publication</PublicationType>
</PublicationTypeList>
```
Perhaps implementing a periodic query to PubMed for articles with Type "Retracted Publication" and then checking those PMIDs against the PMIDs in the GO database would work.
Thx.
Copied from original issue: geneontology/go-annotation#1479
This corresponds to GAF column 6 (DB:Reference) / GPAD 1.1 and 2.0 column 5.