geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Implement gorule-0000022 QC check for annotations to retracted publications #676

Closed pgaudet closed 4 months ago

pgaudet commented 6 years ago

From @vanaukenk on December 19, 2016 14:32

Hi, Following on from a help desk ticket: http://jira.geneontology.org/browse/GO-1431

Can we explore adding a QC check for annotations to retracted publications?

A possible approach might be:

PubMed indexes retracted publications in the PublicationTypeList tag. Here's an example:

<PublicationTypeList>
  <PublicationType UI="D016428">Journal Article</PublicationType>
  <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
  <PublicationType UI="D013486">Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
  <PublicationType UI="D016441">Retracted Publication</PublicationType>
</PublicationTypeList>

Perhaps implementing a periodic query to PubMed for articles with Type "Retracted Publication" and then checking those PMIDs against the PMIDs in the GO database would work.

Thx.

Copied from original issue: geneontology/go-annotation#1479

This corresponds to GAF column 6 / GPAD 1.1/2.0 column 5.
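A minimal sketch of what the check could look like at the line level, assuming tab-separated annotation lines and pipe-separated references in the column named above. The function names and the `retracted` set are hypothetical; this is not ontobio's implementation.

```python
# Zero-based index of the reference column, per the comment above:
# GAF column 6, GPAD 1.1/2.0 column 5.
REFERENCE_COLUMN = {"gaf": 5, "gpad": 4}


def pmids_in_line(line: str, fmt: str = "gaf") -> list[str]:
    """Return the PMID CURIEs found in the reference column of one annotation line."""
    columns = line.rstrip("\n").split("\t")
    index = REFERENCE_COLUMN[fmt]
    if len(columns) <= index:
        return []
    # Assumption: multiple references are pipe-separated,
    # e.g. "PMID:111|GO_REF:0000033".
    refs = [r.strip() for r in columns[index].split("|")]
    return [r for r in refs if r.startswith("PMID:")]


def cites_retracted(line: str, retracted: set[str], fmt: str = "gaf") -> bool:
    """True if any PMID referenced on the line is in the retracted set."""
    return any(pmid in retracted for pmid in pmids_in_line(line, fmt))
```

Only `PMID:` references are compared, so other reference types such as `GO_REF:` never trigger the check.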

kltm commented 6 years ago

This should possibly be related to the blacklist system. A "hot" query of this type would be part of the pipeline and done at the earliest stages, as part of the metadata get.

kltm commented 1 year ago

@pgaudet I think we need to touch base on this as "low-hanging" fruit--there is a bit more here as we need to draw in external APIs, etc.

cmungall commented 11 months ago

@pgaudet "there is a file of all retracted PMIDs that is available for download"

This would be easier than using the API over all PMIDs

mugitty commented 11 months ago

@cmungall , What is the link to the file with the retracted PMIDs that is available for download? I did not know one existed.

When I looked into the NCBI API that would return the list of retracted PMIDs, it would only return a maximum of 10K records. There are ~20k retracted publications. Retrieving all of the retracted publications requires downloading an application.

kltm commented 11 months ago

Not to derail this conversation, and it's fine to work out where potential resources are, but I'd still like to touch base with @pgaudet on this before proceeding.

cmungall commented 11 months ago

@mugitty - just paginate til you get them all?

What do you mean downloading an application?

mugitty commented 11 months ago

@cmungall, the URL to retrieve the retracted publications is: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=0

Note, the response indicates that there are 19582 retracted publications.

The URL can be modified to return more than the default 20 records by specifying the RetMax parameter: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=0&RetMax=9999

When the URL is updated to retrieve the 10000th record and greater: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=10000

The system responds with error message: ... Search Backend failed: Exception: 'retstart' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/ ...
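To make the limit concrete, here is a rough sketch that builds paged ESearch URLs and reports how many records stay out of reach under the 9,999-record cap quoted in the error message. The function name and the page size of 1000 are assumptions for illustration; no request is sent.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
# Limit quoted in the error message: only the first 9,999 records are reachable.
MAX_REACHABLE = 9999


def esearch_page_urls(total: int, page_size: int = 1000) -> tuple[list[str], int]:
    """Build paged ESearch URLs and report how many records stay out of reach."""
    reachable = min(total, MAX_REACHABLE)
    urls = []
    for start in range(0, reachable, page_size):
        params = {
            "db": "pubmed",
            "term": "Retracted Publication[pt]",
            "retstart": start,
            # Never ask for records past the cap.
            "retmax": min(page_size, reachable - start),
        }
        urls.append(f"{ESEARCH}?{urlencode(params)}")
    return urls, total - reachable
```

With the 19582 records reported above, this yields ten page URLs and leaves 9583 records unreachable through plain ESearch paging.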

This is not just NCBI. In the "absence of large data stores and web servers", this model of supporting "HIGH VOLUME" pagination with "LARGE DATASETS" via API is unsustainable.

There is a way to retrieve all the retracted publications, but it is not via the API. At a minimum, this list of retracted publications has to be updated with every "GO release". This will add to the release process overhead (tagging @kltm). The list of retracted publications can be stored as YAML, JSON, etc. The GO rule validator would have to parse it and keep it in memory as a hash set (negligible impact) to cross-check against the references.
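A minimal sketch of the in-memory hash set idea, assuming a plain-text file with one identifier per line rather than YAML or JSON. The function name and the handling of `#` comment lines are assumptions, not part of any existing validator.

```python
from pathlib import Path


def load_retracted(path: str) -> frozenset[str]:
    """Read one identifier per line into an immutable set, skipping blanks and comments."""
    ids = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return frozenset(ids)
```

Membership tests against the resulting set are constant time on average, so the per-line cost during GAF/GPAD parsing should stay small even with tens of thousands of entries.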

deustp01 commented 11 months ago

> Note, the response indicates that there are 19582 retracted publications.

... and only 10000 can be downloaded at a time.

Here is a truly ugly hack. Search for "Retracted publication"[pt] to get the list of 19,582 items. Then choose the "save" option from the bar just under the header, and choose format:PMID (all you need if your goal is to get a list to be checked against your list of references that you have relied on). That will get you a list of the first 10,000 starting from the oldest. Then on the same PubMed results page click on the little up-arrow in the small box next to the sort by: publication date box near the top of the page. That will get the first 10,000 starting from the newest. The two lists, concatenated and uniquified should be what you need. Truly, truly ugly, but it has the effect of getting a local copy of all retracted PMIDs that local code can check against a local list of PMIDs used for annotation, without hammering anyone's API.
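The concatenate-and-uniquify step can be sketched as follows, assuming each saved export is a list of bare numeric PMIDs. The function name is hypothetical, and the completeness flag simply compares the merged count with the total that PubMed reports.

```python
def merge_exports(
    oldest_first: list[str], newest_first: list[str], expected_total: int
) -> tuple[list[str], bool]:
    """Concatenate two capped exports, drop duplicates, and flag incomplete coverage."""
    unique = {p.strip() for p in oldest_first + newest_first if p.strip()}
    merged = sorted(unique, key=int)  # numeric order, assuming bare PMIDs
    return merged, len(merged) == expected_total
```

The flag turns false once the total exceeds what two capped exports can cover, which is the failure mode anticipated in the side note below.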


A disturbing side note: the list goes back to 1951. The 20 most recent retractions appear to have happened since mid-August 2023, about 4 months. The oldest 20 took 27 years to accumulate, from 1951 to 1978. This also suggests that, at the current rate of retraction we are soon going to need a way of getting more than 2 x 10,000 items so an improvement on this hack will be needed.

kltm commented 11 months ago

Talking to @pgaudet this morning, and following up with conversation from last week with @mugitty, while this is certainly something we want to do, it's no longer in the immediate TODO list for this specific project. We'll keep it open in this project, as we want to make sure we have eyes on it for how to proceed, but we want to make sure we plan this out for stability and consistency given how our pipeline currently is working.

kltm commented 11 months ago

> Note, the response indicates that there are 19582 retracted publications.

> ... and only 10000 can be downloaded at a time.

> Here is a truly ugly hack [...] but it has the effect of getting a local copy of all retracted PMIDs that local code can check against a local list of PMIDs used for annotation, without hammering anyone's API.

We also have to keep in mind ToS of eutils, etc. Ideally, we would be able to grab an upstream file and simply filter against it. Second best is making that upstream file ourselves and maintaining it as best we can.

balhoff commented 11 months ago

Check out SemOpenAlex: https://semopenalex.org/

You can query for retracted PMIDs: https://api.triplydb.com/s/RpkYEr-qN

mugitty commented 8 months ago

@pgaudet, https://semopenalex.org/ was back up today. I ran the following query for retracted PMIDs. This site gives only 7021 results, whereas we got over 20000 results on 20240321.

The command I used to retrieve the results is: curl https://semopenalex.org/sparql --data query=PREFIX%20rdf%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX%20fabio%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fspar%2Ffabio%2F%3E%0APREFIX%20soa%3A%20%3Chttps%3A%2F%2Fsemopenalex.org%2Fontology%2F%3E%0ASELECT%20%2A%20WHERE%20%7B%0A%20%20%3Fpub%20fabio%3AhasPubMedId%20%3Fpmid%20.%0A%20%20%3Fpub%20soa%3AisRetracted%20true%20.%0A%7D -X POST > retracted.xml

or open a browser to https://yasgui.triply.cc/#, enter the following query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX soa: <https://semopenalex.org/ontology/>
SELECT * WHERE {
  ?pub fabio:hasPubMedId ?pmid .
  ?pub soa:isRetracted true .
}

and select page size 'All'.
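For reference, the percent-encoded string passed to `curl --data` above is just this query run through standard URL quoting. A small sketch of that step, with `form_body` as a hypothetical helper name; nothing is sent over the network.

```python
from urllib.parse import quote

QUERY = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX soa: <https://semopenalex.org/ontology/>
SELECT * WHERE {
  ?pub fabio:hasPubMedId ?pmid .
  ?pub soa:isRetracted true .
}"""


def form_body(query: str) -> str:
    """Percent-encode a SPARQL query as the value of a `query=` form field."""
    # safe="" also encodes "/" so the IRIs inside the prefixes survive intact.
    return "query=" + quote(query, safe="")
```

Keeping the readable query in source and encoding it at call time makes later changes to the query easier to review than editing the encoded form by hand.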

I have attached the output from the query. retracted.csv

kltm commented 8 months ago

@pgaudet @mugitty I want to check in on this as we are now pulling in external files from external resources and needing to thread them into the system. If this is "low-hanging fruit", I want to make sure we're doing this in a robust and flexible way (e.g. file storage, update frequency).

mugitty commented 8 months ago

@kltm, the plan is for @pgaudet to create a file with all the retracted publications and @pgaudet will update at 'certain intervals'. The annotation parsers can use the file to check for retracted publications. This will free the pipeline from being dependent on undependable external resources.

kltm commented 8 months ago

I think I'm wanting to hammer out exact availability and frequency here. Naturally, once we have the file worked out, it will be made available statically in the pipeline (where ontobio will run). Generally speaking, when we start working with external resources on internal systems, we want to make sure that expectations and use are hammered out. (Typically, these kinds of things would be hammered out in the "architecture" portion of project planning, but since this has become a bit of a "rolling" project, we haven't had a chance to do that this time. I just want to follow through on this part.)

kltm commented 8 months ago

~Talking to @pgaudet, things we need to work out:~
~- [ ] what is the source~
~- [ ] what format is the source~
~- [ ] where will we "cache" the source~
~- [ ] what is the update frequency of the source~
~- [ ] where does the source get used, specifically (i.e. as an optional CLI arg to ontobio)~
~- [ ] how does the source get used, specifically (i.e. read into memory for each GAF/GPAD process)~
~- [ ] does this affect runtime (as per-line; doubtful, as likely hash lookup or something, but something to be mindful of)~

Now see https://github.com/geneontology/project-management/issues/91 for discussion

pgaudet commented 8 months ago

Another possible source of the data is here: https://europepmc.org/betaSearch?query=%28PUB_TYPE%3A%22Retracted%20Publication%22%29&page=1

This has 21k publications, which seems the same number as in PubMed. This is downloadable.

We can set up a notification.

I attach the list of 21k PMIDs.

pgaudet commented 8 months ago

europepmc_id.txt

mugitty commented 8 months ago

@pgaudet, please add the file in version control so that the ontobio parser can refer to it.

Thanks

kltm commented 8 months ago

Talking to @cmungall , we'll want to try eUtils first (wanting authoritative), and talk to them (or the community) about bypassing the restrictions

pgaudet commented 8 months ago

Great! who can contact eutils?

kltm commented 8 months ago

@pgaudet Let's unpack some of this on the call tomorrow.

mugitty commented 8 months ago

The original issue with eutils was the 10k limitation. As long as we can get around it.

kltm commented 8 months ago

I think it will be possible, either using things like https://academia.stackexchange.com/questions/191088/how-can-i-get-around-the-10000-search-result-limit-in-pubmed, or by asking them.

pgaudet commented 8 months ago

Is the action for @mugitty to look at that forum to find how this can be done?

mugitty commented 8 months ago

I already looked into this and responded to this ticket on December 11, 2023..... Search Backend failed: Exception: 'retstart' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/....

It can be done. It is a question about who does it and how often.

kltm commented 8 months ago

https://github.com/geneontology/go-site/blob/master/metadata/rules/gorule-0000022.md

kltm commented 8 months ago

From morning discussion:

kltm commented 8 months ago

@mugitty just a note: for format and "line per pub", the formal format we'll be targeting will be CURIEs, not internal IDs.

mugitty commented 8 months ago

> @mugitty just a note: for format and "line per pub", the formal format we'll be targeting will be CURIEs, not internal IDs.

Yes

kltm commented 8 months ago

Okay, quick play here, I think I have something with: retracted-publications.txt.

I pulled this out with:

esearch -db pubmed -query "Retracted Publication [pt]" | efetch -format pubmed > /tmp/pubmed-retracted.xml
cat /tmp/pubmed-retracted.xml | grep -oh ">[0-9]*<\/PMID>" | sort | uniq | cut -d '>' -f 2 | cut -d '<' -f 1 | sed 's/^/PMID:/'

If I were to do this again, I might try a different command, which would make the retractions more clear (I'm not 100% sure above, which is why I haven't committed it to the repo yet).

esearch -db pubmed -query "Retracted Publication [pt]" | efetch -format pubmed -mode asn.1 > /tmp/pubmed-retracted.txt

However, I seem to have hit some kind of query limit; best to try again later.
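One possible source of noise in a grep over every PMID element is that PubMed XML also carries PMIDs of linked records (for example inside CommentsCorrections), not only the article's own. That is an assumption about the XML layout, not something established in this thread. A hedged sketch that keeps only each record's own PMID and rechecks the publication type, assuming a single well-formed document with PubmedArticle/MedlineCitation/PMID nesting:

```python
import xml.etree.ElementTree as ET


def retracted_curies(xml_text: str) -> list[str]:
    """Return sorted, de-duplicated PMID CURIEs for retracted records only."""
    root = ET.fromstring(xml_text)
    found = set()
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        types = {t.text for t in citation.iter("PublicationType")}
        # find() looks at direct children only, so PMIDs of linked records
        # nested deeper in the citation are ignored.
        pmid = citation.find("PMID")
        if "Retracted Publication" in types and pmid is not None and pmid.text:
            found.add(int(pmid.text))
    return [f"PMID:{n}" for n in sorted(found)]
```

The output is one CURIE per entry, sorted numerically rather than lexically as `sort` would do in the shell pipeline.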

kltm commented 8 months ago

Okay, I'm not liking my file here. I think reprocessing Pascale's above is a good choice for now.

cat europepmc_id.txt | cut -d ',' -f 1 | sort | uniq > retracted-publications-2.txt
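A rough Python equivalent of that pipeline, with the CURIE prefix added as per the format note earlier in the thread. It assumes the first comma-separated field holds a numeric PMID, optionally already prefixed with `PMID:`; the function name is hypothetical.

```python
def to_curies(lines: list[str]) -> list[str]:
    """Mirror `cut -d ',' -f 1 | sort | uniq`, then emit PMID CURIEs."""
    ids = set()
    for line in lines:
        first = line.split(",", 1)[0].strip()
        if first.upper().startswith("PMID:"):
            first = first[5:]
        if first.isdigit():  # skips headers, blanks, and non-numeric identifiers
            ids.add(first)
    return [f"PMID:{i}" for i in sorted(ids, key=int)]
```

Unlike the shell version, rows whose first field is not numeric are dropped rather than passed through, which may or may not be desirable for the real file.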

mugitty commented 8 months ago

@kltm, Do you want to add it to metadata somewhere, for now, and I can work off of it

Thanks

kltm commented 8 months ago

From a conversation with @mugitty , I wanted to clarify the current state.

pgaudet commented 7 months ago

@mugitty The europe-pmc-retracted.txt file is here:

~https://github.com/geneontology/go-site/blob/master/docs/europe-pmc-retracted.txt~

Moved to a better location: https://github.com/geneontology/go-site/blob/master/metadata/retracted-publications.txt

Updated today. I made a note in my calendar to update it monthly.

Thanks, Pascale

kltm commented 7 months ago

@pgaudet Maybe next week we can work out my third point here? https://github.com/geneontology/go-site/issues/676#issuecomment-2033029334

pgaudet commented 7 months ago

@kltm sure

But for now -

I did not add the metadata to the repo as I'm a little concerned that the europepmc file and the data I extracted with eutils are quite different.

@mugitty and I figured that if the europepmc file has most of the content then this is better than having no check at all.

If you have the complete file from eutils I can compare them; we suspect there are some synchronization issues?
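The comparison itself is a pair of set differences. A small sketch, assuming both sources have already been normalized to the same identifier form; the function and key names are hypothetical.

```python
def compare_sources(europepmc: set[str], eutils: set[str]) -> dict[str, set[str]]:
    """Split two identifier sets into shared entries and entries unique to each source."""
    return {
        "shared": europepmc & eutils,
        "only_europepmc": europepmc - eutils,
        "only_eutils": eutils - europepmc,
    }
```

If the unique entries cluster among recent PMIDs, that would be consistent with a synchronization lag between the two sources rather than a difference in what each counts as retracted.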

mugitty commented 6 months ago

@kltm, @pgaudet, currently the retracted publications file is in docs. Do you want to move into metadata? I understand the contents and format will change.

kltm commented 6 months ago

@pgaudet The file is at https://github.com/geneontology/go-site/issues/676#issuecomment-2027818161

If you're putting this in, the filename should be generic, like "retracted-publications.txt" or the like. As well, please add a note to the README.md in that directory.

pgaudet commented 6 months ago

@mugitty will check status of this one.

pgaudet commented 4 months ago

Working on snapshot:

gorule-0000022

Check for, and filter, annotations made to retracted publications