geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Add PMID submission to NCBI Linkouts to pipeline #45

Open kltm opened 6 years ago

kltm commented 6 years ago

Add PMID submission to NCBI to pipeline for their linkback to us. Check with @cmungall and @thomaspd to get the FTP location, creds, etc.

kltm commented 6 years ago

Conditional on release.

kltm commented 6 years ago

Talking to @cmungall: rescope the first step as producing the file for them to consume, a two-column TSV with the PMID and the AmiGO publication search page.
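A minimal sketch of what producing that two-column file could look like. The function name `pmids_to_tsv` is hypothetical, and the URL pattern (the AmiGO reference page form that appears elsewhere in this thread) is an assumption rather than an agreed format.

```shell
# Hypothetical helper: read "PMID:<digits>" lines on stdin and write a
# two-column TSV (PMID, AmiGO reference page URL) to stdout.
# The URL pattern is an assumption, not a confirmed spec.
pmids_to_tsv() {
    awk '/^PMID:[0-9]+$/ {
        printf "%s\thttp://amigo.geneontology.org/amigo/reference/%s\n", $0, $0
    }'
}
```

Lines that are not exactly `PMID:` followed by digits are silently dropped, so malformed references never reach the output.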

kltm commented 6 years ago

Well, currently, as we're linking directly back to AmiGO, which is being loaded by GAFs, we really just want all PMIDs, like:

$ reset && time zcat *.gaf.gz | cut -f 6 | sort | grep -o -P 'PMID\:\d+' | sort | uniq > /tmp/count.txt && wc -l /tmp/count.txt
real    35m27.405s
user    42m2.132s
sys 1m49.340s
151919 /tmp/count.txt

...that has got to be faster with a perl script or something...
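One way the one-liner above might be sped up without leaving the shell: it sorts every column 6 value before filtering, which is likely where much of the time goes. The sketch below filters first and sorts once in the C locale. It has not been timed against the real GAFs, and the function name `extract_pmids` is hypothetical.

```shell
# Sketch: read uncompressed GAF text on stdin and print the unique
# "PMID:<digits>" values found in column 6 (the same column cut above).
extract_pmids() {
    # Skip "!" comment lines, keep only column 6, pull out PMID tokens,
    # then sort and de-duplicate once using byte-order comparison.
    awk -F '\t' '!/^!/ { print $6 }' \
        | grep -oE 'PMID:[0-9]+' \
        | LC_ALL=C sort -u
}

# Intended use (not run here):
#   zcat *.gaf.gz | extract_pmids > /tmp/count.txt
```

Skipping comment lines in `awk` also avoids picking up a PMID that is only mentioned in a file header.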

pgaudet commented 5 years ago

@kltm Needs clarification from @thomaspd

pgaudet commented 5 years ago

@dustine32 has been working on this. @dustine32, can you please update us on the status?

Thanks, Pascale

dustine32 commented 5 years ago

@pgaudet I have a perl script (written by @xiaosonghuang) plugged into the monthly Panther/PAINT update pipeline. This script spits out all PMID links to AmiGO and takes about 45 min to run (I think mainly because it also processes the humongous goa_uniprot_all.gaf).

The script to upload the links to PubMed's FTP server is also in the pipeline, though it may not be committed to github yet. I'll need to check.

dustine32 commented 5 years ago

@kltm I'm totally open to putting this into the main GO pipeline where it seems more appropriate.

dustine32 commented 5 years ago

This latest commit https://github.com/pantherdb/fullgo_paint_update/commit/e2307aed0a4ef754461758e35c50a51309d0e167 shows the required credential fields for NCBI FTP. I can send around our creds "super-secretly" if you guys don't have them.

dustine32 commented 5 years ago

@kltm Here's a link to the file submitted to PubMed the other day:

ftp://ftp.pantherdb.org/linkouts/2019-06-28/gaf2pmid_results

The FTP script I used to do this is here.

kltm commented 5 years ago

@dustine32 Great--thank you! That actually clears up a lot of questions I had about how this all works.

kltm commented 5 years ago

@dustine32 I think the one remaining question I would have is whether there is a formal spec for the expected upload, or was it some form of backchannel email-y thing?

dustine32 commented 5 years ago

@kltm Oh, I guess this would be the spec. @cmungall actually found this a few months back.

dustine32 commented 5 years ago

@kltm Oh cute. I forgot to include the link to the PubMed spec. I'll check to see if my submission conforms to this:

https://www.ncbi.nlm.nih.gov/projects/linkout/doc/nonbiblinkout.html

dustine32 commented 5 years ago

Distilling what I think are the useful bits here:

We need at least two files in the FTP holdings/ folder: an identity file (providerinfo.xml) and an XML resource file, which can be named whatever (currently is GO_holdings.xml). The resource file lists the PubMed "objects" that the linkouts are tied to, in our case PMIDs.

Currently, both the identity file and the resource file are set up in our FTP folder, with the resource file pointing to a third file, GO.uid, that's a simple text list of PMIDs (no PMID: prefix). Ex:

11728716
20713601
17470538
25538239
23580649

The GO.uid file is referenced from GO_holdings.xml with these lines:

<ObjectList>
    <FileName fieldname="uid">GO.uid</FileName>
</ObjectList>

Together, the two resource files will create the full linkout URL (ex: http://amigo.geneontology.org/amigo/reference/PMID:11728716).

So on a month-to-month (release) basis, the only file that needs to be updated is GO.uid with the list of unique PMIDs from release GAFs.
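The monthly GO.uid refresh described above could be sketched as follows, assuming the input is a list of `PMID:<digits>` lines already extracted from the release GAFs. The function name `make_uid_list` is hypothetical.

```shell
# Sketch: turn "PMID:<digits>" lines (stdin) into GO.uid content:
# bare numeric IDs, one per line, de-duplicated.
make_uid_list() {
    # Keep only well-formed lines and strip the "PMID:" prefix,
    # then sort and de-duplicate in byte order.
    sed -n 's/^PMID:\([0-9][0-9]*\)$/\1/p' | LC_ALL=C sort -u
}

# Intended use (not run here):
#   make_uid_list < pmids.txt > GO.uid
```

The output order is lexical rather than numeric, which should not matter for a plain list of IDs.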

kltm commented 5 years ago

@dustine32 That's great documentation, thank you. It's too bad they are doing a push instead of pull architecture.

kltm commented 5 years ago

@dustine32 I guess my last question is how did you initially get your credentials? Also, is there anybody we could tag to confirm your documentation above and give notice of the (eventual) transfer of this duty from PANTHER to GO Central?

kltm commented 5 years ago

@dustine32 I got your information as part of the email.

lpalbou commented 4 years ago

@dustine32 @kltm the file is now produced by the pipeline: http://current.geneontology.org/release_stats/GO.uid

Seems like @dustine32 could add a step to the pipeline to submit this file to NCBI?

We may want to resolve this issue first: https://github.com/geneontology/pipeline/issues/202. Dustin, were you dismissing those false PMIDs? I didn't filter with a regular expression that checks that only digits follow the ":". If not, we may as well submit this file even if the issue is not fixed first.

Tag @pgaudet to keep in the loop and for opinion on how to proceed.

dustine32 commented 4 years ago

@lpalbou Yeah, looks like the original perl script had a "numbers-only" regex:

push (@pmids,($ref =~ /PMID:(\d+)/g));

So PMID:workshop wouldn't have been output in the GO.uid file.
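For comparison, a shell sketch of a slightly stricter filter. The Perl regex above drops PMID:workshop, but it would still capture 123 from a malformed token such as PMID:123abc. The version below splits the pipe-separated reference field and keeps only tokens that match in full. The function name `strict_pmids` is hypothetical.

```shell
# Sketch: read reference fields (GAF column 6 values) on stdin, one per
# line, and print only tokens that are exactly "PMID:" followed by digits.
strict_pmids() {
    # Multiple references in one field are separated by "|";
    # grep -x requires the whole token to match.
    tr '|' '\n' | grep -xE 'PMID:[0-9]+'
}
```

Whether partially numeric tokens should be rejected or salvaged is a curation decision; this only shows that the stricter choice is cheap to implement.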

As for the "upload to PubMed" step, I can work on adding that into the pipeline basing it off my current script. A few requirements for this:

  1. ftp command on jenkins/docker/wherever this will go.
  2. The FTP account username/password should probably be stored somewhere sensible. I'm guessing locally on the machine in a config file?
  3. How will the trigger to actually push this to PubMed work? Guessing we can hook it up with the official "Approve" button that @pgaudet gets to push?

lpalbou commented 4 years ago

OK, I will also filter out any PMID: entries that are not followed by digits. It is still something to look at, as they still appear on AmiGO, which is public facing.

For 3., I would suggest this goes after, or in parallel with, the Zenodo upload.

kltm commented 4 years ago

Let's see, for:

  1. no problem, that should be there or easily had
  2. The credentials would be held in Jenkins' secrets
  3. It would go after Pascale's approval--somewhere in deploy or separately after. As @lpalbou mentions, it's probably worth thinking about what we mean by "release" here as far as where we put it in the end. If it's linking back to AmiGO, it may be best to defer it pretty late until we know AmiGO is updated (or the reverse depending on what type of problems we'd expect--but that might be overthinking it).

lpalbou commented 4 years ago

Also, hopefully stating the obvious, but the upload to NCBI should only be done on the release branch and not on the master/snapshot branches.
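A rough sketch of how the release-only guard and the credential handling discussed above might fit together. Every variable name here is a hypothetical placeholder: `BRANCH_NAME` stands for whatever the build exposes as the current branch, and `NCBI_FTP_HOST`, `NCBI_FTP_USER`, and `NCBI_FTP_PASS` stand for values injected from the secrets store. The `holdings/` target folder is taken from the description earlier in this thread, and `curl` is used in place of the `ftp` command purely for illustration.

```shell
# Sketch only; all variable names are hypothetical placeholders.
is_release_branch() {
    [ "$1" = "release" ]
}

upload_uid() {
    uid_file=$1
    # Never push from master/snapshot builds.
    if ! is_release_branch "${BRANCH_NAME:-}"; then
        echo "skip: not on release branch"
        return 0
    fi
    # curl -T uploads a local file; a URL ending in "/" keeps the local
    # file name. --user supplies the FTP login from the environment.
    curl --silent --show-error \
        --user "${NCBI_FTP_USER}:${NCBI_FTP_PASS}" \
        -T "$uid_file" "ftp://${NCBI_FTP_HOST}/holdings/"
}

# Intended use (not run here):
#   upload_uid GO.uid
```

Keeping the branch check inside the upload function means the step can sit anywhere after approval without relying on the surrounding stage to enforce it.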

pgaudet commented 4 years ago

Hi @lpalbou

Can we edit the filename so it's clearer what it is? For example, go-pmid.uid?

Thanks, Pascale

lpalbou commented 4 years ago

Good question. I don’t think it’s possible without discussing it further with NCBI, since this is the file/name they have been receiving and using for a long time. @dustine32 do you have any contact there?

dustine32 commented 4 years ago

@lpalbou I did have a contact, searching through my email to find them. @kltm You might have it in a Github email from 7/9/19.

Also, we have some control over how that file's named in the GO_holdings.xml file that also lives in our NCBI FTP folder:

  <ObjectSelector>
    <Database>PubMed</Database>
    <ObjectList>
      <FileName fieldname="uid">GO.uid</FileName>
    </ObjectList>
  </ObjectSelector>

In all, there are these three files in that FTP folder:

GO.uid
GO_holdings.xml
providerinfo.xml

Only GO.uid (or whatever we call it) should change regularly.
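If the file is ever renamed, a small consistency check along these lines could confirm that the name in GO_holdings.xml matches what gets uploaded. This is a sketch that assumes the `FileName` element sits on a single line, as in the snippet above. The function name `uid_name_from_holdings` is hypothetical.

```shell
# Sketch: print the uid file name(s) that a holdings XML (stdin) points to.
# Assumes <FileName fieldname="uid">...</FileName> is on one line.
uid_name_from_holdings() {
    sed -n 's/.*<FileName fieldname="uid">\([^<]*\)<\/FileName>.*/\1/p'
}

# Intended use (not run here):
#   uid_name_from_holdings < GO_holdings.xml
```

A proper XML parser would be more robust; the `sed` pattern is only meant to show the idea.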

lpalbou commented 4 years ago

Hmm, so assuming they do use GO_holdings.xml properly, we could use a different filename. That is still something to double-check with them. Could you?