kltm closed 5 years ago
The most likely cause is an out-of-date version of PANTHER being used by the loader (we may be stuck in an earlier version).
Why shouldn't this just be a pipeline step, prepared on disk in a preserved directory, made available to the docker loader image?
In talks with @dustine32 on gitter.
theeeeere it is: /home/sjcarbon/local/src/svn/geneontology.org/trunk/experimental/trees/fix-panther.pl
Questions for PANTHER, via @dustine32:
.arbre files in PANTHER releases that no longer seem to be available? If true, at least since before v10.
wget seems to work for releases "10" and "13", but nothing else, including the current "13.1". Is this correct?
I'm looking at what is currently in SVN (which we will be ignoring moving forward).
The main (only?) difference between the .tree and .arbre files seems to be the inclusion of a family name as a "header". In the case of PTHR10003:
NAME=SUPEROXIDE DISMUTASE [CU-ZN]-RELATED
is the first line.
I now suspect (as does @cmungall) that this file is somehow a product of a process at our end historically, but that may not be true; still looking.
Also, as far as the SVN materials go, the files seem to have been renamed, so that things like tree.tree are PTHR10003.tree.
A hint in email from 2013-03-04:
> I'm almost done working through the changes necessary for the new trees.
> Just to verify, your custom data is in *.arbre, right?
Right. customized is in *.arbre
@dustine32 Okay, I think I know where we're at. We'd like to ask for a couple of things:
wget -q -S -N -r -nH -np --cut-dirs=2 -A attr.tab,tree.tree,tree.mia,cluster.wts -R cluster.fasta,cluster.ortholog,"hmm.*",cluster.pir,tree.sfan,index.html -X "/PANTHER10.0/books/*/SF*" -P panther data.pantherdb.org/PAINT_PANTHER10.0/books/
)
.arbre files appear to be (renamed?) .tree files with the family name as the header (https://github.com/geneontology/amigo/issues/532#issuecomment-414864117).
Looks like we store the .arbre header name values in classification.name where accession = 'PTHR#####'. Should be easy to retrieve.
@kltm Oh I didn't think about this until just now: these tree files (.arbre, .tree, etc.) would only need to be downloaded whenever we update the library, which is about once a year. These files wouldn't be something that changes every month in response to the GO ontology or GAFs.
I still think we should investigate a cloud-based file server of some kind so downloads aren't deathly slow. I can chat with @lpalbou about options when he gets here (today or tomorrow?).
@dustine32 A note about that: I think it would be better to act as if you were on the same release schedule as us (or faster); having your data made available on a (fast :) server will make things generally much smoother. From experience (and looking at years and years of notes yesterday), there are always issues with treating things like a special case. That reasoning, for example, is what led us to commit PANTHER releases into SVN at need, changing data sources, etc. As is evident from this comment thread, the history here is long, complicated, and hard to piece together (way) after the fact. We'd like to make sure that doesn't happen anymore ;) As an alternative, we could also model this after the way we are going to do PAINT GAF production.
Note: http://data.pantherdb.org/PANTHER13.1/books/PTHR10000/
What we've talked about with @mugitty and @dustine32 is that PANTHER will add a TBD file to the current structure that will contain the additional metadata (e.g. book -> family name mapping) that we need. We'll recreate the .arbre files at our end, and move forward.
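A minimal sketch of that recreation step, assuming (as observed earlier in this thread for PTHR10003) that an .arbre file is just the .tree contents with a `NAME=<family name>` line prepended. The function name `make_arbre` is hypothetical, and any other differences between the two formats are not handled here.

```python
def make_arbre(tree_text: str, family_name: str) -> str:
    """Return .arbre-style text: a NAME= header line followed by the tree text.

    Assumption: the header is the only difference between .tree and .arbre.
    """
    header = f"NAME={family_name}"
    return header + "\n" + tree_text


if __name__ == "__main__":
    # The tree body below is a placeholder, not real PANTHER tree data.
    example = make_arbre("placeholder tree body\n",
                         "SUPEROXIDE DISMUTASE [CU-ZN]-RELATED")
    print(example.splitlines()[0])
```

The family name would come from whatever lookup PANTHER ends up providing; here it is simply passed in as a string.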
As well, we'll add the PANTHER version as a pipeline variable.
Great !
Oh hey! Guess what's already sitting here: http://data.pantherdb.org/PANTHER13.1/globals/names.tab
$ head names.tab
PTHR10000.mag.mod PHOSPHOSERINE PHOSPHATASE
PTHR10000.SF23.mod SUBFAMILY NOT NAMED
PTHR10000.SF49.mod SUGAR PHOSPHATASE YIDA
PTHR10000.SF47.mod SUBFAMILY NOT NAMED
PTHR10000.SF8.mod PYRIDOXAL PHOSPHATE PHOSPHATASE YBHA
PTHR10000.SF50.mod SUBFAMILY NOT NAMED
PTHR10000.SF25.mod SUBFAMILY NOT NAMED
PTHR10003.mag.mod SUPEROXIDE DISMUTASE CU-ZN -RELATED
PTHR10003.SF37.mod SUBFAMILY NOT NAMED
PTHR10003.SF62.mod SUPEROXIDE DISMUTASE [CU-ZN] 4-RELATED
This is very close to the lookup file @mugitty prepared, you'd just need to filter for *.mag.mod lines and strip off the .mag.mod. @kltm Would you be fine handling this cleanup step on your end or do you want a separate, clean version deposited?
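For reference, the cleanup step could look roughly like the sketch below. It assumes each line of names.tab is an identifier, then whitespace (presumably a tab), then the name, as in the `head` output above; the function name `family_names` is hypothetical.

```python
def family_names(lines):
    """Map family accession -> family name, using only the *.mag.mod rows."""
    suffix = ".mag.mod"
    names = {}
    for line in lines:
        # Split into identifier and name on the first run of whitespace.
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, name = parts
        if key.endswith(suffix):
            # Strip the .mag.mod suffix to get the bare accession.
            names[key[: -len(suffix)]] = name.strip()
    return names


if __name__ == "__main__":
    sample = [
        "PTHR10000.mag.mod\tPHOSPHOSERINE PHOSPHATASE\n",
        "PTHR10000.SF23.mod\tSUBFAMILY NOT NAMED\n",
        "PTHR10003.mag.mod\tSUPEROXIDE DISMUTASE CU-ZN -RELATED\n",
    ]
    print(family_names(sample))
```

Subfamily rows (the `.SF##.mod` lines) are dropped, so the result has one entry per book.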
@kltm This is more serious than I thought because the PTN -> PTHR mappings are from an old version. For example: PANTHER:PTN001117529 is mapped to PTHR22936, but in PANTHER v13 it is PTHR45965.
(this example will help to test when this has been done).
Thanks, Pascale
Note that @dustine32 has now provided http://data.pantherdb.org/PANTHER13.1/globals/tree_files.tar.gz and http://data.pantherdb.org/PANTHER13.1/globals/names.tab. Assuming these are long-term stable, this should be sufficient to replace what we have with the pipeline.
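With the PANTHER version as a pipeline variable, the two download locations could be derived rather than hard-coded. This is only a sketch: it assumes the `PANTHER<version>/globals/` layout seen for 13.1 holds for other releases, which has not been confirmed, and `panther_urls` is a hypothetical helper name.

```python
BASE = "http://data.pantherdb.org"


def panther_urls(version: str) -> dict:
    """Build the tree archive and names file URLs for a PANTHER version.

    Assumption: every release follows the same layout as 13.1.
    """
    root = f"{BASE}/PANTHER{version}/globals"
    return {
        "trees": f"{root}/tree_files.tar.gz",
        "names": f"{root}/names.tab",
    }


if __name__ == "__main__":
    for label, url in panther_urls("13.1").items():
        print(label, url)
```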
@pgaudet to check at next AmiGO release
Next steps are to:
testing
Local test on local AmiGO seems to indicate that it worked out okay.
In final load, no PANTHER families are apparent. Elevating and investigating.
Possibly an issue around wget availability in the image.
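One way to catch this earlier would be a preflight check that fails loudly when a required tool is absent, instead of silently loading no families. A minimal sketch using the standard library's `shutil.which`; the helper name `missing_tools` and the idea of running it at the start of the load are assumptions, not something the loader does today.

```python
import shutil
import sys


def missing_tools(tools=("wget",)):
    """Return the subset of required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        # Exit non-zero so the pipeline step is marked as failed.
        sys.exit("missing required tools: " + ", ".join(missing))
    print("all required tools found")
```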
Fixes finally out in release.
Some PANTHER families are missing from the public interface. To duplicate: go to http://amigo.geneontology.org/amigo/term/GO:0006171 and filter for evidence code = IBA -- some IBAs do not have associated families.
Originating notes https://github.com/geneontology/amigo/issues/498
tagging @pgaudet and @cmungall