Some PANTHER families missing in AmiGO / GOlr

kltm commented 6 years ago

Some PANTHER families are missing from the public interface. To reproduce: go to http://amigo.geneontology.org/amigo/term/GO:0006171 and filter for evidence code = IBA; some IBAs do not have associated families.

Originating notes https://github.com/geneontology/amigo/issues/498

tagging @pgaudet and @cmungall

kltm commented 6 years ago

The most likely cause is an out-of-date version of PANTHER being used by the loader (we may be stuck in an earlier version).

kltm commented 6 years ago

Note: https://github.com/geneontology/panther-release-wrangling

kltm commented 6 years ago

Why shouldn't this just be a pipeline step, prepared on disk in a preserved directory, made available to the docker loader image?

kltm commented 6 years ago

In talks with @dustine32 on gitter.

kltm commented 6 years ago

theeeeere it is: /home/sjcarbon/local/src/svn/geneontology.org/trunk/experimental/trees/fix-panther.pl

kltm commented 6 years ago

Questions for PANTHER, via @dustine32:

- It seems that there used to be some .arbre files in PANTHER releases that are no longer available? If true, this has been the case since at least before v10.
- https://github.com/geneontology/panther-release-wrangling seems to be out of date, and I suspect that only 1 and 5 are even slightly relevant. Is this true?
  - Given this, what is the best current way of accessing the files that follows the given wget pattern as closely as possible?
  - The wget seems to work for releases "10" and "13", but nothing else, including the current "13.1". Is this correct?

kltm commented 6 years ago

Note: https://github.com/geneontology/go-site/issues/112
Note: https://github.com/geneontology/go-site/issues/187

kltm commented 6 years ago

I'm looking at what is currently in SVN (which we will be ignoring moving forward). The main (only?) difference between the .tree and .arbre files seems to be the inclusion of a family name as a "header". In the case of PTHR10003:

NAME=SUPEROXIDE DISMUTASE [CU-ZN]-RELATED

is the first line.

I now suspect (as does @cmungall) that this file is somehow a product of a historical process at our end, but that may not be true; still looking. Also, as far as the SVN materials go, the files seem to have been renamed, so that things like tree.tree are PTHR10003.tree.
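If the header really is the only difference, rebuilding an .arbre file from a .tree file could be as small as the sketch below. This is an assumption based on the PTHR10003 example above, not a confirmed description of how the files were originally produced; the function names are hypothetical and the tree text is a placeholder, not real PANTHER data.

```python
def tree_to_arbre(tree_text: str, family_name: str) -> str:
    """Prepend a NAME=<family name> header line to the contents of a .tree file.

    Assumes the header is the only difference between .tree and .arbre.
    """
    return f"NAME={family_name}\n{tree_text}"


def arbre_filename(family_id: str) -> str:
    """Name the output after the family (hypothetical convention), e.g. PTHR10003.arbre."""
    return f"{family_id}.arbre"


# Placeholder tree text; a real tree.tree file would go here.
placeholder_tree = "(A,B);\n"
arbre_text = tree_to_arbre(placeholder_tree, "SUPEROXIDE DISMUTASE [CU-ZN]-RELATED")
print(arbre_filename("PTHR10003"))
print(arbre_text.splitlines()[0])
```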

kltm commented 6 years ago

A hint in email from 2013-03-04:

> I'm almost done working through the changes necessary for the new trees.
> Just to verify, your custom data is in *.arbre, right?

Right. customized is in *.arbre

kltm commented 6 years ago

@dustine32 Okay, I think I know where we're at. We'd like to ask for a couple of things:

- data.pantherdb.org seems really slow: the following wget command had not completed after an hour on a fast machine on a fast network: wget -q -S -N -r -nH -np --cut-dirs=2 -A attr.tab,tree.tree,tree.mia,cluster.wts -R cluster.fasta,cluster.ortholog,"hmm.*",cluster.pir,tree.sfan,index.html -X "/PANTHER10.0/books/*/SF*" -P panther data.pantherdb.org/PAINT_PANTHER10.0/books/
  - Could we help you get the files into a bundle and/or onto S3/CloudFront for fast access?
- The mysterious .arbre files appear to be (renamed?) .tree files with the family name as the header (https://github.com/geneontology/amigo/issues/532#issuecomment-414864117).
  - Could we ask that you either replicate these files or supply a simple step to replicate them?

kltm commented 6 years ago

.arbre examples:
https://github.com/owlcollab/owltools/blob/master/OWLTools-Solr/src/test/resources/PTHR10000.arbre
https://github.com/owlcollab/owltools/tree/master/OWLTools-Solr/src/test/resources/panther_data

dustine32 commented 6 years ago

Looks like we store the .arbre header name values in classification.name where accession = 'PTHR#####'. Should be easy to retrieve.

dustine32 commented 6 years ago

@kltm Oh I didn't think about this until just now: these tree files (.arbre, .tree, etc.) would only need to be downloaded whenever we update the library, which is about once a year. These files wouldn't be something that changes every month in response to the GO ontology or GAFs.

I still think we should investigate a cloud-based file server of some kind so downloads aren't deathly slow. I can chat with @lpalbou about options when he gets here (today or tomorrow?).

kltm commented 6 years ago

@dustine32 A note about that: I think it would be better to act as if you were on the same release schedule as us (or faster); having your data made available on a (fast :) server will make things generally much smoother. From experience (and looking at years and years of notes yesterday), there is always an issue with treating things like a special case. That reasoning, for example, is what led us to commit PANTHER releases into SVN at need, changing data sources, etc. As is evident from this comment thread, the history here is long, complicated, and hard to piece together (way) after the fact. We'd like to make sure that doesn't happen anymore ;) As an alternative, we could also model this after the way we are going to do PAINT GAF production.

kltm commented 6 years ago

Note: http://data.pantherdb.org/PANTHER13.1/books/PTHR10000/

What we've talked about with @mugitty and @dustine32 is that PANTHER will add a TBD file to the current structure that will contain the additional metadata (e.g. book -> family name mapping) that we need. We'll recreate the .arbre files at our end, and move forward.

As well, we'll add the PANTHER version as a pipeline variable.
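A rough sketch of what treating the version as a variable could look like, using the book URL pattern from the note above. The environment variable name `PANTHER_VERSION` and the function name are hypothetical, and the pattern is only known from this thread to hold for 13.1.

```python
import os

BASE = "http://data.pantherdb.org"


def book_url(version: str, family_id: str) -> str:
    """Build the book directory URL for one family in a given PANTHER release."""
    return f"{BASE}/PANTHER{version}/books/{family_id}/"


# PANTHER_VERSION is a hypothetical pipeline variable; 13.1 is the release
# discussed in this thread and is used here only as a fallback.
version = os.environ.get("PANTHER_VERSION", "13.1")
print(book_url(version, "PTHR10000"))
```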

pgaudet commented 6 years ago

Great!

dustine32 commented 6 years ago

Oh hey! Guess what's already sitting here: http://data.pantherdb.org/PANTHER13.1/globals/names.tab

$ head names.tab
PTHR10000.mag.mod   PHOSPHOSERINE PHOSPHATASE
PTHR10000.SF23.mod  SUBFAMILY NOT NAMED
PTHR10000.SF49.mod  SUGAR PHOSPHATASE YIDA
PTHR10000.SF47.mod  SUBFAMILY NOT NAMED
PTHR10000.SF8.mod   PYRIDOXAL PHOSPHATE PHOSPHATASE YBHA
PTHR10000.SF50.mod  SUBFAMILY NOT NAMED
PTHR10000.SF25.mod  SUBFAMILY NOT NAMED
PTHR10003.mag.mod   SUPEROXIDE DISMUTASE  CU-ZN -RELATED
PTHR10003.SF37.mod  SUBFAMILY NOT NAMED
PTHR10003.SF62.mod  SUPEROXIDE DISMUTASE [CU-ZN] 4-RELATED

This is very close to the lookup file @mugitty prepared; you'd just need to filter for *.mag.mod lines and strip off the .mag.mod. @kltm Would you be fine handling this cleanup step on your end, or do you want a separate, clean version deposited?
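For reference, the cleanup step described here could look roughly like the following. It assumes each line is an identifier followed by whitespace (presumably a tab, given the .tab extension) and then the name; the function name is hypothetical and the sample lines are copied from the head output above.

```python
def family_names(lines):
    """Map family ID -> family name, keeping only family-level (*.mag.mod) rows."""
    suffix = ".mag.mod"
    names = {}
    for line in lines:
        # Split on the first run of whitespace so names containing spaces stay intact.
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, name = parts
        if key.endswith(suffix):
            names[key[: -len(suffix)]] = name.strip()
    return names


sample = [
    "PTHR10000.mag.mod\tPHOSPHOSERINE PHOSPHATASE\n",
    "PTHR10000.SF23.mod\tSUBFAMILY NOT NAMED\n",
    "PTHR10003.SF62.mod\tSUPEROXIDE DISMUTASE [CU-ZN] 4-RELATED\n",
]
print(family_names(sample))  # {'PTHR10000': 'PHOSPHOSERINE PHOSPHATASE'}
```

Subfamily rows (the `SF` entries) are dropped because only the family-level name is needed for the header.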

pgaudet commented 6 years ago

@kltm This is more serious than I thought because the PTN -> PTHR mappings are from an old version. For example, PANTHER:PTN001117529 is mapped to phr22936, but in PANTHER v13 it is PTHR45965.

(This example will help to test once this has been done.)

Thanks, Pascale

kltm commented 6 years ago

Note that @dustine32 has now provided:
http://data.pantherdb.org/PANTHER13.1/globals/tree_files.tar.gz
http://data.pantherdb.org/PANTHER13.1/globals/names.tab
Assuming these are long-term stable, this should be sufficient to replace what we have with the pipeline.
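A minimal sketch of how the two files might be combined, under stated assumptions: the internal layout of tree_files.tar.gz is not described in this thread, so the code only assumes that each tree file ends in `.tree` and has its family ID (`PTHR` plus digits) somewhere in its path. The function name is hypothetical, `names` is a family ID -> name mapping derived from names.tab, and the demo archive is built in memory with placeholder content.

```python
import io
import re
import tarfile

FAMILY_RE = re.compile(r"PTHR\d+")


def arbre_from_tarball(fileobj, names):
    """Return {"<family>.arbre": text} for every .tree member with a known name."""
    out = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".tree")):
                continue
            match = FAMILY_RE.search(member.name)
            if match is None or match.group(0) not in names:
                continue  # no family ID in the path, or no name for it
            family_id = match.group(0)
            tree_text = tar.extractfile(member).read().decode("utf-8")
            # Assumed .arbre format: NAME=<family name> header, then the tree.
            out[f"{family_id}.arbre"] = f"NAME={names[family_id]}\n{tree_text}"
    return out


# Demo with an in-memory archive; the member path and tree text are placeholders.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as demo:
    data = b"(A,B);\n"
    info = tarfile.TarInfo("tree_files/PTHR10003.tree")
    info.size = len(data)
    demo.addfile(info, io.BytesIO(data))
buf.seek(0)
print(arbre_from_tarball(buf, {"PTHR10003": "SUPEROXIDE DISMUTASE [CU-ZN]-RELATED"}))
```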

pgaudet commented 6 years ago

@pgaudet to check at next AmiGO release

kltm commented 5 years ago

Next steps are to:

- [x] get the .arbre files that we want to use onto skyhook (script)
- [x] update the docker image run-indexer.sh to grab and assemble data (instead of SVN)

kltm commented 5 years ago

testing

kltm commented 5 years ago

Local test on local AmiGO seems to indicate that it worked out okay.

kltm commented 5 years ago

In final load, no PANTHER families are apparent. Elevating and investigating.

kltm commented 5 years ago

Possibly an issue around wget availability in the image.
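One way to catch this kind of problem early would be a preflight check before the load starts. This is only a sketch: the function name is hypothetical, and the tool list contains just wget because that is the only dependency named here.

```python
import shutil


def missing_tools(tools):
    """Return the subset of required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(["wget"])
if missing:
    print("missing tools:", ", ".join(missing))
else:
    print("all required tools found")
```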

kltm commented 5 years ago

Fixes finally out in release.

geneontology / amigo

Some PANTHER families missing in AmiGO / GOlr #532