Open amcooksey opened 1 year ago
It looks like we will have to use the new version of InterProScan in order to use the updated databases. The json and xml outputs for the newer versions are substantially bigger (60M vs 1.8G for the json), but if we gzip those files we can keep the size down pretty well (1.8G -> 207M).
interproscan 5.45-80_3 (what we currently use)
json -- 60M
xml -- 9.3M
json.gz -- 7.3M
xml.gz -- 7M

interproscan 5.54-87 (we rejected it because the outputs were too big)
json -- 1.6G
xml -- 1.2G
json.gz -- 182M
xml.gz -- 150M

interproscan 5.63-95 (latest version)
json -- 1.8G
xml -- 1.3G
json.gz -- 207M
xml.gz -- 171M
So, long story short, the newer versions have much larger outputs but they compress well.
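For reference, the sizes reported above work out to roughly an 8x to 9x reduction for the newer versions. A minimal sketch of how the compression step could be scripted is below; it is not part of the current pipeline, and the function names `gzip_file` and `size_ratio` are made up for illustration.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: Path, level: int = 6) -> Path:
    """Write '<src>.gz' next to src and return the new path (src is left in place)."""
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb", compresslevel=level) as fout:
        # copyfileobj streams in chunks, so a multi-GB file is never held in memory
        shutil.copyfileobj(fin, fout)
    return dst


def size_ratio(src: Path, dst: Path) -> float:
    """How many times smaller dst is than src."""
    return src.stat().st_size / dst.stat().st_size
```

Something like `gzip_file(Path("proteins.json"))` (hypothetical file name) would produce `proteins.json.gz` alongside the original, and `size_ratio` gives a quick check that a new release still compresses as well as the numbers above.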
I'm inclined to remove the json and xml from the output that we provide. That said, do the gff3, tsv and gaf files that we produce convey the same information that the xml and json files do?
We can definitely remove the json. I think we can remove the xml too, and the other files will cover the same information. They will just be more difficult for someone to parse, but I'm not sure anyone is doing that. I will double-check with the InterPro people.
InterProScan support replied: The JSON and XML formats are the most comprehensive. For instance, InterProScan reports GO terms from two sources: InterPro and PANTHER. GO terms from both resources are reported in the JSON and XML formats, but only the InterPro GO terms are reported in the TSV/GFF3 format. We plan to address this but we are not sure when this will be done.
[We pull the GO from the XML into the GAF file so I think we can avoid this problem.]
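[A rough sketch of what pulling GO ids out of the XML can look like, in case it helps the discussion. This is not our actual GAF code. It assumes that each `protein` element carries an `xref` element with an `id` attribute and that GO terms appear as `go-xref` elements with an `id` attribute; it ignores the XML namespace so it does not depend on a particular schema URL. It collects every `go-xref` under a protein, so whether that includes both the InterPro and PANTHER terms should be checked against a real output file.]

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def go_terms_by_protein(xml_path):
    """Map each protein's first xref id to the set of GO ids found under it."""
    result = {}
    # iterparse avoids loading the whole (potentially >1G) file at once
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if local_name(elem.tag) != "protein":
            continue
        protein_id = None
        go_ids = set()
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "xref" and protein_id is None:
                protein_id = node.get("id")  # assumed attribute name
            elif name == "go-xref":
                go_ids.add(node.get("id"))  # assumed attribute name
        if protein_id is not None:
            result[protein_id] = go_ids
        elem.clear()  # release the children of the processed protein
    return result
```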
Another example is the version of resources used in InterProScan. The XML and JSON formats report which version of InterPro, Pfam, etc. were used, but not the GFF3 and TSV formats.
[Our readme file specifies which version of InterProScan we used, and that is associated with a specific set of analysis versions.]
Finally, if you want to keep the score or e-value of matches reported by InterProScan, only the XML and JSON formats include such information.
[Not sure how attached we are to the scores or e-values. Does anyone look at them?]
In a nutshell, we recommend using the XML and JSON formats, especially if you plan to keep results for a long time. But the TSV/GFF3 formats are also suitable in other cases, e.g. you are simply trying to check whether your sequence is annotated by a specific Pfam domain.
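If we want to see what score and e-value information we would be giving up before dropping the json, a quick look could be scripted along these lines. This is only a sketch: it assumes the values are stored under keys named `evalue` and `score`, which should be confirmed against a real output file, and it walks the whole structure generically rather than relying on a particular layout. It reads gzipped files directly, but `json.load` holds the entire document in memory, which matters for the 1.8G outputs.

```python
import gzip
import json


def iter_scored(node, path=()):
    """Yield (path, obj) for every JSON object that has an 'evalue' or 'score' key."""
    if isinstance(node, dict):
        if "evalue" in node or "score" in node:  # assumed key names
            yield path, node
        for key, value in node.items():
            yield from iter_scored(value, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_scored(item, path + (index,))


def collect_scores(json_path):
    """Return (evalue, score) pairs from a plain or gzipped JSON file."""
    opener = gzip.open if str(json_path).endswith(".gz") else open
    with opener(json_path, "rt") as handle:
        data = json.load(handle)  # loads the whole file into memory
    return [(hit.get("evalue"), hit.get("score")) for _, hit in iter_scored(data)]
```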
First thoughts on how to do updates:
- re-run functional annotation pipeline
- copy new functional annotation directory to analysis folder (preferably on apollo-stage, otherwise CERES)
- re-run final-workflow.cwl to generate genomic_annotated.gff (or possibly just the gff annotation portion; preferably on apollo-stage)
- remove NCBI ref track from apollo (there may be more steps necessary here)
- add new NCBI ref track (there may be more steps necessary here)
- push changes to apollo-prod, i5k-stage, i5k-prod
- re-run createsymlinks
- update tripal functional annotation page