Unmaintained annotation sets - potential issues

tonysawfordebi commented 6 years ago

I've been taking a look at the annotation sets that we currently import into the GOA database and that have been identified as being unmaintained, namely those from JCVI/TIGR and PAMGO_Mgrisea.

I've run them through our syntax checker to identify any issues that might prevent us from taking ownership of them. That has thrown up a few things that probably need some discussion, and some decisions about what action (remedial or otherwise) to take.

I've attached a bunch of files to this ticket, two for each of the annotation sets; one is the raw output of the syntax check, and the other is a slightly more detailed analysis of the problems encountered.

analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log syntax_check-TIGR.log

@pgaudet @ggeorghiou

cmungall commented 6 years ago

@dougli1sqrd will you share the results of our QC checks on these GAFs

ValWood commented 6 years ago

It would be useful for the QA group to know how many annotations from each source we are dealing with, and which species. Maybe we could mothball any annotation covered by an existing annotation from any source ? This might make them more manageable, and gradually they would be subsumed by annotations from maintained sources?

tonysawfordebi commented 6 years ago

The number of annotations in each set is listed in the summary section of each syntax_check-xxx.log file.

ggeorghiou commented 6 years ago

I've been having a look at these and it seems that the majority of the TIGR and JCVI annotations are ISS annotations that stem from HMM models. Would it be better to delete these since the parent annotation they are based off of may not be valid anymore?

ValWood commented 6 years ago

That's not a bad plan. Presumably any valid annotations will be covered by InterPro mappings. Is there an easy way to check that?

pgaudet commented 6 years ago

To add to Val's comment: we should also have quite a few PAINT IBAs as well by now.

Pascale

ggeorghiou commented 6 years ago

We could look at the annotations that map to valid UniProt identifiers and compare them to InterPro GO annotations that were generated for the same entry
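A minimal sketch of that comparison, assuming both sources have already been reduced to (UniProt accession, GO ID) pairs. The function name and the sample IDs are placeholders made up for illustration, and the match is exact, so it does not account for one term being a more specific descendant of another.

```python
from typing import Iterable, List, Tuple

Annotation = Tuple[str, str]  # (UniProt accession, GO ID)

def split_by_coverage(legacy: Iterable[Annotation],
                      interpro: Iterable[Annotation]) -> Tuple[List[Annotation], List[Annotation]]:
    """Partition legacy annotations into those InterPro already makes and those it does not."""
    interpro_set = set(interpro)
    covered, not_covered = [], []
    for ann in legacy:
        (covered if ann in interpro_set else not_covered).append(ann)
    return covered, not_covered

# Placeholder data, not taken from the real annotation sets.
legacy = [("ACC1", "GO:A"), ("ACC1", "GO:B"), ("ACC2", "GO:C")]
interpro = [("ACC1", "GO:A"), ("ACC2", "GO:D")]

covered, not_covered = split_by_coverage(legacy, interpro)
print(len(covered), "covered;", len(not_covered), "not covered")
```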

dougli1sqrd commented 6 years ago

@tonysawfordebi @cmungall I don't think we have these GAFs in our standard datasets. Would you be able to provide a download url for these GAF datasets? We do have some reporting and sparql QC checks (the GO rules) through our pipeline, but currently only 3 or 4 of them are implemented as sparql checks so far (this is ongoing). Would the reports of those three rules still be of value?

tonysawfordebi commented 6 years ago

@dougli1sqrd The JCVI & TIGR annotation sets are both contained in http://www.geneontology.org/gene-associations/gene_association.jcvi.gz and the PMGG set is in http://www.geneontology.org/gene-associations/gene_association.PAMGO_Mgrisea.gz

tonysawfordebi commented 6 years ago

I've added a few more checks to our syntax checker & re-run the analysis of the three annotation sets; the results are attached.

analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log syntax_check-TIGR.log

dougli1sqrd commented 6 years ago

Oh indeed, there they are, my mistake! I'll run these through our pipeline to get the rdf and run our GO Rules on them.

dougli1sqrd commented 6 years ago

I get no violations of the few GO Rules we have implemented in Sparql.

tonysawfordebi commented 6 years ago

Updated versions of the problem analysis logs:

analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log syntax_check-TIGR.log

tonysawfordebi commented 6 years ago

Yet another version of the log analysis reports, with enhanced reporting:

analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log

pgaudet commented 6 years ago

Hi @tonysawfordebi This is very useful, thanks !

Can you extract the number of valid/non valid annotations for each source ?

Thanks, Pascale

tonysawfordebi commented 6 years ago

@pgaudet Those numbers (if I'm interpreting your question correctly) are in the Summary section of the syntax_check-XXX.log files.

pgaudet commented 6 years ago

Great, thanks !

tonysawfordebi commented 6 years ago

I've started looking at those data cleansing operations that can be performed without having to apply any thought processes, and the first one is to replace secondary/obsolete GO terms with their replacements (if any).

This new set of logs shows the results of doing that: analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log syntax_check-TIGR.log
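As a rough illustration of that kind of replacement step (not the actual GOA code), assume the secondary and obsolete terms have been loaded into a dictionary that maps each old ID to its replacement. The function name and the IDs below are hypothetical.

```python
def replace_go_term(go_id, replaced_by):
    """Return the current GO ID for go_id, following replacement links.

    replaced_by maps a secondary or obsolete ID to its replacement.
    IDs that are absent from the map (including obsolete terms with no
    replacement) are returned unchanged.
    """
    seen = set()
    # Follow chains of replacements, guarding against accidental cycles.
    while go_id in replaced_by and go_id not in seen:
        seen.add(go_id)
        go_id = replaced_by[go_id]
    return go_id

# Placeholder IDs for illustration only.
replaced_by = {"GO:OLD1": "GO:NEW1", "GO:OLD2": "GO:OLD1"}
print(replace_go_term("GO:OLD2", replaced_by))
```

Terms that come back unchanged but are known to be obsolete would still need to be reported separately, since no replacement exists for them.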

pgaudet commented 6 years ago

Hi @tonysawfordebi Just wondering, why is this creating an error ?

Unsupported evidence code | [ECO:0000501 (IEA)] (In any case the annotation is outdated; we have this 1 year expiration date on IEAs - is this what the message means?)

Thanks again, Pascale

tonysawfordebi commented 6 years ago

Our syntax checker is actually the same code as our external annotation import parser, and that is configured to reject IEA annotations, as we don't import those from external groups.

tonysawfordebi commented 6 years ago

And speaking of parsers... I just discovered a bug in the piece of code that checks the validity of with/from, which led to a number of false positives. I've tweaked it, and here are the latest logs:

analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log syntax_check-TIGR.log

ggeorghiou commented 6 years ago

So looking at these, it seems that anything with GO_REF:0000011 as a reference could probably just be purged to save ourselves the headache, since those annotations are based on HMM models. In addition, at least in the TIGR set, I see IC annotations inferred from annotations made with GO_REF:0000011 and GO_REF:0000012. We can get rid of those as well, since they are based on non-experimental evidence, which violates the rules for making IC annotations.

pgaudet commented 6 years ago

@tonysawfordebi Another question: I don't see the annotations that are assigned by other groups; is this information in the last column of the log file? (I am probably looking at the wrong column)

Thanks, Pascale

pgaudet commented 6 years ago

@tonysawfordebi In fact I don't see the number of valid/non valid annotations in the summary: since annotations can have more than one error, adding up all the errors doesn't tell you how many pass all checks.

Can you please provide that ?

Thanks, Pascale

tonysawfordebi commented 6 years ago

@pgaudet The Summary section of the syntax_check log file contains a line labelled "Number of annotations assigned by other sources" - this is a count of the number of annotations in the GAF (or GPAD...) that have an assigned_by other than the one that we're interested in.

So, for example, in syntax_check-TIGR.log we have these three lines:

Total number of annotations: 162401
Number of annotations assigned by TIGR: 107255
Number of annotations assigned by other sources: 55146

which say that there are 55146 annotations in the GAF with something other than TIGR in the assigned_by column (and, in fact, in this case most of those will be assigned_by JCVI, as the TIGR and JCVI annotations are in the same file, gene_association.jcvi).

I've updated the parser to keep a count of the number of annotations in which no errors were logged, and that is output at the end of the Summary section of the syntax_check log. So, for example, in syntax_check-TIGR.log you'll now find this line:

Number of annotations with no errors: 59207

Here are updated versions of all the logs, created using the revised parser:

analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log syntax_check-TIGR.log
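For anyone wanting to reproduce the assigned_by counts independently, here is a minimal sketch. It assumes a standard tab-separated GAF in which Assigned_By is column 15 and comment lines start with "!"; the function name and the sample rows are made up, and this is not the GOA parser.

```python
from collections import Counter

ASSIGNED_BY = 14  # zero-based index of GAF column 15 (Assigned_By)

def count_assigned_by(lines, source):
    """Count annotations assigned by `source` versus all other sources."""
    counts = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= ASSIGNED_BY:
            counts["malformed"] += 1
            continue
        counts["total"] += 1
        counts["source" if cols[ASSIGNED_BY] == source else "other"] += 1
    return counts

def fake_row(assigned_by):
    # Placeholder row: 14 dummy columns followed by Assigned_By.
    return "\t".join(["x"] * 14 + [assigned_by]) + "\n"

sample = ["!gaf-version: 2.1\n", fake_row("TIGR"), fake_row("JCVI"), fake_row("TIGR")]
counts = count_assigned_by(sample, "TIGR")
print(counts["total"], counts["source"], counts["other"])
```

Applied to a file such as gene_association.jcvi, the "source" and "other" counts should add up to the total, as they do in the three summary lines quoted above (107255 + 55146 = 162401).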

tonysawfordebi commented 6 years ago

It occurred to me this morning that some of the errors that we're seeing - specifically those about invalid with/from components - are a direct result of the fact that the annotations are in GAF files (and so use classic GO evidence codes), while the rules governing the use of with/from are defined in terms of ECO terms.

So, what I've done now is hacked together a converter to transform GAF to GPAD which uses the rules defined in http://purl.obolibrary.org/obo/eco/gaf-eco-mapping.txt to convert GO evidence codes to equivalent ECO IDs.

In the TIGR/JCVI annotation sets there are a lot of annotations that have evidence code ISS and GO_REF:0000011, and these were previously being treated - and the with/from validated - as ISS annotations. However, using the rules in gaf-eco-mapping.txt, this combination of evidence and GO_REF is transformed in the GPAD file to ECO:0000255 (= ISM) rather than ECO:0000250 (= ISS), and this brings a different set of with/from rules into play.

Here's a new set of log files, which show the results of running the checker over the GPAD-ified version of the annotation sets:

syntax_check-TIGR.log analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log
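A small sketch of the evidence conversion described above, using only the code/ECO pairs mentioned in this ticket. The table here is hard-coded for illustration and is an assumption about how such a lookup could be organised; a real converter would read the full rule set from gaf-eco-mapping.txt.

```python
# Combinations of evidence code and reference that override the default.
ECO_BY_CODE_AND_REF = {
    ("ISS", "GO_REF:0000011"): "ECO:0000255",  # = ISM
}

# Default ECO ID for each GO evidence code (only the ones named in this ticket).
ECO_DEFAULT = {
    "ISS": "ECO:0000250",
    "ISA": "ECO:0000247",
    "IEA": "ECO:0000501",
}

def to_eco(evidence_code, reference_field):
    """Map a GAF evidence code plus its reference column to an ECO ID."""
    # The GAF reference column may hold several pipe-separated references.
    for ref in reference_field.split("|"):
        eco = ECO_BY_CODE_AND_REF.get((evidence_code, ref))
        if eco is not None:
            return eco
    return ECO_DEFAULT[evidence_code]  # raises KeyError for codes not listed here

print(to_eco("ISS", "GO_REF:0000011"))
print(to_eco("ISS", "PMID:0000000"))
```

The first call yields the ISM-equivalent ID and the second falls back to the ISS default, which is why the same GAF evidence code can end up under two different sets of with/from rules.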

tonysawfordebi commented 6 years ago

I've modified the parser again to allow the use of GO_REF:0000012 with ISA and GO_REF:0000011 with ISM, and I've re-run the checks. Here are the logs:

analysis-JCVI.log analysis-PMGG.log analysis-TIGR.log syntax_check-JCVI.log syntax_check-PMGG.log syntax_check-TIGR.log

ValWood commented 6 years ago

annotations to high level terms like "metabolic process" could just be dropped. Not very useful.....

tonysawfordebi commented 6 years ago

Looking at the analysis-XXX.log files, it seems that the remaining problems fall mainly into these categories:

unmapped identifiers: there's nothing we can do about this - we need to be able to map to a UniProt accession (or RNAcentral or ComplexPortal ID) in order to be able to import annotations
annotations to obsolete GO terms, for which no replacement is available: we should probably just drop these
annotations to GO terms that are flagged as do_not_(manually_)annotate: we should probably just drop these too
annotations with an invalid combination of evidence code and with/from (as governed by the rules in https://github.com/geneontology/go-site/blob/master/metadata/eco-usage-constraints.yaml)

This last category is the most interesting.

In the TIGR annotation set, out of a total of 107255 annotations attributed to TIGR, 27437 of them are failing the with/from checks.

Of these 27437 annotations, 14262 are ISS annotations with a Pfam identifier in their with/from, and 11085 are ISS annotations with a TIGR_TIGRFAMS identifier in their with/from. The type of the entities supplied by these two databases is not compatible - according to the rules in eco-usage-constraints.yaml - with the ISS evidence code (strictly, not compatible with ECO:0000250); however, it is compatible with ECO:0000247 (ISA).

So, my question is this: would it make sense to transform these ISS annotations to ISA? If we did that, we might well be able to increase significantly the number of annotations that we are able to import.
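To make the proposal concrete, here is a hedged sketch of how candidate annotations could be picked out. It assumes the with/from values carry the database prefixes "Pfam:" and "TIGR_TIGRFAMS:" and that the field uses the usual GAF separators ("|" and ","); the function name and the example identifiers are hypothetical.

```python
FAMILY_PREFIXES = ("Pfam:", "TIGR_TIGRFAMS:")  # assumed prefixes for the two databases

def is_isa_candidate(evidence_code, with_from):
    """True if an ISS annotation's with/from holds only family identifiers."""
    if evidence_code != "ISS" or not with_from:
        return False
    # Split on both GAF separators: "|" (or) and "," (and).
    ids = [i for part in with_from.split("|") for i in part.split(",")]
    return all(i.startswith(FAMILY_PREFIXES) for i in ids)

# Placeholder identifiers, for illustration only.
print(is_isa_candidate("ISS", "Pfam:PF00000"))
print(is_isa_candidate("ISS", "UniProtKB:X00000"))
```

Annotations flagged this way would then have their evidence rewritten to ISA (ECO:0000247), if that transformation is agreed.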

ValWood commented 6 years ago

My personal opinion is that long term it isn't practical to maintain these mappings as ISS/ISA (in fact, given that nobody looks at them, I would question whether they should be anything other than IEA). If they really are valid mappings based on protein families, they should be covered by more up-to-date InterPro2GO mappings.... so they will invariably be redundant anyways....

tonysawfordebi commented 6 years ago

I'd be quite happy with that approach. Maybe we should say that we'll only integrate annotations with a "good" evidence code (whatever that means - manual experimental, probably).

ValWood commented 6 years ago

It's a pity there isn't a tool which would indicate if any of these provided any unique annotation. These could then be kept, or replaced and the rest could be dumped...

tonysawfordebi commented 6 years ago

Well, that is a possibility - I could look at putting something together that compares what's in the files with what's in our database.

ValWood commented 6 years ago

Let's see what others in the QC group think. @sylvainpoux @pgaudet it might be nice to have some way to evaluate legacy annotation sets like this. What do you think?

ValWood commented 6 years ago

There are certainly useful annotations from these sets that are not covered. I came across one by chance yesterday:

http://www.uniprot.org/uniprot/O15182

annotated to "centrosome cycle" from ProtInc

tonysawfordebi commented 6 years ago

We are currently only concerned with the sets from JCVI/TIGR and PAMGO_Mgrisea. GOA already hosts and maintains the PINC annotation set, and has done since before I joined the EBI.

pgaudet commented 6 years ago

Small note:

Looking at the 'do_not_annotate' terms, we could map

GO:0001539 cilium or flagellum-dependent cell motility to GO:0071973 bacterial-type flagellum-dependent cell motility for all bacterial annotations
GO:0000910 cytokinesis -> GO:0043093 FtsZ-dependent cytokinesis

pgaudet commented 6 years ago

@tonysawfordebi @ggeorghiou discussed this. The procedure is documented here: http://wiki.geneontology.org/index.php/Procedure_for_unmaintained_annotations

We will implement this for the following sources:

JCVI
PMGG
TIGR

tonysawfordebi commented 6 years ago

This is the breakdown of the number of annotations per source per species that we are currently importing into the GOA database:

Source	Taxon ID	Species	No. of Annotations
JCVI	223283	Pseudomonas syringae pv. tomato (strain ATCC BAA-871 / DC3000)	6083
JCVI	220664	Pseudomonas fluorescens (strain ATCC BAA-477 / NRRL B-23932 / Pf-5)	4832
JCVI	243233	Methylococcus capsulatus (strain ATCC 33009 / NCIMB 11132 / Bath)	4028
JCVI	264730	Pseudomonas savastanoi pv. phaseolicola (strain 1448A / Race 6)	3895
JCVI	228405	Hyphomonas neptunium (strain ATCC 15444)	2615
JCVI	265669	Listeria monocytogenes serotype 4b (strain F2365)	405
JCVI	414	Methylococcus capsulatus	25
JCVI	323	Pseudomonas syringae pv. tomato	5
PMGG	242507	Magnaporthe oryzae (strain 70-15 / ATCC MYA-4617 / FGSC 8958)	14434
TIGR	243277	Vibrio cholerae serotype O1 (strain ATCC 39315 / El Tor Inaba N16961)	8431
TIGR	1392	Bacillus anthracis	8340
TIGR	167879	Colwellia psychrerythraea (strain 34H / ATCC BAA-681)	6354
TIGR	246200	Ruegeria pomeroyi (strain ATCC 700808 / DSM 15171 / DSS-3)	5137
TIGR	243231	Geobacter sulfurreducens (strain ATCC 51573 / DSM 12127 / PCA)	4196
TIGR	246194	Carboxydothermus hydrogenoformans (strain ATCC BAA-161 / DSM 6008 / Z-2901)	2881
TIGR	243164	Dehalococcoides mccartyi (strain ATCC BAA-2266 / KCTC 15142 / 195)	2155
TIGR	212042	Anaplasma phagocytophilum (strain HZ)	2018
TIGR	205920	Ehrlichia chaffeensis (strain ATCC CRL-10679 / Arkansas)	1580
TIGR	222891	Neorickettsia sennetsu (strain ATCC VR-367 / Miyayama)	1188
TIGR	211586	Shewanella oneidensis (strain MR-1)	1155
TIGR	227377	Coxiella burnetii (strain RSA 493 / Nine Mile phase I)	323
TIGR	195099	Campylobacter jejuni (strain RM1221)	114
TIGR	666	Vibrio cholerae	112
TIGR	777	Coxiella burnetii	16
TIGR	89184	Ruegeria pomeroyi	8
TIGR	127906	Vibrio cholerae O1	3
TIGR	35554	Geobacter sulfurreducens	2

cmungall commented 6 years ago

Apologies, just catching up with this ticket. My comments are a bit late to be relevant

@dougli1sqrd

I get no violations of the few GO Rules we have implemented in Sparql.

Yes but we also have reports that are generated as the submission files are ingested into the submission. These are documented here: http://wiki.geneontology.org/index.php/Release_Pipeline

We implement an overlapping set of rules with GOA. Let's get together with @tonysawfordebi at the meeting and work together on this.

tonysawfordebi commented 6 years ago

@pgaudet Regarding the possible transformation of GO terms that you mentioned up there ^^^, I've looked at the files and the only annotations they contain to GO:0001539 and GO:0000910 are IEAs, so they wouldn't be imported anyway.

tonysawfordebi commented 6 years ago

I've now integrated the JCVI, TIGR and PMGG annotation sets into our database, so they will be subject to the regular sanity checks that we run (with the reports being sent to go-quality@mailman.stanford.edu) and can be maintained using Protein2GO.

This is the final set of logs from the import process - exclusions-XXX.log lists those problems that were detected by our parser / syntax checker, while load-XXX.log lists any problems that were detected during the load into the database.

As you'll see, the JCVI and TIGR sets have a lot of annotations that were excluded during the load because they refer to deleted or secondary UniProt accessions, and this is due to the fact that we rely on what we find in http://www.geneontology.org/gp2protein/gp2protein.jcvi.gz to map the MOD identifiers to UniProt accessions, and that hasn't been updated since 2012 (if not earlier...). If we can find some more up-to-date mappings, then we might be able to integrate more of the excluded annotations, but I'm not holding my breath.

exclusions-JCVI.log exclusions-PMGG.log exclusions-TIGR.log load-JCVI.log load-PMGG.log load-TIGR.log

pgaudet commented 6 years ago

Discussion with QC group (@ggeorghiou @ValWood @sylvainpoux): we can rescue a few more annotations:

Reference not appropriate for evidence code:
GO_REF:0000011 Hidden Markov Models (TIGR) -> OK for ISS (or move evidence to ISM for all annotations with that ref)
GO_REF:0000012 Pairwise alignment (TIGR) -> OK for ISS (or move evidence to ISA for all annotations with that ref)
With/from contains one or more invalid components: Looks like these include TIGRFam and PFAM IDs: are these really forbidden ? @tonysawfordebi : do you know if these are also obtained by the InterPro IEA pipeline ?

Thanks, Pascale

vanaukenk commented 6 years ago

@tonysawfordebi It'd be great to know, if we decide to remove the ISS annotations for the genomes listed above, if/what information content we might lose. Would it be possible to compare existing ISS vs IEA or IBA to see what the overlap is? I'm happy to discuss in more detail, if need be. Thx.

ggeorghiou commented 6 years ago

Hi all,

Tony has had a look at what TIGR families are in the JCVI and TIGR annotation sets. We compared them to the ones that InterPro uses and found that only eight TIGR families are not used by InterPro:

TIGR00301 TIGR00650 TIGR01095 TIGR01199 TIGR01468 TIGR01576 TIGR02545 TIGR03386

There are only 222 annotations to these families. So looking at this from a practical standpoint, we should be fine getting rid of both TIGR and JCVI since almost all of it is covered by InterPro.

ValWood commented 6 years ago

This looks good, and almost certainly InterPro will have other families covering these where appropriate. It brings everything under standard consistency checks and ensures that annotations don't go stale going forward. I think it would be crazy not to do this....

pgaudet commented 6 years ago

I agree. The only thing is that Michelle Giglio said that these ISS annotations were done manually; would it be possible to find out what we'd be missing by replacing the ISS by IEAs (other than these 8 families), ie how different are the annotations ?

tonysawfordebi commented 6 years ago

Not terrifically easy - I'd essentially have to force-load the annotations somehow in order to be able to work out whether it's worth trying to load them... And then, once they're loaded the comparison of GO terms wouldn't be entirely straightforward, as manual annotations tend to use more specific terms than electronic ones.

Still, never say never...

ValWood commented 6 years ago

Oh right I misunderstood. I thought this was implying the mapping was the same as the ISS annotation. But it is only that the family is present in InterPro.

Tony there might be an easy way to do this... All we would need to check is if the GO ID associated with the TIGR family in GO annotation (ISS) is the same as the GO ID associated with the TIGR family in the InterPro mappings. Is that easier than loading annotations?
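A minimal sketch of that check, assuming the ISS annotations can be reduced to (family ID, GO ID) pairs taken from with/from, and that the InterPro mappings are available as a dictionary from family ID to a set of GO IDs. All names and IDs below are placeholders, and the comparison is by exact GO ID, so a more specific manual term would show up as unmatched.

```python
from collections import defaultdict

def unmatched_family_terms(iss_pairs, interpro_terms):
    """Return, per family, the GO IDs used in ISS annotations but absent from the InterPro mapping."""
    by_family = defaultdict(set)
    for family, go_id in iss_pairs:
        by_family[family].add(go_id)

    unmatched = {}
    for family, go_ids in by_family.items():
        missing = go_ids - interpro_terms.get(family, set())
        if missing:
            unmatched[family] = missing
    return unmatched

# Placeholder data for illustration only.
iss_pairs = [("FAM1", "GO:A"), ("FAM1", "GO:B"), ("FAM2", "GO:C")]
interpro_terms = {"FAM1": {"GO:A"}, "FAM2": {"GO:C"}}

print(unmatched_family_terms(iss_pairs, interpro_terms))
```

Families that come back empty are fully covered by the InterPro mapping; anything returned would need a closer look before the ISS annotations are dropped.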

geneontology / go-annotation

Unmaintained annotation sets - potential issues #1724