geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

ProtInc GO annotations to delete #3246

Open ValWood opened 4 years ago

ValWood commented 4 years ago

https://www.uniprot.org/uniprot/P61088

cellular protein modification process (this is just a pointless term/annotation, and we should add a "do not annotate" flag and hopefully one day even get rid of the term)

proteolysis (this pathway seems to be involved in signalling for repair and transcription, but not degradation?)

I would delete any other annotations to these two terms from ProtInc. (I am even wondering whether there is anything useful in these annotations worth keeping that isn't already present from elsewhere. How many ProtInc annotations are left?)

pgaudet commented 4 years ago

Deleted. This was the only annotation by PINC to 'cellular protein modification process'. I'll check proteolysis.

Thanks, Pascale

RLovering commented 4 years ago

Please note that the PINC annotations do sometimes provide links to useful papers to curate, although I agree the quality of these is variable, especially because the regulation terms did not exist when the PINC annotations were created. I guess ideally a matrix for these, identifying those annotations that overlap with similar annotations from other sources, might be useful, but it is not straightforward to do. Protein2GO annotations can be manipulated in a high-throughput way (at least via addition), so if a set of PINC annotations that are not required were identified, Alex might be able to help remove them.

pgaudet commented 4 years ago

PINC annotations to proteolysis and children are now here: https://docs.google.com/spreadsheets/d/1KhXuy3eyHqUlaor0AMsLQyggV9gKwFJyaG_5F4Z8ZI8/edit#gid=0

ValWood commented 4 years ago

How about asking Alex to remove all the redundant PINC annotations as a first step? We could do this with all datasets that are no longer maintained.

That would leave a smaller set to evaluate.

Matrix evaluation would be useful as the next step. However, it isn't possible to filter on data source. This would need resources to extend the Matrix tool (there are other requested additions that would make the tool more useful, but I don't know if these will be addressed). @cmungall

The remaining list will presumably refer to either incorrect annotations or useful publications.

ValWood commented 4 years ago

I glanced at the spreadsheet. Many are clearly proteolysis but will almost certainly have this annotation from other sources.

Some are possibly not proteolysis (NEDDYLATION).

It would be easier to deal with these if we could persuade @alexsign to perform the filtering step.

In fact, I have proposed many, many times that we should PERMANENTLY filter any TAS/NAS/IC out of GO if there is evidence from a supported source. I just don't know why we keep it... clear the clutter!

alexsign commented 4 years ago

@ValWood @pgaudet @RLovering I have no issue with filtering/removing annotations which do not provide any useful information. However, I would like to know the agreed, exact rules I would have to use.

ValWood commented 4 years ago

@pgaudet @RLovering should we include this on the GO meeting agenda? Do you see any value in keeping the data if there is already experimental data?

It would make checking ProtInc etc. much easier if the set was much reduced, but why not do it across the board? If an annotation has not been replaced by now with an experiment, it should be either i) an annotation priority or ii) a red flag for a potential error...

RLovering commented 4 years ago

The list of curated papers is sometimes helpful; if the paper does have expt data, it is nice to have more than 1 paper supporting an assertion. However, Protein2GO does have a field for the deleted annotations, so potentially someone could look at this section if a literature search retrieved a high volume of papers to review.

I wish there was the same ratio of curators per article describing human genes as there is in PomBase for pombe papers, and that the community contributed to the first pass of curation of the existing and new human articles. However, this is simply not the case. For example, if you look at the number of annotations (or unique GO terms) associated with each of the interleukins and compare this with the number of articles describing each interleukin, you will see that although the number of articles varies considerably (IL6 has 120,485 results, IL11 2,347), the number of GO annotations does not provide a good summary of the data available for just this set of genes.

P05231 IL6: 182 annotations, 100 GO IDs, 3 from PINC (these could probably be deleted, as 2 overlap with other IDs and 1 contradicts the other annotation (neg v pos regulation term), but who has the time to read the paper?)

P20809 IL11: 35 annotations, 19 GO IDs, 0 from PINC. I don't think 19 GO terms are likely to describe the data presented in 2000 IL11 papers.
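As a rough illustration of the gap described above, the sketch below computes articles per annotation and articles per GO ID using only the figures quoted in this comment. Treating each literature search "result" as one article is an assumption made for illustration, and the dictionary layout and function names are hypothetical.

```python
# Figures quoted in the comment above; "articles" assumes one search
# result corresponds to one article (an illustrative simplification).
INTERLEUKINS = {
    "IL6": {"uniprot": "P05231", "articles": 120_485, "annotations": 182, "go_ids": 100},
    "IL11": {"uniprot": "P20809", "articles": 2_347, "annotations": 35, "go_ids": 19},
}


def articles_per_annotation(entry: dict) -> float:
    """Number of articles for every GO annotation on the gene."""
    return entry["articles"] / entry["annotations"]


def articles_per_go_id(entry: dict) -> float:
    """Number of articles for every unique GO term on the gene."""
    return entry["articles"] / entry["go_ids"]


if __name__ == "__main__":
    for gene, entry in INTERLEUKINS.items():
        print(
            f"{gene} ({entry['uniprot']}): "
            f"{articles_per_annotation(entry):.0f} articles per annotation, "
            f"{articles_per_go_id(entry):.0f} articles per GO ID"
        )
```

On these figures the two genes differ by roughly an order of magnitude in articles per annotation, which is one way of expressing that annotation counts do not track the size of the literature.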

While doing the dbTFs, Pascale and I have removed a lot of PINC annotations, but if the only annotation describing that activity/process/location was provided by PINC, then it took a while to find expt data so that the established knowledge was not lost. With more than two-thirds of papers not providing species information for human/mouse/rat expts, it is not straightforward to find expt data to replace PINCs, and I am not convinced it is worthwhile to replace a PINC TAS/NAS with a UCL TAS/NAS. Certainly I would not want to see a GO annotation removed if it describes a 'well known' aspect that is only provided by a PINC.

But I do agree that somehow errors need to be removed. At present the GWAS community seems to be finding that Reactome provides a more reliable source of gene groups for their analysis than GO, which is very disappointing to me. I do worry about problems with the data that is mapped from Reactome to GO, which I do report when I notice them.

ValWood commented 4 years ago

Note that I am only suggesting that the TAS annotations supported by an experimental annotation are removed.

This would make it much clearer which annotations are not covered and need to be tracked down.
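Since the exact rules were asked for earlier in the thread, here is a minimal sketch of one possible reading of the proposal, not an agreed rule and not how Protein2GO works. It assumes an annotation is redundant when it has a TAS/NAS/IC evidence code, comes from an unmaintained source, and the same gene product has an experimental annotation to the same GO term from a maintained source. The `Annotation` record, the `split_redundant` function, the source names and the placeholder identifiers are all hypothetical; matching on more specific (child) terms would need the ontology and is left out.

```python
from dataclasses import dataclass

# GO experimental evidence codes, including the high-throughput ones.
EXPERIMENTAL = frozenset(
    {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "HTP", "HDA", "HMP", "HGI", "HEP"}
)
# Author/curator statement codes proposed for filtering in this thread.
STATEMENT = frozenset({"TAS", "NAS", "IC"})


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified annotation record."""
    gene_product: str
    go_id: str
    evidence: str
    assigned_by: str


def split_redundant(annotations, unmaintained=frozenset({"PINC"})):
    """Return (redundant, remaining) under the assumed rule.

    Only exact (gene product, GO term) matches are considered.
    """
    annotations = list(annotations)
    # Pairs that already have experimental support from a maintained source.
    supported = {
        (a.gene_product, a.go_id)
        for a in annotations
        if a.evidence in EXPERIMENTAL and a.assigned_by not in unmaintained
    }
    redundant, remaining = [], []
    for a in annotations:
        candidate = a.evidence in STATEMENT and a.assigned_by in unmaintained
        if candidate and (a.gene_product, a.go_id) in supported:
            redundant.append(a)
        else:
            remaining.append(a)
    return redundant, remaining


# Placeholder data for illustration only; not real annotations.
EXAMPLE = [
    Annotation("GP1", "GO:TERM_A", "TAS", "PINC"),          # covered by IDA below
    Annotation("GP1", "GO:TERM_A", "IDA", "MaintainedDB"),
    Annotation("GP1", "GO:TERM_B", "TAS", "PINC"),          # no experimental support
    Annotation("GP2", "GO:TERM_A", "NAS", "MaintainedDB"),  # maintained source, kept
]
REDUNDANT, REMAINING = split_redundant(EXAMPLE)
```

Under this rule, the PINC annotations left in the remaining set are exactly the ones with no experimental counterpart, which is the smaller list that would then need checking by hand.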

ValWood commented 4 years ago

Further checks would then be required to get rid of the out-of-date annotations, but that could come later. Filtering would only be step 1. Presumably filtering unmaintained resources would only be filtering very old papers. If the experiments were believed and had been reproduced, they would presumably be cited by the newer papers.

We need to be very cautious about curating data from papers more than 10 years old that isn't reproduced or re-reported in later papers. If I was curating human, I would begin with the newest papers from the high-quality journals and work backwards, picking out the papers that the recent papers cited to fill in the backlog.

ValWood commented 4 years ago

"I wish there was the same ratio of curators per article describing human genes as there are in PomBase"

I wish there was the same ratio of curators per gene as there are in human ;)

RLovering commented 4 years ago

Thanks for confirming. I was just worried by your comment: "If an annotation has not been replaced by now with an experiment, it should be either i) an annotation priority or ii) a red flag for a potential error..." but now I understand what you were saying. I guess it would be interesting to know how many PINCs fall into this category ;)

ValWood commented 4 years ago

It seems like it would be a nice way to get rid of some redundancy and then afterwards either i) flag papers for curation priority OR ii) consider removal of the annotation