geneontology / minerva

BSD 3-Clause "New" or "Revised" License

Filter the Noctua GPAD outputs for 'duplicate' annotations (new tool) #424

Open ukemi opened 2 years ago

ukemi commented 2 years ago

As part of the Noctua annotation import process at MGI, I have asked @loricorbani to turn off the step in our pipeline that filters 'duplicate' annotations. One reason for this is that all of the annotations coming from MGI should be retained during the first round-trip, and presumably nothing should be detected as a duplicate. However, as we move forward, I think it will be possible to generate annotations from GO-CAM causal models that are essentially duplicates, because in some cases MF annotations are used in more than one pathway/model and we may decide to cull some of the annotation extension data that is currently being emitted. The responsibility for representing this information in a more concise manner should fall on the GOC pipeline, since eventually the GOC will be the primary source of these annotations and I assume they will not be picked up from MGI once the switchover to Noctua is complete. I don't think it is too early to start thinking about this issue.

balhoff commented 2 years ago

Just want to note that if the duplicates are from different models, then the fix for this would probably be at the pipeline aggregation level rather than within Minerva.

ukemi commented 2 years ago

This makes sense to me, assuming that when we define a 'duplicate' it does not include the properties about model identity. Would it make sense for us to take a look at the GPADs generated now and determine which columns and properties we are going to use to define a duplicate?
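To make the question concrete, here is a minimal sketch of one possible duplicate definition. It assumes tab-separated GPAD lines whose last column holds the `|`-separated annotation properties, and it assumes that `noctua-model-id` is the only property excluded from the comparison. The function name `duplicate_key` and the `IGNORED_PROPERTIES` set are hypothetical; which columns and properties actually belong in the key is exactly what still needs to be decided.

```python
# Hypothetical: properties that should NOT count when comparing annotations.
IGNORED_PROPERTIES = {"noctua-model-id"}


def duplicate_key(line, ignored=IGNORED_PROPERTIES):
    """Return a hashable key; two lines with equal keys count as duplicates.

    Assumes a tab-separated GPAD line whose final column is the
    annotation properties field (name=value pairs joined by '|').
    """
    *core, props = line.rstrip("\n").split("\t")
    # Keep every property except the ignored ones; sort so order is irrelevant.
    kept = sorted(
        p for p in props.split("|")
        if p and p.split("=", 1)[0] not in ignored
    )
    return tuple(core) + (tuple(kept),)
```

Lines could then be filtered by keeping the first line seen for each key, for example with a `set` of keys already emitted.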

ukemi commented 2 years ago

Of course, this means that if we do filter the 'duplicate' annotations from different models, downstream resources will not be able to use the model-ids to point to models because the information will be incomplete in the GPAD file. Maybe this is ok, but then should we also filter out the model-id property from the GPAD line?

ukemi commented 2 years ago

An alternative to filtering would be to consolidate annotations based on different fields in the file. For example, multiple models would simply become a single line in the GPAD file with multiple model properties. Presumably they will all have status production.

MGI MGI:2685011 involved_in GO:0009653 PMID:26258302 ECO:0000315 MGI:MGI:4867020 20160729 MGI part_of(GO:0003183),results_in_morphogenesis_of(UBERON:0007151) contributor=https://orcid.org/0000-0001-7476-6306|model-state=production|noctua-model-id=gomodel:56aac7ad00000038|noctua-model-id=secondmodelwiththisdata
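A rough sketch of how that consolidation could work, under stated assumptions: lines are tab-separated, the last column is the `|`-separated properties field, and two lines are merged only when every other column is identical. The function name `consolidate` is hypothetical, and this is not the existing GPAD export code.

```python
def consolidate(lines):
    """Merge GPAD lines that differ only in their properties column.

    Assumes tab-separated lines with the properties field last.
    Properties are unioned in first-seen order, so two models yield
    one line carrying two noctua-model-id entries.
    """
    merged = {}  # core columns -> list of unique properties (insertion-ordered)
    for line in lines:
        *core, props = line.rstrip("\n").split("\t")
        bucket = merged.setdefault(tuple(core), [])
        for prop in props.split("|"):
            if prop and prop not in bucket:
                bucket.append(prop)
    return [
        "\t".join(core + ("|".join(props),))
        for core, props in merged.items()
    ]
```

Note that this unions every property, so lines from different contributors would also collapse into one line listing both contributors. Whether that is desirable depends on which fields are chosen for concatenation.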

ukemi commented 2 years ago

An additional note: the models produced by the Noctua import currently emit many more lines of data than the single annotation they were originally based on, so this will also affect the Alliance GO summary pages for a gene. Many lines on the Alliance web pages will look identical if the pages don't display all of the data in an annotation. ping @cmungall @cindyJax

vanaukenk commented 2 years ago

From 2021-11-15 Noctua imports call:

Add functionality to the GPAD generation script (or write a new script) to re-compact the Noctua GPAD export files by concatenating the fields whose entries differ between lines that are otherwise identical.

Fields to concatenate:

vanaukenk commented 2 years ago

Just noting here that this ticket could/should be part of a larger future GO project on defining and dealing with annotation redundancy.