generate reports on differences in two gpad2.0 files

biolink / ontobio

python library for working with ontologies and ontology associations

https://ontobio.readthedocs.io/en/latest/

BSD 3-Clause "New" or "Revised" License

generate reports on differences in two gpad2.0 files #540

Open sierra-moxon opened 3 years ago

sierra-moxon commented 3 years ago

[ ] Command-line tool that takes two GPAD files and produces a "diff" report between them.
[ ] High-level summary statistics generated, e.g., N genes had M new annotations.
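
To make the two tasks above concrete, here is a minimal sketch of a file-level diff, not the ontobio implementation. It assumes tab-separated GPAD 2.0 rows, header lines starting with `!`, and the bioentity ID in the first column; the function names `read_gpad`, `diff_gpad`, and `summary_lines` are hypothetical.

```python
from collections import Counter


def read_gpad(lines):
    """Return the set of annotation rows, skipping blank lines and '!' headers."""
    rows = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        rows.add(tuple(line.split("\t")))
    return rows


def diff_gpad(old_lines, new_lines):
    """Return (added, removed) rows using exact, whole-row comparison."""
    old, new = read_gpad(old_lines), read_gpad(new_lines)
    return new - old, old - new


def summary_lines(added):
    """Summarize added rows as 'N gene(s) had M new annotation(s)'."""
    # Assumption: column 1 (index 0) holds the bioentity ID.
    per_gene = Counter(row[0] for row in added)
    by_count = Counter(per_gene.values())  # maps M -> N
    return [
        f"{n} gene(s) had {m} new annotation(s)"
        for m, n in sorted(by_count.items())
    ]
```

Because rows are compared as whole tuples, a change in any single column shows up as one removed row plus one added row, which is a reasonable starting point but not a semantic comparison.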

kltm commented 3 years ago

Tagging @ukemi @vanaukenk

ukemi commented 3 years ago

But realize that annotations_in != annotations_out for all groups. In some cases, incoming annotations will be split or deepened, depending on the final procedure for creating GPADs from Noctua. For example, if MGI has an annotation to organ development with the extension results_in_development_of lung, I believe this will currently be deepened to lung development. If an MGI annotation has two pipe-delimited extensions, it will be split into two separate annotations. We need to talk about what to do with pipe-delimited 'with' fields. A better comparison, though much harder to do, would be to check that the incoming GPAD file is semantically equivalent to the outgoing GPAD.
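
As a rough illustration of the splitting case only (deepening would need ontology reasoning and is not covered), a diff could normalize both files before comparing them. This sketch assumes rows are already parsed into 12-field tuples and that the annotation extensions sit in column 11 (index 10); `split_extensions` and `normalize` are hypothetical names.

```python
# Assumption: annotation extensions are in column 11 (index 10) of a GPAD 2.0 row.
EXT_COL = 10


def split_extensions(row):
    """Return one row per pipe-delimited extension; rows without a pipe come back unchanged."""
    rows = []
    for ext in row[EXT_COL].split("|"):
        fields = list(row)
        fields[EXT_COL] = ext
        rows.append(tuple(fields))
    return rows


def normalize(rows):
    """Expand every row so split and unsplit files can be compared as sets."""
    return {split for row in rows for split in split_extensions(row)}
```

With both sides normalized this way, an incoming annotation carrying two pipe-delimited extensions compares equal to the two outgoing annotations it was split into.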

sierra-moxon commented 3 years ago

Note another gpaddiff tool was developed: https://github.com/geneontology/gocamgen/tree/master/gpaddiff (thanks for the pointer @dustine32!)

sierra-moxon commented 2 years ago

The current iteration of this tool compares at the file level and attempts to compare at the semantic annotation level as well. It would be good to go over the results in an import meeting so we can see if it's on the right track! :)

kltm commented 2 years ago

@sierra-moxon This is really coming along! I'm running it again and having a bit of fun.

Minor question: one of the group_by_column arguments is "evidence_code"; is this actually mapping back to evidence codes, or is it just evidence (which I think might make more sense in a world with GPADs)? I think "subject" and "object" might be a bit odd for more casual users, both as input parameters and in the output. I might suggest GO term / bioentity or similar.
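
To illustrate the naming suggestion, here is a hedged sketch of a grouping step that accepts user-facing names instead of "subject" and "object". The alias table, the column positions (taken from my reading of the GPAD 2.0 layout), and the `count_by` function are assumptions for illustration, not the tool's actual options.

```python
from collections import Counter

# Hypothetical user-facing names mapped to assumed GPAD 2.0 column indexes.
COLUMN_ALIASES = {
    "bioentity": 0,  # column 1: annotated entity ("subject")
    "go_term": 3,    # column 4: ontology class ("object")
    "evidence": 5,   # column 6: evidence type (an ECO ID, not a GAF-style code)
}


def count_by(rows, group_by):
    """Count annotation rows per value of the chosen column alias."""
    if group_by not in COLUMN_ALIASES:
        raise ValueError(f"unknown group_by value: {group_by!r}")
    index = COLUMN_ALIASES[group_by]
    return Counter(row[index] for row in rows)
```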

I would also advocate for a "cli" or "machine" output mode for those interested in using the results in automated processes (raises hand) and quick exploration of differences. It would be more actual results and less "reporting" (the counts report may be usable for this), so it would be easier to pipeline into grep or jq; it would also be nice to select one of the outputs for STDOUT (fitting with a lot of what I do).
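
One possible shape for such a mode is JSON Lines on STDOUT, one record per changed annotation, which pipes cleanly into grep or jq. This is only a sketch: the record keys, the column positions, and the `write_jsonl` function are assumptions, and the input is taken to be sets of already-parsed GPAD rows.

```python
import json
import sys


def write_jsonl(added, removed, stream=sys.stdout):
    """Write one JSON object per changed annotation row to the given stream."""
    for change, rows in (("added", added), ("removed", removed)):
        for row in sorted(rows):
            record = {
                "change": change,
                "bioentity": row[0],  # assumed column 1
                "go_term": row[3],    # assumed column 4
                "evidence": row[5],   # assumed column 6
            }
            stream.write(json.dumps(record) + "\n")
```

Output in this form could then be filtered with something like `jq 'select(.change == "added")'`, while the human-readable report stays a separate output.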