The-Sequence-Ontology / MSO

Molecular Sequence Ontology

Derived SO (SO_refactored) differs from current release SO in various ways #13

Open cmungall opened 5 years ago

cmungall commented 5 years ago

I assume SO_refactored.owl/obo is the output of the compilation step (see #12).

If so, then it differs substantially from the current released version of SO. Some of the differences are technical and probably easy to fix, e.g. missing axiom annotations on synonyms. Others involve large changes to the hierarchy. Are all of these intentional, or are some of them bugs? What is the process for evaluating this and announcing any changes to the community?

cmungall commented 5 years ago

Example:

[image]

Another oddity: the definitions differ but the definition source is the same. Surely if a definition is changed substantially, then the definition source must change too.
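One way to look for this pattern systematically is sketched below. It is a minimal sketch, assuming both ontologies are available as OBO text and that definitions use the usual `def: "text" [source]` form; the function names `read_defs` and `changed_def_same_source` are hypothetical helpers, not part of any SO tooling.

```python
import re

# Matches an OBO definition line of the form: def: "text" [xref1, xref2]
DEF_RE = re.compile(r'^def:\s*"((?:[^"\\]|\\.)*)"\s*\[(.*?)\]')


def read_defs(obo_text):
    """Map term ID -> (definition text, definition source string)."""
    defs = {}
    current_id = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; its ID appears on a later line.
            current_id = None
        elif line.startswith("id:"):
            current_id = line[3:].split(" ! ")[0].strip()
        elif current_id:
            match = DEF_RE.match(line)
            if match:
                defs[current_id] = (match.group(1), match.group(2).strip())
    return defs


def changed_def_same_source(old_text, new_text):
    """Return sorted IDs whose definition text changed while the source did not."""
    old, new = read_defs(old_text), read_defs(new_text)
    return sorted(
        tid
        for tid in old.keys() & new.keys()
        if old[tid][0] != new[tid][0] and old[tid][1] == new[tid][1]
    )
```

This only flags exact string differences, so trivial edits such as punctuation changes would also be reported and would need manual triage.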

cmungall commented 5 years ago

OK, looking only at logical axioms, I see that >1k classes have at least one logical axiom changed. Some of these changes are not meaningful and may reflect redundancies in one or other of the hierarchies, but the majority seem meaningful; see attached.

diff.txt
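A comparison of this kind could be approximated along the following lines. This is a rough sketch rather than the method used to produce the attached diff: it assumes OBO-format input, treats only the tags listed in `LOGICAL_TAGS` as logical axioms, and compares asserted axioms as plain strings without any reasoning. The function names are hypothetical.

```python
# OBO tags treated as logical axioms in this sketch (an assumption, not a complete list).
LOGICAL_TAGS = ("is_a", "relationship", "intersection_of", "union_of", "disjoint_from")


def read_logical_axioms(obo_text):
    """Map term ID -> set of (tag, value) pairs for the logical tags above."""
    axioms = {}
    current_id = None
    for raw in obo_text.splitlines():
        # Drop trailing "! label" comments so that label edits are not counted.
        line = raw.split(" ! ")[0].strip()
        if line.startswith("["):
            current_id = None
        elif line.startswith("id:"):
            current_id = line[3:].strip()
            axioms.setdefault(current_id, set())
        elif current_id and ":" in line:
            tag, value = line.split(":", 1)
            if tag in LOGICAL_TAGS:
                axioms[current_id].add((tag, value.strip()))
    return axioms


def diff_logical_axioms(old_text, new_text):
    """Return {term ID: (removed axioms, added axioms)} for terms present in both."""
    old, new = read_logical_axioms(old_text), read_logical_axioms(new_text)
    changes = {}
    for tid in old.keys() & new.keys():
        removed = old[tid] - new[tid]
        added = new[tid] - old[tid]
        if removed or added:
            changes[tid] = (sorted(removed), sorted(added))
    return changes
```

Because only asserted axioms are compared, a redundant `is_a` that was dropped in one file still shows up as a change; telling those apart from meaningful changes would need a reasoner or manual review.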

cmungall commented 5 years ago

Note that a number of IDs seem to have disappeared from SO_refactored but are still present in MSO:

-id: SO:0000054 ! aneuploid
-id: SO:0000055 ! hyperploid
-id: SO:0000056 ! hypoploid
-id: SO:0000359 ! floxed
-id: SO:0000443 ! polymer attribute
-id: SO:0000628 ! chromosomal structural element
-id: SO:0000687 ! deletion junction
-id: SO:0000733 ! feature attribute
-id: SO:0000782 ! natural
-id: SO:0000784 ! foreign
-id: SO:0000814 ! rescue
-id: SO:0000817 ! wild type
-id: SO:0000831 ! gene member region
-id: SO:0000856 ! conserved
-id: SO:0000857 ! homologous
-id: SO:0000858 ! orthologous
-id: SO:0000859 ! paralogous
-id: SO:0000860 ! syntenic
-id: SO:0000976 ! cryptic
-id: SO:0001004 ! low complexity
-id: SO:0001079 ! polypeptide structural motif
-id: SO:0001234 ! mobile
-id: SO:0001409 ! biomaterial region
-id: SO:0001410 ! experimental feature
-id: SO:0001411 ! biological region
-id: SO:0001412 ! topologically defined region
-id: SO:0001761 ! variant quality
-id: SO:0001769 ! variant phenotype --> variant defined by phenotype
-id: SO:0001814 ! coding variant quality
-id: SO:0001815 ! synonymous
-id: SO:0001816 ! non synonymous
-id: SO:0001992 ! nonsynonymous variant
-id: SO:0100001 ! biochemical region of peptide
-id: SO:0100017 ! polypeptide conserved motif
-id: SO:1000160 ! unoriented insertional duplication --> insertional duplication of unspecified orientation
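
A list like the one above can be produced with a simple set difference over term IDs. The sketch below assumes both ontologies are available as OBO text; `read_term_ids` and `missing_ids` are hypothetical helper names, and the output format only imitates the `id ! name` style used above.

```python
def read_term_ids(obo_text):
    """Map term ID -> name for every stanza in an OBO document."""
    names = {}
    current_id = None
    for raw in obo_text.splitlines():
        # Drop trailing "! label" comments before inspecting the tag.
        line = raw.split(" ! ")[0].strip()
        if line.startswith("["):
            current_id = None
        elif line.startswith("id:"):
            current_id = line[3:].strip()
            names[current_id] = ""
        elif current_id and line.startswith("name:"):
            names[current_id] = line[5:].strip()
    return names


def missing_ids(reference_text, derived_text):
    """List 'ID ! name' for IDs present in the reference but absent from the derived file."""
    reference = read_term_ids(reference_text)
    derived = read_term_ids(derived_text)
    return [
        f"{tid} ! {reference[tid]}"
        for tid in sorted(reference.keys() - derived.keys())
    ]
```

This does not distinguish terms that were deliberately obsoleted from terms that were dropped by accident, so the result is a starting point for review, not a list of bugs.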

mikebada commented 5 years ago

Re allele, the logical axioms in the refactored MSO/SO were created to properly place the allele class in the new upper-level structuring. I'd argue that asserting allele as a subclass of variant_collection (as it is in the current public SO) is erroneous--even the original natural language definition asserts that it's one of a set of coexisting sequence variants of a gene.

That being said, we do realize that there's more work still to be done, including automatically fixing IDs, synonyms, and natural-language definitions, about which I'm talking with @msinclair2. I'm also going to start another manual review this week to look for errors to be manually fixed.

msinclair2 commented 5 years ago

All the ID annotations have been automatically fixed.

mikebada commented 5 years ago

Re classes in the MSO but not SO_refactored: Most of those listed above are specifically dependent continuants (SDCs). As I understand BFO, dependent continuants (DCs) can't themselves bear DCs, so there's no need for these SDCs in the SO, as the SO sequence entity classes are generically dependent continuants (GDCs).

There are a few sequence entity classes that I recommended for obsoletion, as I thought they were difficult to situate in the refactored upper-level structuring, which I've discussed with Karen and Michael. There may be a few other classes that were inadvertently dropped.

cmungall commented 5 years ago

By SDC, do you mean the quality branch? I can follow your reasoning from a strict BFO point of view, but you can't just toss out classes that many in the SO community use because of BFO.

mikebada commented 5 years ago

Yes, I meant the qualities and realizable entities. I talked about these with Karen a while ago, and as I recall she said that they were mostly created for the formal definitions and are not used by the sequence annotators. That being said, another item on the list of remaining tasks is to make sure that all classes that have actually been used in annotations are accounted for.