SACGF / variantgrid

VariantGrid public repo

Ontology versioning #206

Closed: davmlaw closed this issue 1 year ago

davmlaw commented 3 years ago

At the moment the phenotype node looks up the ontology to return a list of genes to filter by.

This gene list will change over time as the ontology_term table changes.

We'd like to version this so that historical analyses will return consistent results, even as ontology terms are updated regularly.

Current workaround: Don't upgrade ontology terms after deployment so we only have 1 version.

This keeps results consistent, and when we finally get around to versioning there will be only 1 historical version to deal with.

TheMadBug commented 3 years ago

As the data is incrementally upgraded, rather than have a version we can just run off the "created" date for the ontology_term_relationships. The only reason we can't do this right now is that ontology_term_relationships are hard deleted when they're no longer valid (which does happen). To solve this, instead of hard deleting we need a delete_date. Then when we add a relationship that already exists but is marked deleted, we just insert a duplicate record (we will have to remove a unique index). This is an unlikely scenario, but duplicating records will cover it.

davmlaw commented 3 years ago

Records also change; they don't just get added/removed.

Eg I raised an issue that got an alias reassigned from 1 HPO record to another.

I just added a lot of fields and changed HGNCNames to HGNC, so please don't modify that model for a few days, thanks.

TheMadBug commented 3 years ago

Ahh, I didn't think the contents of aliases would affect the analysis, just the relationships. Very happy to leave that code alone for a while. Nearly all my usage goes via the convenience methods (snakes, get children etc), so hopefully that minimises the effort required to lift over. (Also, I'll be working in another branch for a few days yet.)

davmlaw commented 3 years ago

I used to store the alias, but now only store the final term, so it wouldn't affect an analysis now, unless someone re-matched a patient phenotype text to terms.

But eg having a gene added/removed from association with a term definitely would.

TheMadBug commented 3 years ago

But having a gene association added/removed is a GeneTermRelationship record, which is what I was describing above with the use of a created date and changing to a soft delete with a deleted date.

davmlaw commented 3 years ago

Critical - needs to be done before next ontology update in 3.1

TheMadBug commented 3 years ago

Just a reminder (since aliases aren't an issue anymore) rather than storing duplicate records, we will be able to:

Then to restrict the terms and relationships to a date, we can just check that the created date is earlier than the version date and that any delete date is later than the version date.

davmlaw commented 3 years ago

Might be worth running some benchmarks and seeing how complicated the queries would need to be.

If you write a record for each version, you can partition the table by version, and the queries are very straightforward: just do a FilteredRelation with a Q, which causes a JOIN ON, and the DB will only retrieve from the single partition.

If you have soft deletes, you'd need to go through all versions, then group and order them. Not sure how that would perform with lots of versions
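To make the record-per-version idea concrete, here is a minimal sketch of the query shape, with the version condition sitting in the JOIN's ON clause. It uses the standard library's sqlite3 purely so it runs anywhere; SQLite has no partitioning, so the `version` column only stands in for what would be the partition key in Postgres. The table and column names (`term`, `term_gene_relation`, `version`) and the gene/term values are made up for illustration and are not VariantGrid's schema.

```python
import sqlite3

# Hypothetical, simplified schema: one relation row is written per version.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE term (id TEXT PRIMARY KEY);
    CREATE TABLE term_gene_relation (
        version INTEGER NOT NULL,  -- would be the partition key in Postgres
        term_id TEXT NOT NULL REFERENCES term(id),
        gene    TEXT NOT NULL
    );
    INSERT INTO term VALUES ('TERM:1'), ('TERM:2');
    INSERT INTO term_gene_relation VALUES
        (1, 'TERM:1', 'GENE_A'),
        (1, 'TERM:1', 'GENE_B'),
        (2, 'TERM:1', 'GENE_A'),
        (2, 'TERM:1', 'GENE_C'),
        (2, 'TERM:2', 'GENE_D');
    """
)


def genes_by_term(version):
    """Map every term to its genes in one version; terms with no genes get []."""
    rows = conn.execute(
        """
        SELECT term.id, r.gene
        FROM term
        LEFT JOIN term_gene_relation AS r
               ON r.term_id = term.id AND r.version = ?
        ORDER BY term.id, r.gene
        """,
        (version,),
    )
    result = {}
    for term_id, gene in rows:
        genes = result.setdefault(term_id, [])
        if gene is not None:  # NULL gene means the term had no match in this version
            genes.append(gene)
    return result


print(genes_by_term(1))
print(genes_by_term(2))
```

The only thing that changes between versions is the bound parameter, which is what makes this layout easy to reason about. Whether it beats soft deletes in practice would still need the benchmarks mentioned above.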

TheMadBug commented 3 years ago

"If you have soft deletes, you'd need to go through all versions, then group and order them."

Won't it just be selecting relationships where created <= date and (deleted is null or deleted > date)?

But yeah, if you partition them away, that's not too messy.
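For comparison, the soft-delete lookup described in this comment could look roughly like the sketch below. Again it uses sqlite3 only to stay self-contained; the table `term_gene_relation`, its `created`/`deleted` columns and the sample rows are hypothetical. Dates are stored as ISO strings, which compare correctly as text.

```python
import sqlite3

# Hypothetical, simplified table: names are illustrative, not VariantGrid's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE term_gene_relation (
        term    TEXT NOT NULL,
        gene    TEXT NOT NULL,
        created TEXT NOT NULL,  -- ISO date the relation was first imported
        deleted TEXT            -- NULL while the relation is still valid
    )
    """
)
conn.executemany(
    "INSERT INTO term_gene_relation VALUES (?, ?, ?, ?)",
    [
        ("TERM:1", "GENE_A", "2021-01-01", None),          # never removed
        ("TERM:1", "GENE_B", "2021-01-01", "2021-06-01"),  # soft deleted in June
        ("TERM:1", "GENE_C", "2021-09-01", None),          # added later
    ],
)


def genes_as_of(term, as_of):
    """Gene symbols linked to `term` as the table stood on `as_of` (ISO date)."""
    rows = conn.execute(
        """
        SELECT gene FROM term_gene_relation
        WHERE term = ?
          AND created <= ?
          AND (deleted IS NULL OR deleted > ?)
        ORDER BY gene
        """,
        (term, as_of, as_of),
    )
    return [gene for (gene,) in rows]


print(genes_as_of("TERM:1", "2021-03-01"))  # ['GENE_A', 'GENE_B']
print(genes_as_of("TERM:1", "2021-12-01"))  # ['GENE_A', 'GENE_C']
```

No grouping or ordering across versions is needed here, but every historical row stays in the one table, so the range conditions have to be evaluated against all of them.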

davmlaw commented 2 years ago

We probably only need to version OntologyTermRelationship, as the codes aren't changed. The aliases may change, but we can just always have them on the latest version.

If all else fails, something we could do to retain backwards compatibility would be to turn gene/disease associations into a gene list and then just store that.

davmlaw commented 2 years ago

Something to think about is that we already have something that is like an ontology version, which is OntologyImport

OntologyTermRelation table is made up of data from multiple sources, each of which can be updated separately:

In [1]: from ontology.models import OntologyTermRelation                                                                     
In [2]: otr_qs = OntologyTermRelation.objects.all()                                                                          
In [3]: otr_qs.count()                                                                                                       
Out[3]: 315202
In [11]: get_field_counts(otr_qs, "from_import__filename")                                                                   
Out[11]: 
{'hp.owl': 164644,
 'https://search.thegencc.org/download/action/submissions-export-csv': 9673,
 'mondo.json': 54810,
 'OMIM_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt': 86075}

davmlaw commented 2 years ago

OK, have made new OntologyTermRelation model that uses the new Postgres partitions - working on ontology_versioning branch

I need to be able to version OntologyTermRelation so that eg an analysis from 2 years ago retrieves the same genes for a term over time.

The plan is to make OntologyTermRelation use the newer Postgres automatic partitions (partition by LIST) using “from_import” (OntologyImport)

This works fine, except I need to be able to explicitly refer to them, eg GenCC version 1, MONDO version 2, HP version 2 etc. So I'd like to add a "version" and make it unique_together = ("import_source", "filename", "version")
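As a rough illustration of how such numbering could work, the sketch below assigns a per-file version counter so that every (import_source, filename, version) triple is unique. `ImportRecord` and `assign_versions` are hypothetical stand-ins written for this example; they are not the OntologyImport model or any existing VariantGrid helper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImportRecord:
    """Hypothetical stand-in for an OntologyImport row."""
    import_source: str
    filename: str
    version: Optional[int] = None


def assign_versions(records):
    """Number records 1, 2, 3... within each (import_source, filename), in list order."""
    next_version = defaultdict(lambda: 1)
    for record in records:
        key = (record.import_source, record.filename)
        record.version = next_version[key]
        next_version[key] += 1
    return records


records = assign_versions([
    ImportRecord("MONDO", "mondo.json"),
    ImportRecord("HP", "hp.owl"),
    ImportRecord("MONDO", "mondo.json"),
    ImportRecord("HP", "phenotype_to_genes.txt"),
])
keys = [(r.import_source, r.filename, r.version) for r in records]
assert len(keys) == len(set(keys))  # the would-be unique_together constraint holds
print(keys)
```

The second MONDO import becomes "MONDO version 2", which is the kind of explicit handle described above.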

Here’s the list of total OntologyImport records:

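from collections import Counter, defaultdict
from ontology.models import OntologyImport

# Count OntologyImport records per import_source, then per filename
x = defaultdict(Counter)
for import_source, filename in OntologyImport.objects.all().values_list("import_source", "filename"):
    x[import_source][filename] += 1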

defaultdict(collections.Counter,
            {'MONDO': Counter({'mondo.json': 2}),
             'HGNC': Counter({'HGNC Aliases': 683}),
             'biomart': Counter({'mart_export.txt': 1}),
             'HP': Counter({'phenotype_to_genes.txt': 2, 'hp.owl': 1}),
             'gencc': Counter({'https://search.thegencc.org/download/action/submissions-export-csv': 1}),
             'HGNC Sync': Counter({'HGNC Aliases': 1})})

But some imports don't make relations, they only make leaves. Eg the import sources present in OntologyTermRelation are:

Counter(OntologyTermRelation.objects.values_list("from_import__import_source", flat=True))
Counter({'gencc': 9673, 'MONDO': 57356, 'HP': 579192})

TODO:

davmlaw commented 2 years ago

To update the ones that are now versioned:

wget https://search.thegencc.org/download/action/submissions-export-csv --output-document=gencc-submissions.csv
wget http://purl.obolibrary.org/obo/mondo.json
wget https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.owl
wget http://purl.obolibrary.org/obo/hp/hpoa/phenotype_to_genes.txt

then

export ONTOLOGY_DIR=/data/annotation/variantgrid_setup_data/new_ontology

python3 manage.py ontology_import --mondo ${ONTOLOGY_DIR}/mondo.json --hpo ${ONTOLOGY_DIR}/hp.owl --phenotype_to_genes ${ONTOLOGY_DIR}/phenotype_to_genes.txt --gencc ${ONTOLOGY_DIR}/gencc-submissions.csv

davmlaw commented 2 years ago

James mentioned during the code walkthrough that possibly lots of the ontology snake methods could be moved to OntologyVersion.

He agrees that the snake code could be moved into its own file.

To test the different versions:

from ontology.models import OntologyVersion, OntologyTerm

# Compare two ontology versions (first()/last() order by primary key
# unless the model defines its own ordering)
ov1 = OntologyVersion.objects.first()
ov2 = OntologyVersion.objects.last()
same = 0
diff_set = set()  # terms whose gene symbols differ between the two versions
for i, ot in enumerate(OntologyTerm.objects.all()):
    if i % 100 == 0:
        print(f"Checked {i} - {same=} vs diff={len(diff_set)}")
        if i == 500:  # only sample the first 500 terms
            break

    symbols1 = set(ov1.gene_symbols_for_terms([ot]))
    symbols2 = set(ov2.gene_symbols_for_terms([ot]))
    if symbols1 != symbols2:
        diff_set.add(ot)
    else:
        same += 1

davmlaw commented 2 years ago

Shariant test: There should be no change in ontology functionality. Eg condition matching, viewing ontology pages etc

EmmaTudini commented 2 years ago

Is this a KABOOM situation or should I be looking for subtle changes in condition matching?

EmmaTudini commented 2 years ago

James to look into PanelApp not being brought in.

TheMadBug commented 2 years ago

PanelApp is now being brought in (it was an issue originally because it exists outside of versioning, since we grab pieces live instead of doing a large static import).

This can be demonstrated now using the example where PanelApp was missing from the UI (despite being in the database at the time): https://test.shariant.org.au/ontology/term/HGNC_6743

So yeah, not a kaboom kind of issue. It's more a question of whether, as we remove all these duplicate results, we accidentally remove stuff we're not meant to. HPO relationships, MONDO relationships and GenCC relationships are all versioned (the OMIM file doesn't contain any relationships, and it's only the relationships that are versioned, not the terms themselves).