Closed davmlaw closed 1 year ago
As the data is incrementally upgraded, rather than have a version, we can just run off the "created" date for the ontology_term_relationships. The only reason we can't do this right now is that ontology_term_relationships are hard deleted when they're no longer valid (which does happen). To solve this, instead of hard deleting, we need a delete_date. Then when we add a relationship, if the relationship already exists but is deleted, we just duplicate it (we will have to remove a unique index). This is an unlikely scenario, but duplicating records will cover it.
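A minimal sketch of the soft-delete bookkeeping described above, in plain Python rather than the project's Django models. The `Relationship` and `RelationshipStore` classes, their field names, and the term identifiers are all hypothetical; the only point is that deleting sets a date, and re-adding a deleted relationship appends a new row instead of reviving the old one.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Relationship:
    # Hypothetical stand-in for one ontology_term_relationships row
    source: str
    dest: str
    created: date
    deleted: Optional[date] = None  # None means the row is still valid


class RelationshipStore:
    def __init__(self) -> None:
        self.rows: List[Relationship] = []

    def _live(self, source: str, dest: str) -> Optional[Relationship]:
        # Return the currently valid row for this pair, if there is one
        for row in self.rows:
            if (row.source, row.dest) == (source, dest) and row.deleted is None:
                return row
        return None

    def add(self, source: str, dest: str, when: date) -> Relationship:
        existing = self._live(source, dest)
        if existing is not None:
            return existing  # already present and valid, nothing to do
        # Either brand new, or only soft-deleted copies exist: add a new row
        row = Relationship(source, dest, created=when)
        self.rows.append(row)
        return row

    def soft_delete(self, source: str, dest: str, when: date) -> None:
        row = self._live(source, dest)
        if row is not None:
            row.deleted = when  # keep the row, just record when it ended


store = RelationshipStore()
store.add("TERM:1", "GENE_A", date(2020, 1, 1))
store.soft_delete("TERM:1", "GENE_A", date(2020, 6, 1))
store.add("TERM:1", "GENE_A", date(2021, 1, 1))  # duplicates the pair
```

After these three calls the store holds two rows for the same pair, which is why a unique index on the pair alone could not stay in place.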
Records also change; they don't just get added/removed.
Eg I raised an issue that got an alias reassigned from 1 HPO record to another.
I just added a lot of fields and changed HGNCNames to HGNC, so please don't add to or modify that model for a few days, thanks
Ahh, I didn't think the contents of aliases would affect the analysis, just the relationships. Very happy to leave that code alone for a while; nearly all my usage goes via the convenience methods (snakes, get children etc), so hopefully that minimises the effort required to lift over. (Also I'll be working in another branch for a few days yet)
I used to store the alias, but now only store the final term, so it wouldn't affect an analysis now, unless someone re-matched a patient phenotype text to terms.
But eg having a gene added/removed from association with a term definitely would.
But having a gene association added/removed is a GeneTermRelationship record, which is what I was describing above with the use of a created date and changing to a soft delete with a deleted date.
Critical - needs to be done before next ontology update in 3.1
Just a reminder (since aliases aren't an issue anymore) rather than storing duplicate records, we will be able to:
Then to restrict the terms and relationships to a date, we can just check where the created date is less than the version date and any delete date is later than the version date.
Might be worth running some benchmarks and seeing how complicated the queries would need to be.
If you write a record for each version, you can partition it by version, and the queries are very straightforward: just do a FilteredRelation with a Q, which causes a JOIN ON, and the DB will only retrieve from the single partition
If you have soft deletes, you'd need to go through all versions, then group and order them. Not sure how that would perform with lots of versions
"If you have soft deletes, you'd need to go through all versions, then group and order them."
Won't it just be selecting relationships where created <= date and (deleted is null or deleted > date)?
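To make that filter concrete, here is a small sketch using Python's built-in `sqlite3` with an in-memory table. The `relation` table, its columns, and the sample rows are made up for illustration and are not the project's schema; dates are stored as ISO strings so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relation (source TEXT, dest TEXT, created TEXT, deleted TEXT)"
)
conn.executemany(
    "INSERT INTO relation VALUES (?, ?, ?, ?)",
    [
        ("TERM:1", "GENE_A", "2020-01-01", None),          # never deleted
        ("TERM:1", "GENE_B", "2020-01-01", "2021-01-01"),  # soft deleted
        ("TERM:1", "GENE_C", "2021-06-01", None),          # added later
    ],
)


def genes_as_of(term: str, as_of: str) -> list:
    """Return the genes related to `term` as they stood on `as_of`."""
    rows = conn.execute(
        """
        SELECT dest FROM relation
        WHERE source = ?
          AND created <= ?
          AND (deleted IS NULL OR deleted > ?)
        ORDER BY dest
        """,
        (term, as_of, as_of),
    ).fetchall()
    return [dest for (dest,) in rows]


print(genes_as_of("TERM:1", "2020-06-01"))
print(genes_as_of("TERM:1", "2021-03-01"))
print(genes_as_of("TERM:1", "2022-01-01"))
```

The parentheses around the two `deleted` checks matter, because `AND` binds more tightly than `OR` in SQL. No grouping or ordering across versions is needed in this form, although how it performs on the real table would still need benchmarking.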
But yeah, if you partition them away, that's not too messy.
We probably only need to version OntologyTermRelationship - as the codes aren't changed. The aliases may but we can just always have them on the latest version
If all else fails, something we could do to retain backwards compatibility would be to turn gene/disease associations into a gene list and then just store that.
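A rough sketch of that fallback, assuming the associations are available as a simple mapping from term to gene symbols. The `snapshot_gene_list` helper, the mapping, and the identifiers are hypothetical; the idea is only that the resolved gene list is frozen when the analysis is created, so later ontology updates cannot change it.

```python
import json
from typing import Dict, Iterable, Set


def snapshot_gene_list(
    term_to_genes: Dict[str, Set[str]], terms: Iterable[str]
) -> str:
    """Resolve terms to gene symbols now and serialise the result for storage."""
    genes = sorted({g for t in terms for g in term_to_genes.get(t, set())})
    return json.dumps(genes)


# Made-up associations at the time the analysis is created
associations = {"TERM:1": {"GENE_B", "GENE_A"}, "TERM:2": {"GENE_C"}}
stored = snapshot_gene_list(associations, ["TERM:1", "TERM:2"])

# A later ontology update does not affect what was stored
associations["TERM:1"].discard("GENE_A")
frozen_genes = json.loads(stored)
```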
Something to think about is that we already have something that is like an ontology version, which is OntologyImport
OntologyTermRelation table is made up of data from multiple sources, each of which can be updated separately:
In [1]: from ontology.models import OntologyTermRelation
In [2]: otr_qs = OntologyTermRelation.objects.all()
In [3]: otr_qs.count()
Out[3]: 315202
In [11]: get_field_counts(otr_qs, "from_import__filename")
Out[11]:
{'hp.owl': 164644,
'https://search.thegencc.org/download/action/submissions-export-csv': 9673,
'mondo.json': 54810,
'OMIM_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt': 86075}
OK, have made new OntologyTermRelation model that uses the new Postgres partitions - working on ontology_versioning branch
I need to be able to version OntologyTermRelation so that eg an analysis from 2 years ago retrieves the same genes for a term over time.
The plan is to make OntologyTermRelation use the newer Postgres automatic partitions (partition by LIST) using “from_import” (OntologyImport)
This works fine, except I need to be able to explicitly refer to them, eg GenCC version 1, MONDO version 2, HP version 2 etc. So I’d like to add a “version” and make it unique_together = (“import_source”, “filename”, “version”)
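One way the existing imports could be numbered, sketched in plain Python. The `assign_versions` helper is hypothetical and assumes the imports are supplied in creation order; it gives each (import_source, filename) pair its own running counter, so the resulting triples satisfy the proposed uniqueness constraint.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_versions(
    imports: List[Tuple[str, str]]
) -> List[Tuple[str, str, int]]:
    """Number each (import_source, filename) pair 1, 2, 3... in input order."""
    counters: Dict[Tuple[str, str], int] = defaultdict(int)
    versioned = []
    for import_source, filename in imports:
        key = (import_source, filename)
        counters[key] += 1
        versioned.append((import_source, filename, counters[key]))
    return versioned


# Sample input only: two MONDO imports and one HP import, oldest first
history = [
    ("MONDO", "mondo.json"),
    ("HP", "hp.owl"),
    ("MONDO", "mondo.json"),
]
versioned = assign_versions(history)
```

With this numbering, "MONDO version 2" refers to the second `mondo.json` import regardless of how many other sources were imported in between.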
Here’s the list of total OntologyImport records:
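from collections import Counter, defaultdict

# Count OntologyImport records per import_source and filename
x = defaultdict(Counter)
for import_source, filename in OntologyImport.objects.all().values_list("import_source", "filename"):
    x[import_source][filename] += 1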
defaultdict(collections.Counter,
{'MONDO': Counter({'mondo.json': 2}),
'HGNC': Counter({'HGNC Aliases': 683}),
'biomart': Counter({'mart_export.txt': 1}),
'HP': Counter({'phenotype_to_genes.txt': 2, 'hp.owl': 1}),
'gencc': Counter({'https://search.thegencc.org/download/action/submissions-export-csv': 1}),
'HGNC Sync': Counter({'HGNC Aliases': 1})})
But some things don’t make relations - they only make leaves, eg the imports in OntologyTermRelation are:
Counter(OntologyTermRelation.objects.values_list("from_import__import_source", flat=True))
Counter({'gencc': 9673, 'MONDO': 57356, 'HP': 579192})
TODO:
To update the ones that are now versioned:
wget https://search.thegencc.org/download/action/submissions-export-csv --output-document=gencc-submissions.csv
wget http://purl.obolibrary.org/obo/mondo.json
wget https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.owl
wget http://purl.obolibrary.org/obo/hp/hpoa/phenotype_to_genes.txt
then
export ONTOLOGY_DIR=/data/annotation/variantgrid_setup_data/new_ontology
python3 manage.py ontology_import --mondo ${ONTOLOGY_DIR}/mondo.json --hpo ${ONTOLOGY_DIR}/hp.owl --phenotype_to_genes ${ONTOLOGY_DIR}/phenotype_to_genes.txt --gencc ${ONTOLOGY_DIR}/gencc-submissions.csv
James mentioned during the code walkthrough that possibly lots of the ontology snake methods could be moved to OntologyVersion
Agreed that snake could be moved into its own file
To test the different versions:
from ontology.models import OntologyVersion, OntologyTerm

ov1 = OntologyVersion.objects.first()
ov2 = OntologyVersion.objects.last()
same = 0
diff_set = set()  # terms whose gene symbols differ between the two versions
for i, ot in enumerate(OntologyTerm.objects.all()):
    if i % 100 == 0:
        print(f"Checked {i} - {same=} vs {len(diff_set)}")
    if i == 500:  # only sample the first 500 terms
        break
    symbols1 = set(ov1.gene_symbols_for_terms([ot]))
    symbols2 = set(ov2.gene_symbols_for_terms([ot]))
    if symbols1 != symbols2:
        diff_set.add(ot)
    else:
        same += 1
Shariant test: There should be no change in ontology functionality. Eg condition matching, viewing ontology pages etc
Is this a KABOOM situation or should I be looking for subtle changes in condition matching?
James to look into Panelapp not being brought in
PanelApp now being brought in (was an issue originally as it exists outside of versioning since we grab pieces live instead of doing a large static import).
Can be demonstrated now using the example found where PanelApp was missing from the UI (despite even being in the database at the time) https://test.shariant.org.au/ontology/term/HGNC_6743
So yeah, not a kaboom kind of issue. The risk is more that, as we remove all these duplicate results, we accidentally remove stuff we're not meant to. HPO relationships, MONDO relationships and GenCC relationships are all versioned (the OMIM file doesn't contain any relationships, and it's only the relationships that are versioned, not the terms themselves).
At the moment the phenotype node looks up the ontology to return a list of genes to filter.
This gene list will change over time as the ontology_term table changes.
We'd like to version this so that historical analyses will return consistent results, even as ontology terms are updated regularly.
Current workaround: Don't upgrade ontology terms after deployment so we only have 1 version.
This will make things consistent, and when we finally get around to versioning, there will only be 1 historical version