TranslatorSRI / Babel

Babel creates cliques of equivalent identifiers across many biomedical vocabularies.
MIT License

Validating a new Babel run #42

Open gaurav opened 2 years ago

gaurav commented 2 years ago

I've just completed my first run of Babel on Sterling (on a container with 500GB of memory!) using the changes in draft PR https://github.com/TranslatorSRI/Babel/pull/37. The results I've obtained (on Hatteras at /scratch/gaurav/babel-outputs/2022apr4) have lots of differences from the 2022-01-01 run, but I haven't come up with a good way of summarizing the changes or figuring out if it's working "correctly".

I've tried using diff/diffstat, but there are tons of changes, so it's not easy to see how significant they are. I tried diffing some files individually, and was able to find a few patterns: for example, the polypeptide LSM-37009 in synonyms/Polypeptide.txt is referred to as CHEBI:125504 in the new run and INCHIKEY:GGLDQJNBYFODOM-RDCMKPLUSA-N in the previous run.

Diffstat comparison of Jan 1 and Apr 4 Babel runs

```
 compendia/AnatomicalEntity.txt |284873
 compendia/BiologicalProcess.txt |55258
 compendia/Cell.txt |15690
 compendia/CellularComponent.txt |24855
 compendia/ChemicalEntity.txt |6976071
 compendia/ChemicalMixture.txt | 889
 compendia/ComplexMolecularMixture.txt | 296
 compendia/Disease.txt |654029
 compendia/Gene.txt |77898179 ++---
 compendia/GeneFamily.txt |55418
 compendia/GrossAnatomicalStructure.txt |20397
 compendia/MolecularActivity.txt |294143
 compendia/MolecularMixture.txt |16366879 -
 compendia/OrganismTaxon.txt |4783919
 compendia/Pathway.txt |104290
 compendia/PhenotypicFeature.txt |700283
 compendia/Polypeptide.txt | 753
 compendia/Protein.txt |456484451 ++++++++++++++++---------------------
 compendia/SmallMolecule.txt |204804339 +++++++--------
 conflation/GeneProtein.txt |16857753 -
 reports/AnatomicalEntity.txt | 100
 reports/BiologicalProcess.txt | 17
 reports/Cell.txt | 73
 reports/CellularComponent.txt | 60
 reports/ChemicalEntity.txt | 1451
 reports/ChemicalMixture.txt | 20
 reports/ComplexMolecularMixture.txt | 23
 reports/Disease.txt | 8278
 reports/Gene.txt | 72
 reports/GeneFamily.txt | 8
 reports/GrossAnatomicalStructure.txt | 80
 reports/MolecularActivity.txt | 70
 reports/MolecularMixture.txt | 2310
 reports/OrganismTaxon.txt | 12
 reports/Pathway.txt | 8
 reports/PhenotypicFeature.txt | 1175
 reports/Polypeptide.txt | 30
 reports/Protein.txt | 445
 reports/SmallMolecule.txt | 8790
 reports/disease_completeness.txt | 69
 reports/process_completeness.txt | 4
 synonyms/AnatomicalEntity.txt |624380
 synonyms/BiologicalProcess.txt |224432
 synonyms/Cell.txt |43534
 synonyms/CellularComponent.txt |57426
 synonyms/ChemicalEntity.txt |1428975
 synonyms/ChemicalMixture.txt | 3566
 synonyms/ComplexMolecularMixture.txt | 1672
 synonyms/Disease.txt |2407347
 synonyms/Gene.txt |1060645
 synonyms/GeneFamily.txt |55418
 synonyms/GrossAnatomicalStructure.txt |105021
 synonyms/MolecularActivity.txt |393102
 synonyms/MolecularMixture.txt |16811698 -
 synonyms/OrganismTaxon.txt |139483
 synonyms/Pathway.txt |109508
 synonyms/PhenotypicFeature.txt |1712740
 synonyms/Polypeptide.txt | 3133
 synonyms/Protein.txt |2871694
 synonyms/SmallMolecule.txt |215593042 +++++++---------
 60 files changed, 525618409 insertions(+), 504434267 deletions(-)
```

Probably the best way to compare the changes is by comparing line counts, which shows that most files are pretty similarly sized, except for compendia/ChemicalEntity.txt (which is 1577.58% bigger), compendia/MolecularMixture.txt (58.43% bigger) and synonyms/MolecularMixture.txt (56.51% bigger).
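
A minimal sketch of the line-count comparison described above, assuming both runs are available as local directories with the same relative layout (`compendia/`, `synonyms/`, and so on). The function names `count_lines`, `percentage_change`, and `compare_runs` are made up for illustration; this is not the script that produced the numbers in this issue.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count lines by streaming the file, so huge compendia never sit in memory."""
    with path.open("rb") as f:
        return sum(1 for _ in f)


def percentage_change(old: int, new: int) -> float:
    """Percentage change from old to new; 60 -> 123 lines is +105%."""
    return (new - old) / old * 100.0


def compare_runs(old_dir: Path, new_dir: Path) -> dict[str, tuple[int, int, float]]:
    """Map each file present in both runs to (old lines, new lines, % change)."""
    results = {}
    for old_file in sorted(old_dir.rglob("*")):
        if not old_file.is_file():
            continue
        rel = old_file.relative_to(old_dir)
        new_file = new_dir / rel
        if not new_file.is_file():
            continue  # files that exist in only one run need separate handling
        old_n, new_n = count_lines(old_file), count_lines(new_file)
        # An empty old file has no meaningful percentage change.
        pct = percentage_change(old_n, new_n) if old_n else float("nan")
        results[rel.as_posix()] = (old_n, new_n, pct)
    return results
```

Sorting the result by the absolute percentage change would put files like compendia/ChemicalEntity.txt at the top of the list.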

Does anybody have suggestions for comparing/validating the new Babel output before we try to move it to the dev server? We could for instance dump all the IDs alphabetically and run a massive diff on that. Having some method to do this would help with #36 as well.
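
The "dump all the IDs and diff them" idea could look roughly like the sketch below. It assumes each compendium line is a JSON object with an `identifiers` list whose entries carry the CURIE under the key `i`; that layout is an assumption and is not stated in this issue. The names `iter_identifiers` and `identifier_diff` are hypothetical. For files the size of compendia/Protein.txt the in-memory sets would be very large, so an external `sort` plus `comm` over the dumped identifiers may be the more practical route.

```python
import json
from typing import Iterable, Iterator


def iter_identifiers(lines: Iterable[str]) -> Iterator[str]:
    """Yield every identifier from JSON-lines compendium rows (assumed format)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        clique = json.loads(line)
        for entry in clique.get("identifiers", []):
            yield entry["i"]


def identifier_diff(
    old_lines: Iterable[str], new_lines: Iterable[str]
) -> tuple[list[str], list[str]]:
    """Return (identifiers only in the old run, identifiers only in the new run)."""
    old_ids = set(iter_identifiers(old_lines))
    new_ids = set(iter_identifiers(new_lines))
    return sorted(old_ids - new_ids), sorted(new_ids - old_ids)
```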

| File | January 1, 2022 | April 4, 2022 | Percentage change |
| --- | --- | --- | --- |
| reports/chemical_completeness.txt | 1 | 1 | 0.00% |
| reports/disease_completeness.txt | 60 | 123 | 105.00% |
| reports/taxon_done | 1 | 1 | 0.00% |
| reports/process_done | 1 | 1 | 0.00% |
| reports/ChemicalEntity.txt | 741 | 732 | -1.21% |
| reports/MolecularMixture.txt | 1144 | 1182 | 3.32% |
| reports/gene_done | 1 | 1 | 0.00% |
| reports/ChemicalMixture.txt | 15 | 17 | 13.33% |
| reports/protein_done | 1 | 1 | 0.00% |
| reports/anatomy_done | 1 | 1 | 0.00% |
| reports/MolecularActivity.txt | 39 | 41 | 5.13% |
| reports/Disease.txt | 4154 | 4174 | 0.48% |
| reports/OrganismTaxon.txt | 11 | 11 | 0.00% |
| reports/Protein.txt | 197 | 274 | 39.09% |
| reports/Cell.txt | 42 | 43 | 2.38% |
| reports/genefamily_done | 1 | 1 | 0.00% |
| reports/CellularComponent.txt | 36 | 40 | 11.11% |
| reports/process_completeness.txt | 3 | 1 | -66.67% |
| reports/ComplexMolecularMixture.txt | 18 | 15 | -16.67% |
| reports/taxon_completeness.txt | 1 | 1 | 0.00% |
| reports/anatomy_completeness.txt | 1 | 1 | 0.00% |
| reports/PhenotypicFeature.txt | 603 | 626 | 3.81% |
| reports/GrossAnatomicalStructure.txt | 43 | 45 | 4.65% |
| reports/Polypeptide.txt | 20 | 22 | 10.00% |
| reports/BiologicalProcess.txt | 15 | 14 | -6.67% |
| reports/disease_done | 1 | 1 | 0.00% |
| reports/gene_completeness.txt | 1 | 1 | 0.00% |
| reports/Pathway.txt | 11 | 11 | 0.00% |
| reports/genefamily_completeness.txt | 1 | 1 | 0.00% |
| reports/AnatomicalEntity.txt | 52 | 62 | 19.23% |
| reports/SmallMolecule.txt | 4384 | 4432 | 1.09% |
| reports/protein_completeness.txt | 1 | 1 | 0.00% |
| reports/chemicals_done | 1 | 1 | 0.00% |
| reports/GeneFamily.txt | 9 | 9 | 0.00% |
| reports/Gene.txt | 45 | 47 | 4.44% |
| compendia/ChemicalEntity.txt | 392499 | 6584478 | 1577.58% |
| compendia/MolecularMixture.txt | 6334558 | 10035657 | 58.43% |
| compendia/ChemicalMixture.txt | 475 | 482 | 1.47% |
| compendia/MolecularActivity.txt | 145925 | 149030 | 2.13% |
| compendia/Disease.txt | 322229 | 332754 | 3.27% |
| compendia/OrganismTaxon.txt | 2375027 | 2412122 | 1.56% |
| compendia/Protein.txt | 223676217 | 232834484 | 4.09% |
| compendia/Cell.txt | 7678 | 8210 | 6.93% |
| compendia/CellularComponent.txt | 12510 | 12623 | 0.90% |
| compendia/ComplexMolecularMixture.txt | 165 | 169 | 2.42% |
| compendia/PhenotypicFeature.txt | 355408 | 345793 | -2.71% |
| compendia/GrossAnatomicalStructure.txt | 10379 | 10238 | -1.36% |
| compendia/Polypeptide.txt | 408 | 409 | 0.25% |
| compendia/BiologicalProcess.txt | 27790 | 27714 | -0.27% |
| compendia/Pathway.txt | 52370 | 52452 | 0.16% |
| compendia/AnatomicalEntity.txt | 142269 | 143562 | 0.91% |
| compendia/SmallMolecule.txt | 104226454 | 100590131 | -3.49% |
| compendia/GeneFamily.txt | 27892 | 27770 | -0.44% |
| compendia/Gene.txt | 37802616 | 40108195 | 6.10% |
| synonyms/ChemicalEntity.txt | 698121 | 732464 | 4.92% |
| synonyms/MolecularMixture.txt | 6555269 | 10259687 | 56.51% |
| synonyms/ChemicalMixture.txt | 1856 | 1870 | 0.75% |
| synonyms/MolecularActivity.txt | 195416 | 198534 | 1.60% |
| synonyms/Disease.txt | 1189024 | 1219429 | 2.56% |
| synonyms/OrganismTaxon.txt | 69926 | 69993 | 0.10% |
| synonyms/Protein.txt | 1421157 | 1451959 | 2.17% |
| synonyms/Cell.txt | 20674 | 23034 | 11.42% |
| synonyms/CellularComponent.txt | 28577 | 29027 | 1.57% |
| synonyms/ComplexMolecularMixture.txt | 878 | 890 | 1.37% |
| synonyms/PhenotypicFeature.txt | 858920 | 855136 | -0.44% |
| synonyms/GrossAnatomicalStructure.txt | 52860 | 52553 | -0.58% |
| synonyms/Polypeptide.txt | 1641 | 1628 | -0.79% |
| synonyms/BiologicalProcess.txt | 112432 | 112364 | -0.06% |
| synonyms/Pathway.txt | 54941 | 55021 | 0.15% |
| synonyms/AnatomicalEntity.txt | 309911 | 315239 | 1.72% |
| synonyms/SmallMolecule.txt | 108292041 | 107313775 | -0.90% |
| synonyms/GeneFamily.txt | 27892 | 27770 | -0.44% |
| synonyms/Gene.txt | 497027 | 564344 | 13.54% |
| conflation/GeneProtein.txt | 8168582 | 8692887 | 6.42% |

cbizon commented 2 years ago

Yes, this is a question I have struggled with.

Looking at the sizes of the compendia themselves is a good way to get some idea of what's going on. The big changes in ChemicalEntity and MolecularMixture are both concerning. To me, that suggests that in both cases something is no longer merging, so we have many smaller cliques rather than fewer larger ones. And of course, the new way may be better; it will require a bit of digging.

The stuff in reports is meant as a partial answer to this problem.

The reports named *_done are not interesting; they are just artifacts that let snakemake know that the process has completed. Probably they should be written somewhere else? Maybe they can be dispensed with entirely with a bit of thought.

The *_completeness files are there to check for a particular kind of error: losing identifiers. In the past we have sometimes had input identifiers that didn't end up in any clique, so these files list the identifiers that didn't end up in cliques. Really, you want each of these to contain just the header, and you don't want them to increase in size. We have previously had a long-running issue in disease: https://github.com/TranslatorSRI/NodeNormalization/issues/64 . It looks like that one got worse; if it has the same cause then so be it, but maybe something else has changed.
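
As a rough illustration of that check, the sketch below counts the identifiers listed in a completeness file and flags files whose count grew between runs. It assumes exactly one header line per file, which is consistent with the one-line files in the table above but is still an assumption, and the names `missing_identifier_count` and `completeness_regressions` are hypothetical.

```python
from typing import Iterable


def missing_identifier_count(lines: Iterable[str]) -> int:
    """Number of non-blank lines after the (assumed single) header line."""
    non_blank = sum(1 for line in lines if line.strip())
    return max(non_blank - 1, 0)


def completeness_regressions(
    old_counts: dict[str, int], new_counts: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return {report name: (old, new)} for reports that list more identifiers now."""
    return {
        name: (old_counts.get(name, 0), new)
        for name, new in new_counts.items()
        if new > old_counts.get(name, 0)
    }
```

With the line counts reported earlier in this thread, disease_completeness.txt would go from 59 to 122 unmatched identifiers under the one-header-line assumption and be flagged, while process_completeness.txt would drop from 2 to 0 and not be.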

The most useful (?) files are the ones called e.g. reports/MolecularMixture.txt. First you get a cluster size distribution. It's often good to review the biggest clusters and see if anything can be done about them. Usually, though, if there's not a bulge at the end of the distribution, there's not much that can be done. The other useful thing is the prefix list. So for instance in that MM.txt, you see this:

frozenset({('UNII', 2), ('CAS', 1), ('PUBCHEM.COMPOUND', 2), ('INCHIKEY', 1)})  1
frozenset({('MESH', 1), ('PUBCHEM.COMPOUND', 1), ('UNII', 2), ('CHEMBL.COMPOUND', 1), ('UMLS', 3), ('INCHIKEY', 1)})    1
frozenset({('MESH', 1), ('PUBCHEM.COMPOUND', 1), ('HMDB', 1), ('UNII', 1), ('CHEBI', 1), ('CHEMBL.COMPOUND', 1), ('INCHIKEY', 1)})      1
...
frozenset({('INCHIKEY', 1), ('PUBCHEM.COMPOUND', 1), ('CHEMBL.COMPOUND', 1)})   87120
frozenset({('CAS', 1), ('PUBCHEM.COMPOUND', 1), ('INCHIKEY', 1)})       196331
frozenset({('INCHIKEY', 1), ('PUBCHEM.COMPOUND', 1)})   5863401

So the most common clique/cluster consists of a pubchem.compound id and an inchikey, and there are about 5.9 M like that. The next biggest group is that plus a CAS, with almost 200k, and so on down to clusters that only happen once, like some compound that has a mesh, a pubchem, an hmdb, a unii, a chebi, a chembl, and an inchikey.

So to figure out what's going on with the big % change in the compendia, I would look at the differences in these reports and see what is getting merged/split differently. With those big % changes I think they'll be easy to find? And then it's a matter of digging around and looking at examples and seeing if it's an improvement or not.
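
One way to look at the differences in these reports is to parse the prefix-list lines from both runs and rank the prefix combinations by how much their counts moved. The sketch below assumes every prefix line has the `frozenset({...})` form followed by whitespace and a count, as in the excerpt above, and silently skips anything else (such as the cluster size distribution). The names `parse_prefix_counts` and `biggest_changes` are invented for this example.

```python
import ast
import re
from typing import Iterable

# Matches e.g. "frozenset({('INCHIKEY', 1), ('PUBCHEM.COMPOUND', 1)})   5863401"
PREFIX_LINE = re.compile(r"^frozenset\((\{.*\})\)\s+(\d+)$")

PrefixCombo = frozenset  # frozenset of (prefix, count-within-clique) tuples


def parse_prefix_counts(lines: Iterable[str]) -> dict[PrefixCombo, int]:
    """Map each prefix combination to the number of cliques that have it."""
    counts = {}
    for line in lines:
        match = PREFIX_LINE.match(line.strip())
        if not match:
            continue  # not a prefix-list line
        # The braces hold a plain set literal of tuples, which literal_eval accepts.
        combo = frozenset(ast.literal_eval(match.group(1)))
        counts[combo] = int(match.group(2))
    return counts


def biggest_changes(
    old: dict[PrefixCombo, int], new: dict[PrefixCombo, int], top: int = 10
) -> list[tuple[PrefixCombo, int]]:
    """Prefix combinations with the largest absolute change in clique count."""
    deltas = {c: new.get(c, 0) - old.get(c, 0) for c in old.keys() | new.keys()}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]
```

A combination whose count collapses while several smaller combinations grow by a similar total would be a candidate for a merge that has stopped happening.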

I've made half-hearted attempts to directly compare two compendium runs to see what changed, but have never found anything I liked very much. Happy to try other ideas though.

gaurav commented 1 year ago

I've gotten some working diff-Babel-runs code that took around 9.5 hours (and over 500GB of memory!) to run a comparison on the entire Babel run between Jan 1 and May 21, 2022.

This comparison works by making a list of all the identifiers used in either the previous version of the compendium or the current version of the compendium -- for example, OrganismTaxon.txt has 2,494,604 such identifiers. For each identifier, it then collects all the cliques/clusters from the previous compendium and the current compendium, and assigns each identifier to one of a few general categories (detailed in https://github.com/gaurav/babel-validation/pull/5) -- for example, DELETED means that the identifier was present in one or more clusters/cliques in the previous compendium and is not present in the new compendium. My plan is to come up with additional ways of clustering identifiers based on measures that are useful to determine if we've lost data or started overly grouping clusters/cliques.
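
The per-identifier comparison described above can be sketched as follows. Only the DELETED category is taken from this comment; ADDED, UNCHANGED, and CHANGED are placeholder names chosen for illustration, and the real categories are the ones defined in the linked pull request. Cliques are modelled here as plain sets of identifier strings, and everything is held in memory, which matches the large memory footprint mentioned above but is not necessarily how the actual code works.

```python
from collections import defaultdict
from typing import Iterable


def index_by_identifier(cliques: Iterable[Iterable[str]]) -> dict[str, set[frozenset]]:
    """Map each identifier to the set of cliques (as frozensets) that contain it."""
    index = defaultdict(set)
    for clique in cliques:
        frozen = frozenset(clique)
        for identifier in frozen:
            index[identifier].add(frozen)
    return index


def classify(
    old_cliques: Iterable[Iterable[str]], new_cliques: Iterable[Iterable[str]]
) -> dict[str, str]:
    """Assign every identifier seen in either run to one category."""
    old_index = index_by_identifier(old_cliques)
    new_index = index_by_identifier(new_cliques)
    categories = {}
    for identifier in old_index.keys() | new_index.keys():
        before = old_index.get(identifier)
        after = new_index.get(identifier)
        if after is None:
            categories[identifier] = "DELETED"    # category named in this thread
        elif before is None:
            categories[identifier] = "ADDED"      # hypothetical category name
        elif before == after:
            categories[identifier] = "UNCHANGED"  # hypothetical category name
        else:
            categories[identifier] = "CHANGED"    # hypothetical category name
    return categories
```

Splitting CHANGED further, for example into "clique grew", "clique shrank", and "clique split", would be one way to separate lost data from over-merging.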

Some highlights:

Also some practical notes:

The full lists of all cluster diffs by identifier are on Hatteras at /scratch/gaurav/babel-outputs/2022may21/babel_outputs/reports/diff_to_2022jan1.

cbizon commented 1 year ago

This is excellent progress. I wonder if we should be versioning the intermediate products. For instance, in the GrossAnatomicalStructure stuff, if we had the previous relationship files, we could see that e.g. the one built off of source X is suddenly smaller...

I would like for us to target having a dev release of this build by the end of August.

gaurav commented 1 year ago

> This is excellent progress. I wonder if we should be versioning the intermediate products. For instance, in the GrossAnatomicalStructure stuff, if we had the previous relationship files, we could see that e.g. the one built off of source X is suddenly smaller...

Implemented in PR https://github.com/TranslatorSRI/Babel/pull/57.