Validating a new Babel run

gaurav commented 2 years ago

I've just completed my first run of Babel on Sterling (on a container with 500GB of memory!) using the changes in draft PR https://github.com/TranslatorSRI/Babel/pull/37. The results I've obtained (on Hatteras at /scratch/gaurav/babel-outputs/2022apr4) have lots of differences from the 2022-01-01 run, but I haven't come up with a good way of summarizing the changes or figuring out whether it's working "correctly".

I've tried using diff/diffstat, but there are tons of changes, so it's not easy to see how significant they are. I tried diffing some files individually, and was able to find a few patterns: for example, the polypeptide LSM-37009 in synonyms/Polypeptide.txt is referred to as CHEBI:125504 in the new run and INCHIKEY:GGLDQJNBYFODOM-RDCMKPLUSA-N in the previous run.

Diffstat comparison of Jan 1 and Apr 4 Babel runs

Probably the best way to compare the changes is by comparing line counts, which shows that most files are pretty similarly sized, except for compendia/ChemicalEntity.txt (which is 1577.58% bigger), compendia/MolecularMixture.txt (58.43% bigger) and synonyms/MolecularMixture.txt (56.51% bigger).

Does anybody have suggestions for comparing/validating the new Babel output before we try to move it to the dev server? We could for instance dump all the IDs alphabetically and run a massive diff on that. Having some method to do this would help with #36 as well.
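To make the "dump all the IDs and diff them" idea concrete, here is a minimal sketch. It assumes each compendium file is JSON Lines with an "identifiers" list whose entries carry the CURIE under the key "i"; that layout is an assumption on my part and the key names would need adjusting if the files differ. The function names are hypothetical.

```python
import json


def collect_identifiers(compendium_path):
    """Return the set of all identifiers found in one compendium file.

    Assumed layout (not confirmed in this thread): one JSON object per line,
    each with an "identifiers" list of objects that store the CURIE under "i".
    """
    identifiers = set()
    with open(compendium_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            clique = json.loads(line)
            for entry in clique.get("identifiers", []):
                identifiers.add(entry["i"])
    return identifiers


def compare_identifier_sets(prev_path, curr_path):
    """Return (only_in_prev, only_in_curr, in_both) as alphabetically sorted lists."""
    prev_ids = collect_identifiers(prev_path)
    curr_ids = collect_identifiers(curr_path)
    return (
        sorted(prev_ids - curr_ids),
        sorted(curr_ids - prev_ids),
        sorted(prev_ids & curr_ids),
    )
```

Holding both identifier sets in memory is fine for the smaller compendia but would be expensive for files the size of Protein.txt or SmallMolecule.txt; for those, writing sorted ID lists to disk and streaming a comparison over them is probably the safer route.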

File	Lines (January 1, 2022)	Lines (April 4, 2022)	Percentage change
reports/chemical_completeness.txt	1	1	0.00%
reports/disease_completeness.txt	60	123	105.00%
reports/taxon_done	1	1	0.00%
reports/process_done	1	1	0.00%
reports/ChemicalEntity.txt	741	732	-1.21%
reports/MolecularMixture.txt	1144	1182	3.32%
reports/gene_done	1	1	0.00%
reports/ChemicalMixture.txt	15	17	13.33%
reports/protein_done	1	1	0.00%
reports/anatomy_done	1	1	0.00%
reports/MolecularActivity.txt	39	41	5.13%
reports/Disease.txt	4154	4174	0.48%
reports/OrganismTaxon.txt	11	11	0.00%
reports/Protein.txt	197	274	39.09%
reports/Cell.txt	42	43	2.38%
reports/genefamily_done	1	1	0.00%
reports/CellularComponent.txt	36	40	11.11%
reports/process_completeness.txt	3	1	-66.67%
reports/ComplexMolecularMixture.txt	18	15	-16.67%
reports/taxon_completeness.txt	1	1	0.00%
reports/anatomy_completeness.txt	1	1	0.00%
reports/PhenotypicFeature.txt	603	626	3.81%
reports/GrossAnatomicalStructure.txt	43	45	4.65%
reports/Polypeptide.txt	20	22	10.00%
reports/BiologicalProcess.txt	15	14	-6.67%
reports/disease_done	1	1	0.00%
reports/gene_completeness.txt	1	1	0.00%
reports/Pathway.txt	11	11	0.00%
reports/genefamily_completeness.txt	1	1	0.00%
reports/AnatomicalEntity.txt	52	62	19.23%
reports/SmallMolecule.txt	4384	4432	1.09%
reports/protein_completeness.txt	1	1	0.00%
reports/chemicals_done	1	1	0.00%
reports/GeneFamily.txt	9	9	0.00%
reports/Gene.txt	45	47	4.44%
compendia/ChemicalEntity.txt	392499	6584478	1577.58%
compendia/MolecularMixture.txt	6334558	10035657	58.43%
compendia/ChemicalMixture.txt	475	482	1.47%
compendia/MolecularActivity.txt	145925	149030	2.13%
compendia/Disease.txt	322229	332754	3.27%
compendia/OrganismTaxon.txt	2375027	2412122	1.56%
compendia/Protein.txt	223676217	232834484	4.09%
compendia/Cell.txt	7678	8210	6.93%
compendia/CellularComponent.txt	12510	12623	0.90%
compendia/ComplexMolecularMixture.txt	165	169	2.42%
compendia/PhenotypicFeature.txt	355408	345793	-2.71%
compendia/GrossAnatomicalStructure.txt	10379	10238	-1.36%
compendia/Polypeptide.txt	408	409	0.25%
compendia/BiologicalProcess.txt	27790	27714	-0.27%
compendia/Pathway.txt	52370	52452	0.16%
compendia/AnatomicalEntity.txt	142269	143562	0.91%
compendia/SmallMolecule.txt	104226454	100590131	-3.49%
compendia/GeneFamily.txt	27892	27770	-0.44%
compendia/Gene.txt	37802616	40108195	6.10%
synonyms/ChemicalEntity.txt	698121	732464	4.92%
synonyms/MolecularMixture.txt	6555269	10259687	56.51%
synonyms/ChemicalMixture.txt	1856	1870	0.75%
synonyms/MolecularActivity.txt	195416	198534	1.60%
synonyms/Disease.txt	1189024	1219429	2.56%
synonyms/OrganismTaxon.txt	69926	69993	0.10%
synonyms/Protein.txt	1421157	1451959	2.17%
synonyms/Cell.txt	20674	23034	11.42%
synonyms/CellularComponent.txt	28577	29027	1.57%
synonyms/ComplexMolecularMixture.txt	878	890	1.37%
synonyms/PhenotypicFeature.txt	858920	855136	-0.44%
synonyms/GrossAnatomicalStructure.txt	52860	52553	-0.58%
synonyms/Polypeptide.txt	1641	1628	-0.79%
synonyms/BiologicalProcess.txt	112432	112364	-0.06%
synonyms/Pathway.txt	54941	55021	0.15%
synonyms/AnatomicalEntity.txt	309911	315239	1.72%
synonyms/SmallMolecule.txt	108292041	107313775	-0.90%
synonyms/GeneFamily.txt	27892	27770	-0.44%
synonyms/Gene.txt	497027	564344	13.54%
conflation/GeneProtein.txt	8168582	8692887	6.42%
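
The percentage changes in this table are just (new - old) / old * 100 on line counts. A minimal sketch of how a table like this could be regenerated for any two runs follows; the directory layout (reports/, compendia/, synonyms/, conflation/ under each run directory) is taken from the file names above, while the function names are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count newline characters in a file (like `wc -l`), reading in chunks."""
    count = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            count += chunk.count(b"\n")
    return count


def percentage_change(prev, curr):
    """Relative change from prev to curr in percent, or None if prev is zero."""
    if prev == 0:
        return None
    return (curr - prev) / prev * 100


def compare_line_counts(prev_dir, curr_dir,
                        subdirs=("reports", "compendia", "synonyms", "conflation")):
    """Return (file, prev_lines, curr_lines, pct_change) for files present in both runs."""
    prev_dir, curr_dir = Path(prev_dir), Path(curr_dir)
    rows = []
    for subdir in subdirs:
        if not (prev_dir / subdir).is_dir():
            continue
        for prev_file in sorted((prev_dir / subdir).iterdir()):
            curr_file = curr_dir / subdir / prev_file.name
            if not (prev_file.is_file() and curr_file.is_file()):
                continue  # files missing from either run are skipped
            prev_n, curr_n = count_lines(prev_file), count_lines(curr_file)
            rows.append((f"{subdir}/{prev_file.name}", prev_n, curr_n,
                         percentage_change(prev_n, curr_n)))
    return rows
```

Sorting the returned rows by the absolute value of the last column would put files like compendia/ChemicalEntity.txt at the top.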

cbizon commented 2 years ago

Yes, this is a question I have struggled with.

Looking at the sizes of the compendia themselves is a good way to get some idea of what's going on. The big changes in ChemicalEntity and MolecularMixture are both concerning. To me, that suggests that in both cases something is no longer merging, so we end up with many smaller cliques rather than fewer larger ones. Of course, the new way may be better; it will take a bit of digging to find out.

The stuff in reports is meant as a partial answer to this problem.

The reports named *_done are not interesting; they are just artifacts that let snakemake know that the process has completed. They should probably be written somewhere else, or maybe they can be dispensed with entirely with a bit of thought.

The _completeness files are there to check for a particular kind of error: making sure that we're not losing identifiers. In the past we have sometimes had input identifiers that didn't end up in any clique, so these files list the identifiers that didn't end up in cliques. Ideally, each of them contains just the header, and you don't want them to increase in size. We have had a long-running issue in disease: https://github.com/TranslatorSRI/NodeNormalization/issues/64 . It looks like that one got worse; if it has the same cause then so be it, but maybe something else has changed.
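
That check is easy to automate. Here is a minimal sketch that counts the non-header lines in each *_completeness.txt file and flags the ones that grew between runs. It assumes the first line of each file is the header, as described above; the function names are hypothetical.

```python
from pathlib import Path


def unclustered_counts(reports_dir):
    """Map each *_completeness.txt file name to its number of non-header lines."""
    counts = {}
    for path in sorted(Path(reports_dir).glob("*_completeness.txt")):
        with open(path, encoding="utf-8") as f:
            lines = [line for line in f if line.strip()]
        # The first non-blank line is assumed to be the header.
        counts[path.name] = max(len(lines) - 1, 0)
    return counts


def completeness_regressions(prev_reports, curr_reports):
    """Return {file name: (prev_count, curr_count)} for files that got longer."""
    prev = unclustered_counts(prev_reports)
    curr = unclustered_counts(curr_reports)
    return {
        name: (prev.get(name, 0), count)
        for name, count in curr.items()
        if count > prev.get(name, 0)
    }
```

An empty result means no completeness file grew; with the line counts from the table above, only reports/disease_completeness.txt (60 to 123 lines) would be flagged.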

The most useful (?) files are the ones called e.g. reports/MolecularMixture.txt. First you get a cluster size distribution. It's often good to review the biggest clusters and see if anything can be done about them. Usually, though, if there's no bulge at the end of the distribution, there's not much that can be done. The other useful thing is the prefix list. So for instance in reports/MolecularMixture.txt, you see this:

frozenset({('UNII', 2), ('CAS', 1), ('PUBCHEM.COMPOUND', 2), ('INCHIKEY', 1)})  1
frozenset({('MESH', 1), ('PUBCHEM.COMPOUND', 1), ('UNII', 2), ('CHEMBL.COMPOUND', 1), ('UMLS', 3), ('INCHIKEY', 1)})    1
frozenset({('MESH', 1), ('PUBCHEM.COMPOUND', 1), ('HMDB', 1), ('UNII', 1), ('CHEBI', 1), ('CHEMBL.COMPOUND', 1), ('INCHIKEY', 1)})      1
...
frozenset({('INCHIKEY', 1), ('PUBCHEM.COMPOUND', 1), ('CHEMBL.COMPOUND', 1)})   87120
frozenset({('CAS', 1), ('PUBCHEM.COMPOUND', 1), ('INCHIKEY', 1)})       196331
frozenset({('INCHIKEY', 1), ('PUBCHEM.COMPOUND', 1)})   5863401

So the most common clique/cluster consists of a pubchem.compound id and an inchikey, and there are about 5.86 million like that. The next biggest group is that plus a CAS, with almost 200k, and so on down to clusters that only happen once, like some compound that has a mesh, a pubchem, an hmdb, a unii, a chebi, a chembl, and an inchikey.
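
To illustrate what each of those lines encodes, here is a minimal sketch of turning a clique into its prefix signature and tallying signatures. This is not Babel's actual reporting code, just an approximation of the idea; the function names and the example CURIEs in the usage note are made up.

```python
from collections import Counter


def prefix_signature(identifiers):
    """Return a frozenset of (prefix, count) pairs for one clique of CURIEs."""
    prefixes = Counter(curie.split(":", 1)[0] for curie in identifiers)
    return frozenset(prefixes.items())


def signature_distribution(cliques):
    """Count how many cliques share each prefix signature."""
    return Counter(prefix_signature(clique) for clique in cliques)
```

For example, a clique like ["PUBCHEM.COMPOUND:1", "INCHIKEY:X"] maps to the signature {('PUBCHEM.COMPOUND', 1), ('INCHIKEY', 1)}, and signature_distribution(...).most_common() lists the signatures from most to least frequent.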

So to figure out what's going on with the big % change in the compendia, I would look at the differences in these reports and see what is getting merged/split differently. With those big % changes I think they'll be easy to find? And then it's a matter of digging around and looking at examples and seeing if it's an improvement or not.
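
One way to look at those differences is to parse the prefix lists from both runs and rank the signatures by how much their counts moved. A minimal sketch, assuming the report lines look exactly like the frozenset(...) lines quoted above (a frozenset repr followed by whitespace and a count); everything else in the report is skipped, and the function names are hypothetical.

```python
import ast
from collections import Counter


def parse_prefix_report(lines):
    """Parse 'frozenset({...})  N' lines into a Counter keyed by signature."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("frozenset("):
            continue  # cluster size distribution and any other lines
        body, _, count = line.rpartition(")")
        count = count.strip()
        if not count.isdigit():
            continue
        inner = body[len("frozenset("):]
        # The inner text is a set literal of (prefix, count) tuples.
        signature = frozenset(ast.literal_eval(inner)) if inner else frozenset()
        counts[signature] += int(count)
    return counts


def biggest_changes(prev_counts, curr_counts, top=10):
    """Return up to `top` (signature, delta) pairs, largest absolute change first."""
    signatures = set(prev_counts) | set(curr_counts)
    deltas = [
        (sig, curr_counts.get(sig, 0) - prev_counts.get(sig, 0))
        for sig in signatures
    ]
    changed = [item for item in deltas if item[1] != 0]
    return sorted(changed, key=lambda item: abs(item[1]), reverse=True)[:top]
```

Calling parse_prefix_report on an open file handle for each run's report and passing both results to biggest_changes gives a short list of signatures to start digging into.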

I've made half-hearted attempts to directly compare two compendium runs to see what changed, but have never found anything I liked very much. Happy to try other ideas though.

gaurav commented 1 year ago

I've gotten some working diff-Babel-runs code that took around 9.5 hours (and over 500GB of memory!) to run a comparison on the entire Babel run between Jan 1 and May 21, 2022.

This comparison works by making a list of all the identifiers used in either the previous version of the compendium or the current version of the compendium -- for example, OrganismTaxon.txt has 2,494,604 such identifiers. For each identifier, it then collects all the cliques/clusters containing it from the previous compendium and the current compendium, and assigns the identifier to one of a few general categories (detailed in https://github.com/gaurav/babel-validation/pull/5) -- for example, DELETED means that the identifier was present in one or more clusters/cliques in the previous compendium and is not present in the new compendium. My plan is to come up with additional ways of grouping identifiers based on measures that help determine whether we've lost data or started merging clusters/cliques too aggressively.
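
A simplified sketch of this kind of per-identifier classification is below. It is not the code from the PR: only DELETED is a category described above, while ADDED, UNCHANGED and CHANGED are placeholder names, and cliques are treated as plain lists of identifiers with labels and information content ignored.

```python
from collections import Counter, defaultdict


def cliques_by_identifier(cliques):
    """Map each identifier to the set of cliques (as frozensets) that contain it."""
    index = defaultdict(set)
    for clique in cliques:
        members = frozenset(clique)
        for identifier in members:
            index[identifier].add(members)
    return index


def classify_identifiers(prev_cliques, curr_cliques):
    """Assign every identifier from either run to one category."""
    prev_index = cliques_by_identifier(prev_cliques)
    curr_index = cliques_by_identifier(curr_cliques)
    categories = {}
    for identifier in prev_index.keys() | curr_index.keys():
        prev = prev_index.get(identifier)
        curr = curr_index.get(identifier)
        if prev is None:
            categories[identifier] = "ADDED"      # placeholder name
        elif curr is None:
            categories[identifier] = "DELETED"    # as described above
        elif prev == curr:
            categories[identifier] = "UNCHANGED"  # placeholder name
        else:
            categories[identifier] = "CHANGED"    # placeholder name
    return categories


def summarize(categories):
    """Count identifiers per category."""
    return Counter(categories.values())
```

Because labels are ignored here, a clique whose membership is identical but whose labels changed would come out as UNCHANGED in this sketch.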

Some highlights:

All but seven compendia have over 80% unchanged identifiers, which is what we'd expect to see. In some of the exceptions, this seems to be because there are a lot of new identifiers (ChemicalEntity.txt has 12,449,396, or 95.97% of the identifiers from the previous and current compendia combined, and Polypeptide.txt has 362, or 35.84% of the identifier union).
SmallMolecule.txt (55.68%), MolecularMixture.txt (37.35%), Polypeptide.txt (14.06%), ComplexMolecularMixture.txt (7.81%) and Protein.txt (5.95%) all have high percentages of "changed with identical identifiers", i.e. where an identifier's clique contains the same set of identifiers in both the previous and current compendia. Unless I've got a bug in that part of my code, this should only happen when the labels on identifiers change, which is likely just a source database being improved.
Wherever identifiers have been deleted, they make up less than 6% of the combined identifier count, except for GrossAnatomicalStructure.txt (1630 or 10.91%). There's no obvious reason why so many identifiers have been deleted here -- they point to a variety of sources (mostly NCIT, some UMLS, MESH, UBERON) and the labels for these identifiers appear to be correct (see /scratch/gaurav/babel-outputs/2022may21/babel_outputs/reports/diff_to_2022jan1/GrossAnatomicalStructure.txt for all the changes grouped by change type), so this looks to me like we're losing information for some reason. I wonder if maybe Uberon or another data source removed lots of cross-references to NCIT and some other data sources a little overenthusiastically.
The following compendia have over 5% of combined identifiers changed for no clear reason: Disease.txt (9.18%), ComplexMolecularMixture.txt (5.71%), GrossAnatomicalStructure.txt (13.02%), Cell.txt (5.57%), MolecularActivity.txt (4.28%) and Gene.txt (3.84%). This is probably too many to look through manually, so I'll try to find patterns of interest instead -- for example, some identifiers (such as ORPHANET:2055) are being moved out of a clique of 6 IDs into a clique of their own, which seems... not great. I'll make a separate category for some of those patterns and then will rerun my diffing tool.

Also some practical notes:

OrganismTaxon.txt (2,494,604 identifiers between the two runs) and smaller files run pretty quickly (< 1 hour, I think), but the larger files take up most of the remaining 8.5 hours.
We don't currently report compendia not present in both runs, which isn't an issue at the moment, but will be in the future once we integrate unused UMLS identifiers into Babel.
This currently reports "Changed" when the "information content" (IC) of a cluster/clique changes, but doesn't do any additional grouping based on that.
This doesn't include any kind of validation on either the conflation or the synonyms, but those might be easier to validate by just comparing the size of the files as I did in a previous comment? I'll look into this in more detail.

The full lists of all cluster diffs by identifier are on Hatteras at /scratch/gaurav/babel-outputs/2022may21/babel_outputs/reports/diff_to_2022jan1.

cbizon commented 1 year ago

This is excellent progress. I wonder if we should be versioning the intermediate products. For instance, in the GrossAnatomicalStructure stuff, if we had the previous relationship files, we could see that e.g. the one built off of source X is suddenly smaller...

I would like for us to target having a dev release of this build by the end of August.

gaurav commented 1 year ago

> This is excellent progress. I wonder if we should be versioning the intermediate products. For instance, in the GrossAnatomicalStructure stuff, if we had the previous relationship files, we could see that e.g. the one built off of source X is suddenly smaller...

Implemented in PR https://github.com/TranslatorSRI/Babel/pull/57.

Validating a new Babel run #42