Open gaurav opened 2 years ago
Yes, this is a question I have struggled with.
Looking at the sizes of the compendia themselves is a good way to get some idea of what's going on. The big changes in ChemicalEntity and MolecularMixture are both concerning. To me, that suggests that both are failing to merge on something, so we have more, smaller cliques rather than fewer, larger ones. And of course, the new way may be better; it will require a bit of digging into.
The stuff in reports is meant as a partial answer to this problem.
The reports that are *done are not interesting, they are just artifacts for snakemake to know that the process has completed. Probably they should be written into another place? Maybe they can be dispensed with entirely with a bit of thought.
The _completeness files are there to check for a particular kind of error: making sure that we're not losing identifiers. In the past we have sometimes had input identifiers that didn't end up in any cliques, so these files list the identifiers that didn't end up in cliques. Ideally, each of these files contains only the header, and you don't want them to increase in size. We have had a long-running issue in disease: https://github.com/TranslatorSRI/NodeNormalization/issues/64 . It looks like that one got worse; if it's the same cause then so be it, but maybe something else has changed.
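A minimal sketch of that completeness check, assuming each _completeness file is plain text with one header line followed by one unassigned identifier per line. The function names here are hypothetical and not part of Babel.

```python
def count_unassigned(path):
    """Count the identifiers listed in a completeness report.

    Assumption: the first non-blank line is a header and every later
    non-blank line names one identifier that ended up in no clique.
    """
    with open(path, encoding="utf-8") as handle:
        rows = [line for line in handle if line.strip()]
    return max(len(rows) - 1, 0)


def has_grown(previous_path, current_path):
    """Return True if the current report lists more identifiers than the previous one."""
    return count_unassigned(current_path) > count_unassigned(previous_path)
```

Running `has_grown` on the same report from two runs (for example the disease completeness file) gives a quick signal for whether more identifiers are being lost than before.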
The most useful (?) files are the ones called e.g. reports/MolecularMixture.txt. First you get a cluster size distribution. It's often good to review the biggest clusters and see if anything can be done about them. Usually, though, if there's no bulge at the end of the distribution, there's not much that can be done. The other useful thing is the prefix list. So for instance in reports/MolecularMixture.txt, you see this:
frozenset({('UNII', 2), ('CAS', 1), ('PUBCHEM.COMPOUND', 2), ('INCHIKEY', 1)}) 1
frozenset({('MESH', 1), ('PUBCHEM.COMPOUND', 1), ('UNII', 2), ('CHEMBL.COMPOUND', 1), ('UMLS', 3), ('INCHIKEY', 1)}) 1
frozenset({('MESH', 1), ('PUBCHEM.COMPOUND', 1), ('HMDB', 1), ('UNII', 1), ('CHEBI', 1), ('CHEMBL.COMPOUND', 1), ('INCHIKEY', 1)}) 1
...
frozenset({('INCHIKEY', 1), ('PUBCHEM.COMPOUND', 1), ('CHEMBL.COMPOUND', 1)}) 87120
frozenset({('CAS', 1), ('PUBCHEM.COMPOUND', 1), ('INCHIKEY', 1)}) 196331
frozenset({('INCHIKEY', 1), ('PUBCHEM.COMPOUND', 1)}) 5863401
So the most common clique/cluster consists of a pubchem.compound id and an inchikey, and there are 5.8 M like that. The next biggest group is that plus a CAS, with almost 200k, and so on down to clusters that only happen once, like some compound that has a mesh, a pubchem, an hmdb, a unii, a chebi, a chembl, and an inchikey.
So to figure out what's going on with the big % change in the compendia, I would look at the differences in these reports and see what is getting merged/split differently. With those big % changes I think they'll be easy to find? And then it's a matter of digging around and looking at examples and seeing if it's an improvement or not.
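One way to look at the differences in these reports is to parse the prefix-list rows from two runs and rank the prefix combinations by how much their counts moved. This is a rough sketch, assuming the rows look exactly like the `frozenset({...}) N` lines quoted above; `parse_prefix_counts` and `biggest_shifts` are hypothetical helper names.

```python
import ast
from collections import Counter


def parse_prefix_counts(lines):
    """Map each prefix combination (a frozenset of (prefix, n) pairs) to its clique count."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("frozenset("):
            continue  # skip the cluster size distribution and any other rows
        combo_text, _, count_text = line.rpartition(" ")
        # "frozenset({...})" -> "{...}", which ast.literal_eval reads as a set literal
        inner = combo_text[len("frozenset("):-1]
        combo = frozenset(ast.literal_eval(inner)) if inner else frozenset()
        counts[combo] += int(count_text)
    return counts


def biggest_shifts(previous, current, top=10):
    """Return (combination, delta) pairs with the largest absolute change in count."""
    deltas = [
        (combo, current.get(combo, 0) - previous.get(combo, 0))
        for combo in set(previous) | set(current)
    ]
    changed = [item for item in deltas if item[1] != 0]
    return sorted(changed, key=lambda item: abs(item[1]), reverse=True)[:top]
```

A combination whose count drops sharply while a subset of it rises by a similar amount would be a hint that cliques are being split rather than lost.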
I've made half-hearted attempts to directly compare two compendium runs to see what changed, but have never found anything I liked very much. Happy to try other ideas though.
I've gotten some working diff-Babel-runs code that took around 9.5 hours (and over 500GB of memory!) to run a comparison on the entire Babel run between Jan 1 and May 21, 2022.
This comparison works by making a list of all the identifiers used in either the previous version of the compendium or the current version of the compendium -- for example, OrganismTaxon.txt has 2,494,604 such identifiers. For each identifier, it then collects all the cliques/clusters from the previous compendium and the current compendium, and assigns each identifier to one of a few general categories (detailed in https://github.com/gaurav/babel-validation/pull/5) -- for example, `DELETED` means that the identifier was present in one or more clusters/cliques in the previous compendium and not present in the new compendium. My plan is to come up with additional ways of clustering identifiers based on measures that are useful to determine if we've lost data or started overly grouping clusters/cliques.
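The per-identifier comparison described above can be sketched roughly as follows. This is not the code from the linked PR: it assumes each compendium has already been loaded as a list of cliques, each clique being a collection of identifier strings. Apart from `DELETED`, the category names (`ADDED`, `UNCHANGED`, `CHANGED`) are placeholders chosen for illustration.

```python
def index_cliques(cliques):
    """Map each identifier to the list of cliques (as frozensets) that contain it."""
    index = {}
    for clique in cliques:
        members = frozenset(clique)
        for identifier in members:
            index.setdefault(identifier, []).append(members)
    return index


def classify(previous_cliques, current_cliques):
    """Assign every identifier seen in either run to a coarse change category."""
    prev = index_cliques(previous_cliques)
    curr = index_cliques(current_cliques)
    categories = {}
    for identifier in set(prev) | set(curr):
        if identifier not in curr:
            categories[identifier] = "DELETED"    # only in the previous run
        elif identifier not in prev:
            categories[identifier] = "ADDED"      # only in the current run
        elif set(prev[identifier]) == set(curr[identifier]):
            categories[identifier] = "UNCHANGED"  # same clique membership in both
        else:
            categories[identifier] = "CHANGED"    # cliques merged, split or regrouped
    return categories
```

Holding both indexes in memory at once is the simplest design, which is consistent with the large memory footprint mentioned above; a streaming comparison over sorted identifier dumps would trade memory for extra passes over the files.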
Some highlights:
- `ChemicalEntity.txt`: 12,449,396 or 95.97% of the identifiers from the previous and current compendia, and `Polypeptide.txt` has 362 or 35.84% of the identifier union.
- `SmallMolecule.txt` (55.68%), `MolecularMixture.txt` (37.35%), `Polypeptide.txt` (14.06%), `ComplexMolecularMixture.txt` (7.81%) and `Protein.txt` (5.95%) all have high percentages of "changed with identical identifiers", i.e. where an identifier's clique has the same set of identifiers in both the previous and current compendia. Unless I've got a bug in that part of my code, this should only happen when the labels on identifiers change, which is likely just a source database being improved.
- `GrossAnatomicalStructure.txt` (1630 or 10.91%). There's no obvious reason why so many identifiers have been deleted here -- they point to a variety of sources (mostly NCIT, some UMLS, MESH, UBERON) and the labels for these identifiers appear to be correct (see `/scratch/gaurav/babel-outputs/2022may21/babel_outputs/reports/diff_to_2022jan1/GrossAnatomicalStructure.txt` for all the changes grouped by change type), so this looks to me like we're losing information for some reason. I wonder if maybe Uberon or another data source removed lots of cross-references to NCIT and some other data sources a little overenthusiastically.
- `Disease.txt` (9.18%), `ComplexMolecularMixture.txt` (5.71%), `GrossAnatomicalStructure.txt` (13.02%), `Cell.txt` (5.57%), `MolecularActivity.txt` (4.28%) and `Gene.txt` (3.84%). This is probably too many to look through manually, so I'll try to find patterns of interest programmatically -- for example, some identifiers (such as `ORPHANET:2055`) are being moved out of a clique of 6 IDs into a clique of their own, which seems... not great. I'll make a separate category for some of those patterns and then rerun my diffing tool.

Also some practical notes:

- The full lists of all cluster diffs by identifier are on Hatteras at `/scratch/gaurav/babel-outputs/2022may21/babel_outputs/reports/diff_to_2022jan1`.
This is excellent progress. I wonder if we should be versioning the intermediate products. For instance, in the GrossAnatomicalStructure stuff, if we had the previous relationship files, we could see that e.g. the one built off of source X is suddenly smaller...
I would like for us to target having a dev release of this build by the end of August.
> This is excellent progress. I wonder if we should be versioning the intermediate products. For instance, in the GrossAnatomicalStructure stuff, if we had the previous relationship files, we could see that e.g. the one built off of source X is suddenly smaller...
Implemented in PR https://github.com/TranslatorSRI/Babel/pull/57.
I've just completed my first run of Babel on Sterling (on a container with 500GB of memory!) using the changes in draft PR https://github.com/TranslatorSRI/Babel/pull/37. The results I've obtained (on Hatteras at `/scratch/gaurav/babel-outputs/2022apr4`) have lots of differences from the 2022-01-01 run, but I haven't come up with a good way of summarizing the changes or figuring out if it's working "correctly".

I've tried using diff/diffstat, but there are tons of changes, so it's not easy to see how significant the changes are. I tried diffing some files individually, and was able to find a few patterns: for example, the polypeptide `LSM-37009` in `synonyms/Polypeptide.txt` is referred to as CHEBI:125504 in the new run and INCHIKEY:GGLDQJNBYFODOM-RDCMKPLUSA-N in the previous run.

Diffstat comparison of Jan 1 and Apr 4 Babel runs
```
compendia/AnatomicalEntity.txt | 284873
compendia/BiologicalProcess.txt | 55258
compendia/Cell.txt | 15690
compendia/CellularComponent.txt | 24855
compendia/ChemicalEntity.txt | 6976071
compendia/ChemicalMixture.txt | 889
compendia/ComplexMolecularMixture.txt | 296
compendia/Disease.txt | 654029
compendia/Gene.txt | 77898179 ++---
compendia/GeneFamily.txt | 55418
compendia/GrossAnatomicalStructure.txt | 20397
compendia/MolecularActivity.txt | 294143
compendia/MolecularMixture.txt | 16366879 -
compendia/OrganismTaxon.txt | 4783919
compendia/Pathway.txt | 104290
compendia/PhenotypicFeature.txt | 700283
compendia/Polypeptide.txt | 753
compendia/Protein.txt | 456484451 ++++++++++++++++-----------------
compendia/SmallMolecule.txt | 204804339 +++++++--------
conflation/GeneProtein.txt | 16857753 -
reports/AnatomicalEntity.txt | 100
reports/BiologicalProcess.txt | 17
reports/Cell.txt | 73
reports/CellularComponent.txt | 60
reports/ChemicalEntity.txt | 1451
reports/ChemicalMixture.txt | 20
reports/ComplexMolecularMixture.txt | 23
reports/Disease.txt | 8278
reports/Gene.txt | 72
reports/GeneFamily.txt | 8
reports/GrossAnatomicalStructure.txt | 80
reports/MolecularActivity.txt | 70
reports/MolecularMixture.txt | 2310
reports/OrganismTaxon.txt | 12
reports/Pathway.txt | 8
reports/PhenotypicFeature.txt | 1175
reports/Polypeptide.txt | 30
reports/Protein.txt | 445
reports/SmallMolecule.txt | 8790
reports/disease_completeness.txt | 69
reports/process_completeness.txt | 4
synonyms/AnatomicalEntity.txt | 624380
synonyms/BiologicalProcess.txt | 224432
synonyms/Cell.txt | 43534
synonyms/CellularComponent.txt | 57426
synonyms/ChemicalEntity.txt | 1428975
synonyms/ChemicalMixture.txt | 3566
synonyms/ComplexMolecularMixture.txt | 1672
synonyms/Disease.txt | 2407347
synonyms/Gene.txt | 1060645
synonyms/GeneFamily.txt | 55418
synonyms/GrossAnatomicalStructure.txt | 105021
synonyms/MolecularActivity.txt | 393102
synonyms/MolecularMixture.txt | 16811698 -
synonyms/OrganismTaxon.txt | 139483
synonyms/Pathway.txt | 109508
synonyms/PhenotypicFeature.txt | 1712740
synonyms/Polypeptide.txt | 3133
synonyms/Protein.txt | 2871694
synonyms/SmallMolecule.txt | 215593042 +++++++---------
60 files changed, 525618409 insertions(+), 504434267 deletions(-)
```

Probably the best way to compare the changes is by comparing line counts, which shows that most files are pretty similarly sized, except for `compendia/ChemicalEntity.txt` (which is 1577.58% bigger), `compendia/MolecularMixture.txt` (58.43% bigger) and `synonyms/MolecularMixture.txt` (56.51% bigger).

Does anybody have suggestions for comparing/validating the new Babel output before we try to move it to the dev server? We could for instance dump all the IDs alphabetically and run a massive diff on that. Having some method to do this would help with #36 as well.
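For reference, the line-count comparison mentioned above could be scripted along these lines. This is only a sketch: it assumes both runs are available as directories with the same relative layout (e.g. `compendia/ChemicalEntity.txt` under each), and the function names are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def percent_change(previous_count, current_count):
    """Percentage growth from previous to current; None if the previous file was empty."""
    if previous_count == 0:
        return None
    return (current_count - previous_count) * 100.0 / previous_count


def compare_line_counts(previous_dir, current_dir, relative_paths):
    """Return (path, previous, current, percent) rows, largest absolute change first."""
    rows = []
    for rel in relative_paths:
        prev_n = count_lines(Path(previous_dir) / rel)
        curr_n = count_lines(Path(current_dir) / rel)
        rows.append((rel, prev_n, curr_n, percent_change(prev_n, curr_n)))
    return sorted(
        rows,
        key=lambda row: abs(row[3]) if row[3] is not None else float("inf"),
        reverse=True,
    )
```

Sorting by absolute percentage change puts the files that deserve a closer look at the top, with files that were previously empty listed first.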