ersilia-os / zaira-chem

Automated QSAR based on multiple small molecule descriptors
GNU General Public License v3.0

Distilled ZairaChem models in ONNX format #32

Open miquelduranfrigola opened 9 months ago

miquelduranfrigola commented 9 months ago

Motivation

ZairaChem models are large and will always be large, since ZairaChem uses an ensemble-based approach. Nonetheless, we would like to offer the option to distill ZairaChem models for easier deployment, especially for online inference. We would like to do this in an interoperable format such as ONNX.

The Olinda package

Our colleague @leoank already contributed a fantastic package named Olinda that we could, in principle, use for this purpose. Olinda takes an arbitrary model (in this case, a ZairaChem model) and produces a much simpler model, stored in ONNX format. Olinda uses a reference library for the teacher/student training and is nicely coupled with other tools that @leoank developed, such as ChemXOR for privacy-preserving AI and the Ersilia Compound Embedding, which provides dense 1024-dimensional embeddings.
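For context, serving a distilled ONNX model for online inference is lightweight with onnxruntime. A minimal sketch is below; the file name and the 1024-dimensional descriptor batch are placeholders, not the actual Olinda output:

```python
import numpy as np
import onnxruntime as ort

# Load the distilled student model (file name is a placeholder).
session = ort.InferenceSession("zairachem_distilled.onnx")

# Inspect the input the exported graph expects.
input_meta = session.get_inputs()[0]
print(input_meta.name, input_meta.shape)

# Dummy descriptor batch; in practice this could be, for example,
# Ersilia Compound Embedding vectors (assumed 1024-dimensional here).
X = np.random.rand(8, 1024).astype(np.float32)

# Run inference; outputs follow the order defined in the ONNX graph.
predictions = session.run(None, {input_meta.name: X})[0]
print(predictions)
```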

Roadmap

GemmaTuron commented 4 months ago

We will start by testing Olinda again @JHlozek (see: https://github.com/ersilia-os/olinda/issues/3)

JHlozek commented 3 months ago

I've been working on this and currently have Olinda installed in the ZairaChem environment (which required some dependency conflict resolution, as usual). I have a version of the code that can invoke the ZairaChem pipeline and collect its output to complete the distillation, so in principle this works.

There are still many improvements to work on next, including:

miquelduranfrigola commented 3 months ago

Thanks @JHlozek this is great.

JHlozek commented 2 months ago

Olinda updates: the ZairaChem distillation process runs successfully, now with pre-calculated descriptors too. This covers the points above: suppressing the extensive output produced by ZairaChem and merging the model training set with the pre-calculated reference descriptors.

As a test, I trained a model on H3D data up to June 2023, including 1k pre-calculated reference descriptors, and then predicted prospective compounds from the next 6 months. Below are a scatter plot comparing the distilled and ZairaChem model predictions and a ROC curve for the distilled model on the prospective data.

[Image: scatter plot of distilled model predictions vs. ZairaChem predictions]

[Image: ROC curve for the distilled model on prospective data]
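For reference, a minimal sketch of how such a comparison could be generated from the two sets of prediction scores; the file name and column names are assumptions, not the actual pipeline output:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_curve, roc_auc_score

# Assumed input: a CSV with ZairaChem scores, distilled-model scores and
# the observed prospective outcome (file and column names are illustrative).
df = pd.read_csv("prospective_predictions.csv")

# Scatter plot: teacher (ZairaChem) vs. student (distilled) scores.
plt.figure()
plt.scatter(df["zairachem_score"], df["distilled_score"], s=10, alpha=0.5)
plt.xlabel("ZairaChem prediction")
plt.ylabel("Distilled model prediction")
plt.savefig("scatter_teacher_vs_student.png")

# ROC curve of the distilled model against prospective experimental labels.
fpr, tpr, _ = roc_curve(df["label"], df["distilled_score"])
auc = roc_auc_score(df["label"], df["distilled_score"])
plt.figure()
plt.plot(fpr, tpr, label=f"Distilled model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc_distilled.png")
```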

To facilitate testing, I have written code that will prepare a folder of pre-calculated descriptors for 1k molecules, which can be run in the first cell of the demo notebook. For testing, perform the following steps:

I suggest testing this and then closing #3 to keep the conversation centralized here. I'll post next steps following this.

JHlozek commented 2 months ago

Next steps:

miquelduranfrigola commented 2 months ago

This is very interesting and promising, @JHlozek! There seems to be a tendency towards false negatives (upper-left triangle in your plot). This is interesting and can hopefully be ameliorated by (a) more data and/or (b) including the training set. Great progress here, exciting!

GemmaTuron commented 2 months ago

Summary of the weekly meeting: the distilled models look good, but there seems to be a bit of underfitting as we add external data, so we need to make the ONNX model a bit more complex. In addition, we will look for data to validate the generalizability of the model: from H3D data (@JHlozek) and ChEMBL (@GemmaTuron).

GemmaTuron commented 2 months ago

Hi @JHlozek

I have a dataset that contains IC50 data for P. falciparum: over 17K molecules with Active (1) and Inactive (0) defined at two cut-offs (hc = high cut-off, 2.5 uM; lc = low cut-off, 10 uM). The data are curated from ChEMBL and are all public. I do not have the strain (it is a pool), but we can assume most of it comes from sensitive strains, likely NF54. Let me know if these are useful!

pfalciparum_IC50_hc.csv

pfalciparum_IC50_lc.csv
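For clarity, the two cut-offs amount to a simple binarization of the IC50 values. A minimal sketch, assuming Active (1) means an IC50 at or below the cut-off and assuming hypothetical column names (`smiles`, `ic50_um`) and a hypothetical source file; the actual headers in the CSVs may differ:

```python
import pandas as pd

# Hypothetical raw export; file and column names are assumptions.
df = pd.read_csv("pfalciparum_ic50_raw.csv")

# Active (1) if the IC50 is at or below the cut-off, Inactive (0) otherwise.
df["activity_hc"] = (df["ic50_um"] <= 2.5).astype(int)   # high cut-off, 2.5 uM
df["activity_lc"] = (df["ic50_um"] <= 10.0).astype(int)  # low cut-off, 10 uM

df[["smiles", "activity_hc"]].to_csv("pfalciparum_IC50_hc.csv", index=False)
df[["smiles", "activity_lc"]].to_csv("pfalciparum_IC50_lc.csv", index=False)
```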

miquelduranfrigola commented 2 months ago

This looks pretty good @GemmaTuron - many thanks.

JHlozek commented 1 month ago

Some updates for Olinda that we spoke about yesterday. I have been working on improving the speed of the Olinda pipeline by addressing the list above of steps that need to be run at runtime. I am concurrently writing the script that converts a given reference list of SMILES into the expected directory structure.

Overall, the pipeline has gone from more than 10 hours for 50k reference molecules to about 45 minutes. Half an hour of this is still due to the TabPFN step, which we may want to discuss addressing further in the future.

Next, I am working on implementing sample weights to weight the original model's training set higher than the general reference SMILES.

miquelduranfrigola commented 1 month ago

Fantastic @JHlozek thanks for the updates.

RE: TabPFN, let's address it in our meeting.

About the weighting scheme - does it seem difficult?

JHlozek commented 1 month ago

Thanks @miquelduranfrigola. Some more updates:

The weighting is now implemented and wasn't too difficult: the generators just need to return a third value, which KerasTuner automatically treats as the sample weight. At the moment, I compute the proportion of training compounds relative to the reference library and use its inverse as the weight (see the sketch below). I'm exploring extending this weighting scheme to account for the large difference between low-scoring and high-scoring compounds.
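For reference, this is roughly the pattern; a minimal sketch, not the Olinda code, relying on the fact that Keras (and therefore KerasTuner) treats the third element yielded by a generator as per-sample weights. Function and variable names are illustrative:

```python
import numpy as np

def weighted_batches(X, y, is_training_compound, batch_size=128):
    """Yield (inputs, targets, sample_weights) batches for Keras/KerasTuner."""
    # Up-weight the original model's training compounds by the inverse of
    # their proportion relative to the combined training + reference set.
    is_training_compound = np.asarray(is_training_compound, dtype=bool)
    proportion = is_training_compound.mean()
    weights = np.where(is_training_compound, 1.0 / proportion, 1.0)

    n = len(X)
    while True:
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # The third element is picked up as the per-sample weight.
            yield X[idx], y[idx], weights[idx]
```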

I now have 200k compounds pre-calculated. We should maybe start thinking about how we store and serve these (e.g. from an S3 bucket).

miquelduranfrigola commented 1 month ago

Very interesting. Thanks @JHlozek

100% agree that we need to have a data store for ZairaChem descriptors, and the right place to put this is S3. In principle, it should not be too difficult - they are in HDF5 format, correct?

Tagging @DhanshreeA so she is in the loop.

JHlozek commented 1 month ago

@miquelduranfrigola

Most of the descriptors are .h5 files. The two bidd-molmap files are .np array files, and then there are some txt files in the formats that ZairaChem expects. We might want to zip each fold into a single file.

The folder structure for each 50k fold of data is as follows:

reference_library.csv
/data/data.csv
/data/data_schema.json
/data/mapping.csv
/descriptors/cc-signaturizer/raw.h5
/descriptors/grover-embedding/raw.h5
/descriptors/molfeat-chemgpt/raw.h5
/descriptors/mordred/raw.h5
/descriptors/rdkit-fingerprint/raw.h5
/descriptors/eosce.h5
/descriptors/reference.h5
/descriptors/bidd-molmap_desc.np
/descriptors/bidd-molmap_fps.np
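As a quick sanity check on the HDF5 files without assuming their internal layout, one can simply enumerate what they contain (the path follows the structure above):

```python
import h5py

# Print every group/dataset name in one descriptor file, plus its shape
# when it has one; no assumptions are made about the internal keys.
with h5py.File("descriptors/grover-embedding/raw.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```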

I'm going to remove the duplication of grover embedding by pointing the manifolds to /grover-embedding/raw.h5 instead of the separate reference.h5 file.
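On zipping each fold to a single file, the standard library covers it; a minimal sketch with a placeholder fold path:

```python
import shutil

# Compress one 50k fold of pre-calculated descriptors into a single archive.
# "precalc_descriptors/fold_00" is a placeholder path.
archive_path = shutil.make_archive(
    base_name="fold_00_descriptors",  # produces fold_00_descriptors.zip
    format="zip",
    root_dir="precalc_descriptors/fold_00",
)
print(archive_path)
```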

miquelduranfrigola commented 1 month ago

Fantastic. Definitely, we need to keep these as zip files in S3, and perhaps write a short script to fetch those files easily?
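Such a fetch script could look roughly like the sketch below, assuming each fold is stored as a zip archive in a bucket; the bucket name and key layout are hypothetical:

```python
import zipfile
import boto3

def fetch_fold(fold_name, destination="precalc_descriptors"):
    """Download one zipped fold of descriptors from S3 and unpack it.

    Bucket name and key prefix are assumptions for illustration only.
    """
    s3 = boto3.client("s3")
    local_zip = f"{fold_name}.zip"
    s3.download_file("ersilia-zairachem-descriptors", f"folds/{fold_name}.zip", local_zip)
    with zipfile.ZipFile(local_zip) as zf:
        zf.extractall(f"{destination}/{fold_name}")

fetch_fold("fold_00")
```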