About gene expression input to the model

sunset222 commented 8 months ago

Hi, thank you for your amazing work.

However, I am quite confused about the gene expression input to the models (both the PaccMann predictor and PaccMann RL).

1. In the PaccMann predictor: your recent model used RNA-Seq data, and the dataset you uploaded (~400 cell lines) does not cover all cell lines in GDSC (~1000 cell lines). Some of the 2,128 selected genes are also missing.

Can you explain how the model handles the expression values of missing genes? And how does it handle data points that are missing from the cell line to gene expression dictionary?

2. In PaccMann RL (generator): as you mention in the README, you used RNA-Seq gene expression data for the whole framework. But it seems that the input gene expression for conditional generation (the pickle file) is RMA-normalized. I think so for the reasons mentioned above: the RMA data covers most cell lines (985) and contains the 2,128 selected genes.

You mention in the paper that the PVAE is trained with TCGA RNA-Seq data. Thus I think there might be a discrepancy when you encode RMA gene expression with the PVAE encoder. Can you explain the exact source of the pickle file (gdsc_transcriptomics_for_conditional_generation.pkl) and why you do not use that pickle file in the other part (the PaccMann predictor)?

jannisborn commented 8 months ago

Hi @sunset222,

Thanks for the interest in our work.

The data we shared alongside the original PaccMann paper (https://pubs.acs.org/doi/10.1021/acs.molpharmaceut.9b00520) comes in the form of TFRecords and should contain data for all ~1000 cell lines in GDSC. See https://ibm.biz/paccmann-data . If you prefer starting from a human-readable format rather than binaries, please check out https://ibm.ent.box.com/v/paccmann-pytoda-data and go to splitted_data. You will find there train/test files with IC50 data for ~1000 GDSC cell lines. For the gene expression data, check out the data/gene_expression folder. Regarding missing genes, we do zero-imputation. This is identical to a mean-imputation since we standardize data per gene before model training.
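To illustrate why zero-imputation after per-gene standardization coincides with mean-imputation, here is a minimal sketch on a toy matrix. The function names, the NaN encoding of missing values, and the toy numbers are assumptions for illustration only, not the PaccMann code.

```python
import numpy as np


def fit_gene_stats(expr):
    """Per-gene mean and std over the observed (non-NaN) values."""
    mean = np.nanmean(expr, axis=0)
    std = np.nanstd(expr, axis=0)
    # Guard against constant genes to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def zero_impute(expr, mean, std):
    """Standardize per gene, then set missing entries to 0."""
    z = (expr - mean) / std
    return np.where(np.isnan(z), 0.0, z)


def mean_impute(expr, mean, std):
    """Fill missing entries with the gene mean, then standardize."""
    filled = np.where(np.isnan(expr), mean, expr)
    return (filled - mean) / std


# Toy data: rows are cell lines, columns are genes, NaN marks a missing value.
expr = np.array([
    [1.0, 10.0, np.nan],
    [2.0, np.nan, 5.0],
    [3.0, 30.0, 7.0],
])
mean, std = fit_gene_stats(expr)
zero_imputed = zero_impute(expr, mean, std)
mean_imputed = mean_impute(expr, mean, std)

# Both routes give the same matrix: a mean-filled value standardizes to 0.
print(np.allclose(zero_imputed, mean_imputed))
```

Under the same assumptions, a gene that is absent from a dataset altogether would simply become a column of zeros in the standardized input.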

So, for the original PaccMann paper and everything I wrote above, we indeed used RMA data. In the PaccMannRL paper, as you say, we train the PVAE with RNA-Seq data from TCGA, so your observation is correct. We explained this in S1.2 of the PaccMannRL paper:

For RL optimization of G, we used GEPs publicly available from GDSC (Yang et al., 2012) and
CCLE (Barretina et al., 2012) databases. Since the RNA-Seq of these cancer cell line databases were
passed through the PVAE (pretrained on human samples from TCGA (Weinstein et al., 2013)), we
compared the standardized gene expression distributions for the selected genes across these databases
and found good agreement (compare Figure S4 in Supplementary Material S2), in alignment with
the reported consensus between transcriptomic data in CCLE and TCGA (Ghandi et al., 2019). To
train the critic (C), IC50 drug sensitivity data from GDSC and CCLE was utilized.
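As a rough illustration of what such a comparison could look like, the sketch below standardizes two expression matrices per gene and measures the largest gap between their pooled quantiles. It is not the analysis behind Figure S4: the synthetic data, the function names, and the quantile-gap measure are all assumptions made for this example.

```python
import numpy as np


def standardize_per_gene(expr):
    """Z-score each gene (column) within its own dataset."""
    mean = expr.mean(axis=0)
    std = expr.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant genes
    return (expr - mean) / std


def quantile_gap(a, b, qs=np.linspace(0.05, 0.95, 19)):
    """Largest absolute difference between matching quantiles of a and b."""
    return float(np.max(np.abs(np.quantile(a, qs) - np.quantile(b, qs))))


# Synthetic stand-ins for two databases measured on different scales.
rng = np.random.default_rng(0)
cell_lines = rng.normal(5.0, 2.0, size=(200, 50))  # hypothetical cell line GEPs
tumours = rng.normal(8.0, 3.0, size=(300, 50))     # hypothetical patient GEPs

gap = quantile_gap(
    standardize_per_gene(cell_lines).ravel(),
    standardize_per_gene(tumours).ravel(),
)
# A small gap means the standardized distributions have a similar shape.
print(f"max quantile gap: {gap:.3f}")
```

Per-gene standardization removes differences in location and scale, so a measure like this only compares the shape of the distributions.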

The reason we didn't use the pickled file in the PaccMann predictor is that the project grew organically: the PaccMann predictor existed first, then we conceived PaccMannRL.

Hope this helps. Let me know if you need more help.

sunset222 commented 8 months ago

Thank you for your fast and kind reply. I will use the pickle file to rebuild your framework. :)
