predictability score - GitHub Issues

Tako-liu commented 1 month ago

Hello, this is an excellent piece of work. Based on the code provided in the "MERFISH_scRNAseq_integration.ipynb" notebook, I have combined my MERFISH data with single-cell RNA-seq data and inferred the expression of genes that are not present in the MERFISH data. In the methods, I saw you mentioned a "predictability score" and wrote that it is shown in Supplementary Fig. 14a. Could you also share the code for the "predictability score" so that I can calculate the scores for my own data? Thank you once again for such an outstanding job!

Harrison-Q-Ma commented 1 month ago

Hello. Thank you for your interest in our work. There are many different implementations for Pearson's correlation in both Python and R. You may use scipy.stats.pearsonr() which will get the job done. In that function, you will pass in the predicted expression for gene x and measured expression for gene x as the two parameters. Please let me know if I can be of more help.
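To make the call above concrete, here is a minimal sketch of a per-gene score with scipy.stats.pearsonr(). The helper name predictability_scores, the (cells x genes) array layout, and the synthetic data are assumptions for illustration only and are not taken from the notebook.

```python
import numpy as np
from scipy.stats import pearsonr


def predictability_scores(predicted, measured, gene_names):
    """Pearson r between predicted and measured expression, one value per gene.

    predicted, measured: arrays of shape (n_cells, n_genes) with the same
    cell order and gene order (hypothetical layout, adapt to your data).
    """
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    if predicted.shape != measured.shape:
        raise ValueError("predicted and measured must have the same shape")
    scores = {}
    for j, gene in enumerate(gene_names):
        x, y = predicted[:, j], measured[:, j]
        # Pearson r is undefined when either vector is constant; report NaN.
        if np.ptp(x) == 0 or np.ptp(y) == 0:
            scores[gene] = float("nan")
            continue
        r, _p_value = pearsonr(x, y)
        scores[gene] = float(r)
    return scores


# Synthetic stand-in data: "measured" counts plus noise as the "prediction".
rng = np.random.default_rng(0)
measured = rng.poisson(5.0, size=(200, 3)).astype(float)
predicted = measured + rng.normal(0.0, 1.0, size=measured.shape)
scores = predictability_scores(predicted, measured, ["GeneA", "GeneB", "GeneC"])
```

Both matrices must describe the same cells in the same order, so this only works for genes that have a measured value to compare against.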

Harrison

Tako-liu commented 1 month ago

Thank you for your prompt response. I understand that I can use scipy.stats.pearsonr() for the score calculation. I would like to confirm what the input data should be.

In my case, I used the single-cell data to predict genes that were not detected in the MERFISH data; that is my predicted expression. For the measured expression, should I use the single-cell data or the MERFISH data?

Harrison-Q-Ma commented 1 month ago

For unmeasured genes, it doesn't make sense to do any correlation since you don't have a ground truth. I would suggest doing one of two things to show that your gene expression transfer makes sense:

1. K-fold cross-validation to compare predicted expression to real expression on the hold-out data. You would probably prefer this one since it doesn't incur extra cost.
2. RNAscope or smFISH to measure the expression in situ and compare it to the prediction. This is good if you just want to validate a few genes.

Tako-liu commented 1 month ago

Thank you for your timely reply.

So, the predictability score can only be calculated for genes that are present in both the single-cell data and the MERFISH data; for genes that were not detected in MERFISH, it cannot be calculated. To validate the predicted expression of these genes, one can only use K-fold cross-validation to test the capability of the NearestNeighbors model, or perform smFISH for validation. Is that what you mean?

Then I would like to ask: what is the main takeaway of this graph?

kernco commented 1 month ago

Hi, to calculate the predictability scores for genes not in the MERFISH data, we took out a random subset of our sequencing data, reduced it to the set of genes measured in MERFISH, and used this as a "mock" MERFISH dataset. We then used our gene imputation method to predict the expression of the remaining genes in this dataset, which we could then correlate with the actual expression values. This can be done in a cross-validation framework, e.g. doing 10 iterations using 10% of the data as the mock MERFISH dataset on each iteration, then combining the results.
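The mock-MERFISH procedure described above could be sketched roughly as follows. This is not the code from the repository: knn_impute is a plain NumPy k-nearest-neighbour average standing in for the actual imputation method, mock_merfish_cv and all parameter names are hypothetical, "combining the results" is read here as pooling the held-out predictions before correlating, and the data are synthetic.

```python
import numpy as np


def knn_impute(ref_panel, ref_target, query_panel, k=10):
    """Average the target-gene expression of each query cell's k nearest
    reference cells, with neighbours found in panel-gene space."""
    # Squared Euclidean distances, shape (n_query, n_ref).
    d2 = ((query_panel[:, None, :] - ref_panel[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return ref_target[nearest].mean(axis=1)


def mock_merfish_cv(expr, panel_idx, target_idx, n_splits=10, k=10, seed=0):
    """Cross-validated Pearson r per target gene.

    expr: (n_cells, n_genes) sequencing matrix (assumed already normalized).
    panel_idx: columns of genes on the MERFISH panel.
    target_idx: columns of genes to impute and score.
    """
    n_cells = expr.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_cells), n_splits)
    predicted = np.empty((n_cells, len(target_idx)))
    for held_out in folds:
        # Held-out cells play the mock MERFISH data: only panel genes are used.
        ref = np.setdiff1d(np.arange(n_cells), held_out)
        predicted[held_out] = knn_impute(
            expr[ref][:, panel_idx],
            expr[ref][:, target_idx],
            expr[held_out][:, panel_idx],
            k=k,
        )
    # Pool the held-out predictions from all folds, then correlate per gene.
    return np.array([
        np.corrcoef(predicted[:, j], expr[:, t])[0, 1]
        for j, t in enumerate(target_idx)
    ])


# Synthetic example: 300 cells, 30 genes driven by 3 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 3))
loadings = rng.normal(size=(3, 30))
expr = latent @ loadings + 0.3 * rng.normal(size=(300, 30))
panel_idx = np.arange(20)        # stand-in for genes on the MERFISH panel
target_idx = np.arange(20, 30)   # stand-in for genes to be imputed
scores = mock_merfish_cv(expr, panel_idx, target_idx)
```

With 10 splits, each iteration treats about 10% of the cells as the mock MERFISH dataset, matching the example given above. The dense distance matrix is fine for a toy dataset; for real data a neighbour index such as the NearestNeighbors model mentioned earlier in the thread would be the more practical choice.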

ChiLab-UCSD / Heart_MERFISH_analysis

predictability score #2