Overestimation with new version

Vivianstats / scImpute

Accurate and robust imputation of scRNA-seq data

https://www.nature.com/articles/s41467-018-03405-7


Overestimation with new version #13

Closed carisdak closed 6 years ago

carisdak commented 6 years ago

Hi,

I recently tried your newest version (v0.0.6) on my data, but I ran into a problem. My experiment consists of taking a real dataset, masking part of its values (artificially setting them to 0), and checking whether scImpute successfully recovers them. However, I could not make it work, and maybe I did something wrong (for example in a preprocessing step). I used K=1 and default parameters. Do you have any idea what could cause this? Here is what I observe:

Thanks

Vivianstats commented 6 years ago

Hello Nak0r,

I'd like to ask a few questions to help me identify the problem.

Do you have the same problem with v0.0.5 or previous versions?
Are the masked values randomly selected?
Are there already dropouts before you manually introduce 0s?
Can I take a look at the plot on the log10 scale?

Thanks, Vivian

carisdak commented 6 years ago

Thank you for your help. Here is what I tried:

I noticed this issue with v0.0.5 and v0.0.6, but it worked very well with v0.0.2. I could not really try v0.0.3 and v0.0.4 because there was a problem with Kcluster=1.
The values are indeed randomly selected and, you're right, there were already dropouts before I introduced these 0s (but I added only a small number of 0s, so the dataset does not change too much).
Here is the plot on the log scale:

Some more detail about the procedure:

I worked with the raw counts as recommended in the readme.
I masked a small part of the data and then extracted the first 2000 genes ranked by their mean value.
I imputed the masked matrix using scImpute's default parameters (drop_thre=0.5) and Kcluster=1. I then have two datasets: the "true" data and the imputed data.
Finally, I compared the recovered masked values with the true ones.
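
For readers who want to reproduce this kind of masking experiment, the steps above can be sketched roughly as follows. This is only an illustration in plain Python, not the code used in this thread: the function names (`mask_counts`, `top_genes_by_mean`, `recovery_pairs`), the masking fraction, and the toy matrix are all hypothetical, and it assumes that genes are ranked by their mean in the masked matrix. scImpute is an R package and is not called here, so the imputation step is a placeholder.

```python
import math
import random


def mask_counts(counts, frac=0.05, seed=0):
    """Set a random fraction of the nonzero entries to 0.

    Returns a masked copy and the list of (gene, cell) positions that were masked.
    """
    rng = random.Random(seed)
    nonzero = [(g, c) for g, row in enumerate(counts)
               for c, v in enumerate(row) if v > 0]
    n_mask = int(round(frac * len(nonzero)))
    masked_pos = rng.sample(nonzero, n_mask)
    masked = [list(row) for row in counts]
    for g, c in masked_pos:
        masked[g][c] = 0
    return masked, masked_pos


def top_genes_by_mean(counts, n_genes):
    """Row indices of the n_genes genes with the highest mean count."""
    means = [sum(row) / len(row) for row in counts]
    order = sorted(range(len(counts)), key=lambda g: means[g], reverse=True)
    return order[:n_genes]


def recovery_pairs(true_counts, imputed, masked_pos):
    """(true, imputed) values at the masked positions on the log10(x + 1) scale."""
    return [(math.log10(true_counts[g][c] + 1), math.log10(imputed[g][c] + 1))
            for g, c in masked_pos]


# Toy raw-count matrix: 6 genes (rows) x 5 cells (columns), with existing zeros.
toy_rng = random.Random(1)
true_counts = [[toy_rng.choice([0, 0, 3, 10, 50]) for _ in range(5)]
               for _ in range(6)]

# 1. Mask a small part of the nonzero values.
masked, masked_pos = mask_counts(true_counts, frac=0.2)

# 2. Keep the highest-mean genes (4 here, 2000 in the thread) and re-index.
keep = top_genes_by_mean(masked, n_genes=4)
row_of = {g: i for i, g in enumerate(keep)}
true_sub = [true_counts[g] for g in keep]
masked_sub = [masked[g] for g in keep]
pos_sub = [(row_of[g], c) for g, c in masked_pos if g in row_of]

# 3. Placeholder for the imputation step (run scImpute in R on masked_sub instead).
imputed_sub = masked_sub

# 4. Compare recovered and true values at the masked positions.
pairs = recovery_pairs(true_sub, imputed_sub, pos_sub)
```

Plotting the first element of each pair against the second gives a true-versus-imputed scatter on the log10 scale; points far above the diagonal would correspond to overestimated values.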

Thanks, Cedric

Vivianstats commented 6 years ago

Hello Cedric,

Thanks for sending me the information.

Just to make sure that I understand correctly: you first masked the values and then extracted 2000 genes, so the input for scImpute is a count matrix with 2000 rows? Also, have you investigated the raw data, and is it safe to assume a single population?

Best, Vivian

Vivianstats commented 6 years ago

Could you also send me a reproducible example so that I can test it with a modified scImpute?

carisdak commented 6 years ago

Hi Vivian,

You understood the masking correctly. Regarding the clustering, I did not investigate it since I am working with a cell line. Do you think clustering could still work?

Vivianstats commented 6 years ago

Hello Cedric,

If you are working with a cell line, I think it's better to keep Kcluster = 1.

I have tested your example dataset and I think there are two things you may try to improve the results.

First, we recommend using the whole-genome matrix (with at least 10,000 genes), which ensures that the identification of dropouts is robust.

Second, I have made slight changes to the algorithm to avoid extreme values in the regression. If you reinstall scImpute from GitHub, you should be able to obtain a comparison like the one in the attached scatter plot.

Hopefully, you can get even better results when you try the whole-genome matrix.

Best, Vivian

carisdak commented 6 years ago

Hi,

Thank you very much for your help. I do get a slightly better result now. I will let you know if I manage to improve it even further (by taking more genes, as you mentioned).

Best, Cedric

Vivianstats commented 6 years ago

Hello Cedric,

I'm closing this issue but feel free to open a new one if you have further questions.