AlmaasLab / csdR

Efficient CSD implementation in R
https://almaaslab.github.io/csdR
GNU General Public License v3.0

Variance and C-, S-, D- scores not showing #4

Open GettyScience opened 1 year ago

GettyScience commented 1 year ago

Hello,

I am an undergraduate, so my errors may well come down to inexperience, but I am trying to analyze RNA-seq data with your package. I work in a lab looking for gravitropic genes in the Arabidopsis thaliana model. We have a dataset of previously examined RNA-seq data that I am trying to run through the code in R, but the results are confusing, to say the least. With 10 bootstrap iterations, we get no variance and the C- and D-values cap out at "infinity". With 100 bootstrap iterations, we get no C-, S-, or D-scores, with numbers showing only in Rho2 and var2.

I am running the analysis on my laptop (16 GB), and despite a longer wait it still runs just fine. Our data also has only 4 samples per treatment and tissue, so the analysis covers more than 27,000 genes but only 4 samples. Could either of these be related to the issues we are seeing, or could you offer any other advice?

Thank you,

yaccos commented 1 year ago

Thank you for your request. Unfortunately, 4 samples per treatment is too low considering that you have more than 27,000 genes. With such a high number of genes, I would recommend having more than 100 samples per treatment.

GettyScience commented 1 year ago

Thank you for the quick reply. Would you be willing to elaborate on why the small sample size causes the calculations to fail? In our field, small samples are common as the model is very inbred and genetically controlled. Is there a way to get past this issue within the larger code or would it remain inappropriate for the statistical methods being used?

Thank you,

yaccos commented 1 year ago

Considering that csdR makes an all-to-all comparison of the genes, it reports C-, S- and D-values for 364,486,500 gene pairs when running it with 27,000 genes. Having just 4 samples per treatment will therefore produce a lot of spurious associations. csdR was originally intended for large-scale clinical studies where sufficiently large sample sizes are available, but obtaining that amount of data is often unfeasible for other types of studies. For your sample size, there are still some analyses you can use. You could for instance measure the fold change in gene expression between the two conditions and show the results in a volcano plot.
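
To give a rough sense of the scale, the arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, not csdR's code: it assumes the co-expression measure is a rank (Spearman) correlation, that there are no tied expression values, and that the two genes in a pair are independent, which real genes are not. The function names `n_gene_pairs` and `spearman_null_distribution` are made up for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import comb


def n_gene_pairs(n_genes):
    """Number of unordered gene pairs in an all-to-all comparison."""
    return comb(n_genes, 2)


def spearman_null_distribution(n_samples):
    """Exact distribution of Spearman's rho for two independent variables
    without ties, found by enumerating every possible rank ordering."""
    ranks = range(n_samples)
    denom = n_samples * (n_samples**2 - 1)
    counts = Counter()
    for perm in permutations(ranks):
        # Sum of squared rank differences for this ordering
        d2 = sum((i - p) ** 2 for i, p in zip(ranks, perm))
        # rho = 1 - 6 * d2 / (n * (n^2 - 1)), kept as an exact fraction
        counts[Fraction(denom - 6 * d2, denom)] += 1
    total = sum(counts.values())
    return {rho: Fraction(c, total) for rho, c in sorted(counts.items())}


pairs = n_gene_pairs(27_000)
null = spearman_null_distribution(4)
p_perfect = null[Fraction(1)]          # chance of rho == 1 for unrelated genes
expected_perfect = pairs * p_perfect   # expected count among all pairs

print(pairs)                    # 364486500
print(len(null))                # 11 distinct rho values are possible
print(p_perfect)                # 1/24
print(float(expected_perfect))  # 15186937.5
```

Under these assumptions, 4 samples allow only 11 distinct correlation values, and an unrelated gene pair reaches a perfect correlation of 1 with probability 1/24. Across 364,486,500 pairs that amounts to roughly 15 million perfect correlations expected by chance alone, and the same number again at -1.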