the meaning of output files

BioAmelie commented 4 years ago

Hi @almaan,

I have successfully run stereoscope with the following command: stereoscope run --sc_cnt ../All_ko3_celltype_cnt.tsv --sc_labels ../All_ko3_celltype_meta.tsv -o 3170_v2 --st_cnt ../st_cnt.tsv --gpu -mc 10 -stb 2048 -scb 2048 -n 5000. However, W.2020-07-17072307.309946.tsv looks odd; for example, some cell types have an implausibly high proportion. I would therefore like to understand the output files so that I can evaluate whether the result is right. Can you tell me the meaning of logits.2020-07-17072307.309946.tsv and R.2020-07-17072307.309946.tsv, and which parameters you would recommend tuning to get a more reliable result? In addition, I intend to filter out ribosomal and mitochondrial genes and to input only the marker genes of the single-cell cell types; do you think this is feasible? I am also unsure how you define the top n most highly expressed genes: are these the genes with the highest mean expression across cells?

almaan commented 4 years ago

Hi @BioAmelie,

great to hear that you are exploring stereoscope! Judging from the command you posted, it seems that you are running with the default number of epochs, which is set to 20000 in both steps. I would recommend increasing this to at least 50000 if you are working with Visium data, and maybe even more. My advice would be to check whether your system has converged (see Section "3.2 Monitoring progress" in the README) or whether it needs to be run for a longer time. If you don't want to restart the whole process, you can use the -scm option (see stereoscope run -h for more info) to continue the fitting of an already existing model.
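As a rough illustration of what "converged" could mean in practice, the sketch below compares the average loss over the most recent epochs with the average over the epochs just before them. This is not stereoscope's own criterion: the function name, the window size and the tolerance are all assumptions chosen for the example, and the loss values are expected to be supplied as a plain list of numbers.

```python
def has_plateaued(losses, window=1000, rel_tol=1e-3):
    """Heuristic plateau check (hypothetical helper, not part of stereoscope).

    Returns True when the mean loss over the last `window` epochs differs
    from the mean over the preceding `window` epochs by at most `rel_tol`
    in relative terms.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge

    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window

    # Relative change between the two windows
    return abs(previous - recent) <= rel_tol * abs(previous)
```

If a check like this returns False at the end of a run, the loss is still changing noticeably and more epochs are probably needed; a visual inspection of the loss curve remains the more reliable judgment.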

The logits.tag.tsv and R.tag.tsv files are not really useful for assessing the state of your system or whether your results are correct. They contain the rate (R) and log odds (logits, a different way of describing the success probabilities) of the negative binomial distribution, which is the underlying statistical model used during the inference. We use the exact same parametrization as the PyTorch implementation of the Negative Binomial. For more information regarding the rates and success probabilities and what they represent in the model, I would refer to the bioRxiv pre-print, where this is thoroughly described in the Methods section.
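To make the parametrization a little more concrete: PyTorch's torch.distributions.NegativeBinomial takes a total_count and either probs or logits, and reports a mean of total_count * exp(logits) and a variance of mean / (1 - p), where p is the success probability. The sketch below reproduces those two formulas in plain Python. It assumes that a value from the R file plays the role of total_count, which is an interpretation of "same parametrization" and not something confirmed here; the function name is hypothetical.

```python
import math


def nb_mean_and_variance(total_count, logit):
    """Mean and variance of a negative binomial in PyTorch's parametrization.

    total_count: the rate parameter (assumed here to correspond to a value in R)
    logit:       log odds of the success probability p
    """
    p = 1.0 / (1.0 + math.exp(-logit))    # success probability from log odds
    mean = total_count * math.exp(logit)  # equals total_count * p / (1 - p)
    variance = mean / (1.0 - p)
    return mean, variance
```

With a logit of 0 the success probability is 0.5, so the mean equals the rate and the variance is twice the mean.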

The top n most highly expressed genes are taken as those with the highest total sum across all cells in the single cell data. You can definitely "spike" your analysis by specifying a custom set of genes to be analyzed; this has shown good results in other studies, e.g. this one. However, using "only" the marker genes is something I haven't tried, and I would be slightly reluctant to try it unless this list is fairly large. The way to specify a custom gene list is to create a txt file where all the genes you want to include in the analysis are listed one per row, and then in your analysis use -gl GENELIST.txt and stereoscope will use these genes.
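The selection rule and the file format described above can be sketched as follows. This is only an illustration and not stereoscope's internal code: the function names are hypothetical, and the count matrix is assumed to be a list of per-cell rows whose columns line up with the gene names.

```python
def top_n_genes(counts, genes, n):
    """Return the n genes with the highest total count summed over all cells.

    counts: list of rows, one per cell, each aligned with `genes` (assumed layout)
    genes:  list of gene names, one per column
    """
    totals = [0] * len(genes)
    for row in counts:
        for j, value in enumerate(row):
            totals[j] += value

    # Rank gene indices by total count, highest first
    ranked = sorted(range(len(genes)), key=lambda j: totals[j], reverse=True)
    return [genes[j] for j in ranked[:n]]


def gene_list_text(genes):
    """Format genes one per row, the layout expected for the -gl file."""
    return "".join(gene + "\n" for gene in genes)
```

Writing the string returned by gene_list_text to GENELIST.txt gives a file that can be passed with -gl GENELIST.txt.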

Good luck with the continued analysis! Alma

BioAmelie commented 4 years ago

Hi @almaan,

Sorry for my late reply. I will follow your suggestion.

minfang

almaan commented 4 years ago

Hello @BioAmelie,

I hope things work out for you. If you feel that your questions have been answered, I would ask you to close this issue. Of course, if you want to continue the discussion, you may leave it open.

Best Alma

BioAmelie commented 4 years ago

Hi @almaan,

My ST data is from mouse lung. Can you tell me what I should keep in mind when I select a custom set of genes to be analyzed? I want to combine cell type marker genes with highly expressed genes, except ribosomal and mitochondrial genes. Do you think this is feasible?

almaan commented 4 years ago

Sounds like a great start. As alluded to above, we constructed a similar custom list when analyzing breast cancer data, with some really promising results. Also make sure the system converges, otherwise the mapping will not be optimal!
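A custom list of the kind discussed here, marker genes combined with highly expressed genes but without ribosomal and mitochondrial genes, could be assembled along these lines. The prefixes are an assumption based on common mouse gene symbol conventions (mt- for mitochondrial genes, Rps and Rpl for ribosomal protein genes) and should be checked against the actual annotation; the function name is hypothetical, and the filter is applied to both input lists.

```python
# Assumption: mouse symbols, mitochondrial genes start with "mt-",
# ribosomal protein genes with "Rps" or "Rpl".
EXCLUDED_PREFIXES = ("mt-", "Rps", "Rpl")


def build_gene_list(marker_genes, highly_expressed,
                    excluded_prefixes=EXCLUDED_PREFIXES):
    """Merge markers and highly expressed genes, dropping excluded prefixes.

    Order is preserved (markers first) and duplicates are removed.
    """
    seen = set()
    combined = []
    for gene in list(marker_genes) + list(highly_expressed):
        if gene in seen or gene.startswith(excluded_prefixes):
            continue
        seen.add(gene)
        combined.append(gene)
    return combined
```

For instance, with the illustrative inputs ["Sftpc", "Scgb1a1"] as markers and ["mt-Co1", "Rpl13", "Sftpc", "Actb"] as highly expressed genes, the result keeps Sftpc, Scgb1a1 and Actb. The resulting list can then be written one gene per row and supplied with -gl.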

Best of luck Alma

almaan / stereoscope

the meaning of output files #11