KlugerLab / ALRA

Imputation method for scRNA-seq based on low-rank approximation
MIT License

Running ALRA with multiple samples #21

Open msraredon opened 2 years ago

msraredon commented 2 years ago

I am working on a project to compare multiple samples. Is it better to run ALRA on each sample individually or to merge them all together and then run ALRA on the whole combined dataset?

The full dataset contains samples from different chemistries and different biological conditions, collected by various users over several years, so it is safe to assume that there are both technical batch effects and biological differences overlaid on one another.

Curious if you have any feedback or experience with this question.


dylanmr commented 1 year ago

Would love to hear more about this. I have tried running ALRA on the combined dataset and was surprised by how well it preserved the biological differences, but I have not tested running on individual samples yet.

Rohit-Satyam commented 1 year ago

I discussed this issue with experts in the Bioconductor community, and they advised running ALRA separately on each sample. But I did it both ways and was surprised to find that ALRA performs better if you merge the raw counts, normalize all samples together, and then run ALRA on the merged matrix (sorry, I can't share plots as of now). I had data from two time points (control and treatment), and, surprisingly, the control samples were imputed alike, and likewise for the treatment samples.
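For concreteness, a minimal sketch of this merge-first workflow using Seurat and the RunALRA wrapper from SeuratWrappers (object names are placeholders, not from this thread):

```r
library(Seurat)
library(SeuratWrappers)

# Merge raw counts from all samples into one object (placeholder names)
merged <- merge(sample1, y = list(sample2, sample3))

# Normalize all samples together, then impute the combined matrix
merged <- NormalizeData(merged)
merged <- RunALRA(merged)  # imputed values land in a new "alra" assay

# Downstream: subset the imputed object per condition as needed
```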

Since we knew at what time points we harvested the cells, and we had bulk time-series data, we used SingleR to transfer labels onto the imputed data; the results for merge-first-then-impute were accurate. When samples were imputed separately, we observed cells getting assigned to time points that were 8 to 10 hours ahead.
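As a rough sketch, that label transfer might look like this with SingleR (the reference object and metadata column names are hypothetical placeholders):

```r
library(SingleR)
library(Seurat)

# `bulk_ref`: a SummarizedExperiment of log-normalized bulk time-series data
# with a `timepoint` column in its colData (hypothetical names)
pred <- SingleR(test   = GetAssayData(merged, assay = "alra", slot = "data"),
                ref    = bulk_ref,
                labels = bulk_ref$timepoint)

# Store the transferred labels back on the Seurat object
merged$transferred_timepoint <- pred$labels
```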

Next, we had another single-cell atlas with known infection stages, and we used it to check whether imputation affected the stage assignment. With separate imputation we observed a slight increase in later stages (which is not possible, because we have imaging data), but not for the data obtained after merging and then imputing. From this we concluded that merging the matrices before imputation is the best approach; later on you can subset the imputed matrix and perform downstream analysis.

However, we are facing a problem with the integration step. I suspect this is due to differences between the data distribution of my imputed data and that of the single-cell atlas, which I am using without imputation. My atlas follows a negative binomial (NB) distribution, but with the imputed data I observe tailing at the end of the mean-variance plot.
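A quick way to eyeball this (a sketch, assuming a Seurat object `obj` with the ALRA results in an "alra" assay):

```r
# Per-gene mean-variance relationship of the imputed data;
# heavy tailing at the high-mean end is the symptom described above
norm <- as.matrix(Seurat::GetAssayData(obj, assay = "alra", slot = "data"))
mu <- rowMeans(norm)
v  <- apply(norm, 1, var)
plot(log10(mu + 1e-6), log10(v + 1e-6),
     xlab = "log10(mean)", ylab = "log10(variance)",
     main = "Mean-variance of ALRA-imputed data")
```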

Rohit-Satyam commented 1 year ago

@dylanmr would you like to exchange emails? Since we have the same results, maybe we can learn from each other's experience with ALRA and perhaps write up a blog post or something.

msraredon commented 1 year ago

I have also come to the conclusion that merging and then imputing with ALRA seems to work "better", in that data imputed this way seems to preserve more of the cross-sample trends that can be observed in the raw data. I have not tested this thoroughly, though, because imputing a full merged object can take a great deal of time and memory, and I would need a larger, more powerful machine to prove this more rigorously.

I think the relevant metric, at least for our applications, is how closely the imputed data preserve the relative normalized gene expression levels seen across time points in the un-imputed data. Good preservation allows reliable differential expression to then be performed on the imputed slot, whereas poor preservation of relative trends makes differential expression on the imputed slot unreliable. I've wrestled with this a lot when analyzing NICHES data, and I would love to have a clear answer to this question, or at least a good way of finding one.
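One way to quantify that metric (a sketch, not an established benchmark; assumes a Seurat object `obj` with a `timepoint` metadata column and ALRA output in an "alra" assay):

```r
library(Seurat)

# Per-timepoint mean expression, before and after imputation
raw <- AverageExpression(obj, assays = "RNA",  group.by = "timepoint")$RNA
imp <- AverageExpression(obj, assays = "alra", group.by = "timepoint")$alra

# Per-gene correlation of the cross-timepoint trend (needs more than two
# timepoints for the correlation to be informative)
genes <- intersect(rownames(raw), rownames(imp))
trend_cor <- sapply(genes, function(g)
  cor(as.numeric(raw[g, ]), as.numeric(imp[g, ]), method = "spearman"))
summary(trend_cor)  # low values flag genes whose trend was distorted
```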

iichelhadi commented 9 months ago

I have a similar question. I am trying to merge multiple Seurat objects. The merge fails if I run ALRA before the merge. My issue is that I am merging datasets with highly variable sequencing depths. Any insight would be helpful.

ghost commented 8 months ago

While analyzing large amounts of data, I encountered a similar problem and ran some tests. Based on the results from my data, my current view is that integrated normalization of the data appears to have the most significant impact. Performing integrated normalization, followed by ALRA imputation for each group, cell type, or individual patient before and after treatment, seems to produce better outcomes. However, I would welcome opinions from others.

RaredonLab commented 8 months ago

I would be cautious about integrating data and then imputing based on an integrated ('pseudo-value') data slot. I think ALRA is only designed to take normalized data matrices (not integrated pseudo-value matrices) as input. But maybe I am misunderstanding the above comment.

Bottom line: our group is at this point using ALRA almost exclusively on 'total datasets'. This is because Dr. Linderman, the lead author of the ALRA study, explained to us that you want ALRA to be working with the same variance that is to be studied. So we merge all of our samples into one object and then impute that whole object. This generally requires cluster-level compute resources for more than 40,000 cells.

Additionally, we have added a (critical, for NICHES analysis) step where we limit ALRA to operating on genes that are expressed in at least a minimum number of cells, usually 25 or 50. This is something of a duct-tape solution, but it works pretty well at reducing 'false positives', i.e., genes that are highly specific but have only one or two total transcripts in the entire dataset getting imputed to hundreds or sometimes thousands of positive values. This minor alteration has massively improved the reliability of our ALRA-imputed results, i.e., the imputed trends agree much more closely with the source RNA trends.
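In sketch form, that pre-filter is just a threshold on the counts matrix before normalization and imputation (`seu` is a placeholder Seurat object; this is an illustration rather than our exact code):

```r
library(Seurat)

min.cells <- 25  # or 50
counts <- GetAssayData(seu, assay = "RNA", slot = "counts")
keep <- Matrix::rowSums(counts > 0) >= min.cells

# Restrict the object to sufficiently expressed genes, then impute
seu <- subset(seu, features = rownames(counts)[keep])
seu <- NormalizeData(seu)
seu <- SeuratWrappers::RunALRA(seu)
```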

iichelhadi commented 8 months ago

Thank you for the reply. I am integrating a large number of studies using Harmony and wasn't sure if I should use ALRA after integration; before integration doesn't work in my hands. Some of my datasets are from old sequencing platforms with really high sequencing depths, but the majority are 10X, though depth is variable in some of those as well. Before integrating, I perform normalization and scaling while correcting for percent mito, UMI count, and gene counts, and for Harmony I integrate on batches. I am not sure if I can run ALRA after Harmony integration. A sketch of my pipeline is below.
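For reference, a minimal sketch of the pre-integration pipeline described above (metadata column names are placeholders):

```r
library(Seurat)

obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj, vars.to.regress = c("percent.mt", "nCount_RNA", "nFeature_RNA"))
obj <- RunPCA(obj)
obj <- harmony::RunHarmony(obj, group.by.vars = "batch")  # integrate on batches
```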

RaredonLab commented 8 months ago

Interesting. I would be curious to see results of ALRA after integration by Harmony or others. I still think the data probably needs to be imputed as one large global object for cross-sample trends to be preserved well, but I could be wrong. It's also tricky because integrated values are, I think, basically never supposed to be used for differential testing or analysis, since they can be wildly different from the normalized values and can't really be interpreted as 'expression level' -- this post discusses this at length and links to an extended network of posts on the issue: https://github.com/satijalab/seurat/discussions/5452

iichelhadi commented 8 months ago

Thank you, I will look into this. Regards

ssukumaran2 commented 2 months ago

> I would be cautious about integrating data and then imputing based on an integrated ('pseudo-value') data slot. I think ALRA is only designed to take normalized data matrices (not integrated pseudo-value matrices) as input. But maybe I am misunderstanding the above comment.
>
> Bottom line: our group is at this point using ALRA almost exclusively on 'total datasets'. This is because Dr. Linderman, the lead author of the ALRA study, explained to us that you want ALRA to be working with the same variance that is to be studied. So we merge all of our samples into one object and then impute that whole object. This generally requires cluster-level compute resources for more than 40,000 cells.
>
> Additionally, we have added a (critical, for NICHES analysis) step where we limit ALRA to operating on genes that are expressed in at least a minimum number of cells, usually 25 or 50. This is something of a duct-tape solution, but it works pretty well at reducing 'false positives', i.e., genes that are highly specific but have only one or two total transcripts in the entire dataset getting imputed to hundreds or sometimes thousands of positive values. This minor alteration has massively improved the reliability of our ALRA-imputed results, i.e., the imputed trends agree much more closely with the source RNA trends.

Hi, thanks for this information. Could you please share the code to limit ALRA to operating on genes that are expressed in at least a minimum number of cells (usually 25 or 50)? I see the thresh argument defaults to 6, but I'm not sure if this is where to set it. Thanks!