broadinstitute / wot

A software package for analyzing snapshots of developmental processes
https://broadinstitute.github.io/wot/
BSD 3-Clause "New" or "Revised" License

Preparation of expression matrix #87

Open jma1991 opened 3 years ago

jma1991 commented 3 years ago

I am a bit confused about the "preparation of expression matrices" section in the STAR Methods of the optimal transport paper. You define three different expression matrices: the UMI matrix, the log-normalized expression matrix, and the truncated expression matrix. Which of these should be used as the input expression matrix? Additionally, if there are multiple batches of cells, should the expression matrix be batch-corrected beforehand, either by regression or some other method (e.g. fastMNN)? I couldn't find any mention of batch correction in the STAR Methods, apart from:

"The expression matrix was downsampled to 15,000 UMIs per cell. Cells with less than 2000 UMIs per cell in total and all genes that were expressed in less than 50 cells were discarded, leaving 251,203 cells and G = 19,089 genes for further analysis. The elements of expression matrix were normalized by dividing UMI count by the total UMI counts per cell and multiplied by 10,000 i.e., expression level is reported as transcripts per 10,000 counts."

I'm not sure why the data was downsampled and then normalized by UMI count. Doesn't the first correction make the second redundant? Also, is this downsampled/normalized matrix different from the three defined above? If so, should I be using it as the input expression matrix instead? Finally, you use library size as a scaling factor to correct for differences in sequencing depth, but is there a requirement to normalize for compositional biases as well (e.g. using pool-based size factors)?
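For reference, here is my rough reading of the quoted steps as a scanpy sketch (the input file name and exact call order are my guesses, not from the paper):

```python
import scanpy as sc

adata = sc.read_h5ad("expression.h5ad")  # hypothetical input file

# Cap deep cells at 15,000 UMIs; cells already below the cap are untouched
sc.pp.downsample_counts(adata, counts_per_cell=15000)

# Drop low-quality cells and rarely expressed genes
sc.pp.filter_cells(adata, min_counts=2000)
sc.pp.filter_genes(adata, min_cells=50)

# Transcripts per 10,000: divide by the per-cell UMI total, multiply by 1e4
sc.pp.normalize_total(adata, target_sum=1e4)
```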

Thanks, James

geoffschieb commented 3 years ago

Hi James,

We used the log-normalized matrix for the OT computations. The truncated expression matrix was used in the regulatory regressions.

We didn't do batch correction. This allowed us to use the distance between batches as a baseline for performance in our geodesic interpolation computations. We tested for batch effects at each time point, identified one or two time points with large batch effects, and removed those corrupted samples.

As for your last question, we downsample to a maximum of 15,000 UMIs. We still need to normalize by UMI count because most of the cells have fewer than 15,000 UMIs. This expression matrix is similar to the log-normalized expression matrix (the only remaining difference is the log(1 + x) transform).
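Concretely, the log-normalized matrix we used for OT is just a log(1 + x) transform on top of the transcripts-per-10,000 matrix. A minimal scanpy sketch (the file name is a placeholder):

```python
import scanpy as sc

adata = sc.read_h5ad("expression.h5ad")       # placeholder input
sc.pp.normalize_total(adata, target_sum=1e4)  # transcripts per 10,000
sc.pp.log1p(adata)                            # log(1 + x): the log-normalized matrix
```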

Something like SCTransform also works well, as we describe in our newer work: https://www.biorxiv.org/content/10.1101/2020.11.12.380675v1
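If you're working in Python, scanpy's experimental analytic Pearson residuals normalization is in a similar spirit; a sketch (this is a suggestion of an analogous tool, not what we used in the paper):

```python
import scanpy as sc

adata = sc.read_h5ad("expression.h5ad")  # placeholder input
# Analytic Pearson residuals, similar in spirit to SCTransform
sc.experimental.pp.normalize_pearson_residuals(adata)
```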

Best, Geoff


jma1991 commented 3 years ago

Hi Geoff,

Thanks for the super fast reply!

> We used the log-normalized matrix for the OT computations. The truncated expression matrix was used in the regulatory regressions.

Okay, thanks for clarifying.

> We didn't do batch correction. This allowed us to use the distance between batches as a baseline for performance in our geodesic interpolation computations. We tested for batch effects at each time point, identified one or two time points with large batch effects, and removed those corrupted samples.

Sorry, does that mean batch correction should or should not be used? I would prefer to correct the effect and not throw samples away.

> As for your last question, we downsample to a maximum of 15,000 UMIs. We still need to normalize by UMI count because most of the cells have fewer than 15,000 UMIs. This expression matrix is similar to the log-normalized expression matrix (the only remaining difference is the log(1 + x) transform).

Oh yes that makes sense, silly me.

> Something like SCTransform also works well, as we describe in our newer work: https://www.biorxiv.org/content/10.1101/2020.11.12.380675v1

Thanks, I will try SCTransform.

Additionally, what do you think about using size factors that account for compositional biases? I edited my original post, so you may not have seen this question.

geoffschieb commented 3 years ago

Out of over 100 runs of 10x we threw out 1, which was very different from all our other samples. This was essentially a failed reaction.

I'm not sure what you mean about size factors; could you ask again?


jma1991 commented 3 years ago

> Out of over 100 runs of 10x we threw out 1, which was very different from all our other samples. This was essentially a failed reaction.

What about in cases where you are pooling cells from multiple donors or even technologies? For example, imagine a developmental series in which you profile multiple mouse embryos at each time point. In a conventional scRNA-seq analysis you might correct the expression matrix so that any donor-specific variation is removed.

> I'm not sure what you mean about size factors; could you ask again?

You divide the counts matrix by the sum of UMIs for each cell, which corrects for unequal sequencing depth across cells. However, this does not correct for compositional differences caused by unbalanced differential expression between samples. It's explained here: http://bioconductor.org/books/release/OSCA/normalization.html#normalization-by-deconvolution
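To illustrate, here is a toy numpy sketch of the pooling-and-deconvolution idea (my own simplification; the real scran algorithm uses several pool sizes and orders cells by library size):

```python
import numpy as np

def deconvolution_size_factors(counts, pool_size=20, seed=0):
    """Toy pool-based size factors in the spirit of Lun et al. (2016).
    counts: cells x genes UMI matrix (dense numpy array)."""
    rng = np.random.default_rng(seed)
    n = counts.shape[0]
    order = rng.permutation(n)    # arrange cells on a ring
    ref = counts.mean(axis=0)     # average "pseudo-cell" reference
    keep = ref > 0
    rows, b = [], []
    for start in range(n):        # one sliding pool per cell
        idx = order[(start + np.arange(pool_size)) % n]
        pooled = counts[idx].sum(axis=0)
        # median ratio of the pooled profile to the reference is
        # robust to genes that are DE in a subset of cells
        b.append(np.median(pooled[keep] / ref[keep]))
        row = np.zeros(n)
        row[idx] = 1.0            # pool factor ~ sum of member factors
        rows.append(row)
    A = np.vstack(rows)
    theta, *_ = np.linalg.lstsq(A, np.asarray(b), rcond=None)
    return theta / theta.mean()   # size factors centered at 1
```

Each cell's counts would then be divided by its size factor, instead of by its raw UMI total.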