cole-trapnell-lab / cicero-release

https://cole-trapnell-lab.github.io/cicero-release/
MIT License

Scale of the normalized gene activity matrix #23

Closed crazyhottommy closed 5 years ago

crazyhottommy commented 5 years ago

Hi,

In general, what's the scale of the matrix?

> range(atac_mat)
[1] 0.0000000 0.9999923

I asked the Seurat V3 authors because I am using it for label transfer from scRNA-seq data. Their reply:

That is a weird distribution for the normalized values. The cortex ATAC data we looked at (generated by the Shendure/Trapnell labs) has a distribution more similar to scRNA (see attached; the title of the plot says RNA, but it's ATAC data).

Does Cicero normalize the values somehow, or should I use the un-normalized values?

Thanks, Tommy

[Attached image: image001]

crazyhottommy commented 5 years ago

I read the Mol Cell paper and see that the co-accessibility score ranges from 0 to 1 because you are calculating a correlation. I'm not sure where the Seurat group got their data.

crazyhottommy commented 5 years ago

I went to http://krishna.gs.washington.edu/content/members/ajh24/mouse_atlas_data_release/activity_score_matrices/ and saw that there are both binarized and quantitative activity scores.

In the tutorial, you binarized the matrix:

# read in matrix data using the Matrix package
indata <- Matrix::readMM("filtered_peak_bc_matrix/matrix.mtx") 
# binarize the matrix
indata@x[indata@x > 0] <- 1

If I want to get the quantitative gene activity score, should I not binarize it?

Thanks.

hpliner commented 5 years ago

No, the default output explained in the tutorial will be the quantitative scores. We generally binarize the input matrix because, given the expected sparsity of the data and the fact that there should generally be only two possible reads from a given site (diploid genome), we expect most values > 1 to be missed duplicates.
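
For readers following along, here is a minimal sketch of what the tutorial's binarization line does to a sparse count matrix. The toy matrix and the variable names `counts`, `binary`, and `frac_above_one` are made up for illustration; only `indata@x[indata@x > 0] <- 1` comes from the tutorial snippet quoted in this thread.

```r
library(Matrix)

# Toy peak-by-cell count matrix (hypothetical values, not real data).
counts <- sparseMatrix(
  i = c(1, 2, 2, 3, 4),
  j = c(1, 1, 2, 3, 3),
  x = c(1, 3, 2, 1, 5),
  dims = c(4, 3)
)

# Binarize as in the tutorial: every stored non-zero entry becomes 1.
# Only the @x slot (the non-zero values) is touched, so sparsity is kept.
binary <- counts
binary@x[binary@x > 0] <- 1

# Fraction of non-zero entries that were above 1 before binarization,
# i.e. the entries whose value the binarization actually changes.
frac_above_one <- mean(counts@x > 1)
```

Checking `frac_above_one` on a real input matrix is one way to see how much information the binarization discards. This affects only the input to Cicero, not the scale of the gene activity scores it outputs.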

crazyhottommy commented 5 years ago

Thanks. I followed the tutorial exactly using the 10x PBMC 10k data. I then checked the final normalized gene activity score matrix, and it ranges from 0 to 1.

Could you please confirm the range or distribution of the gene activity scores, as shown in the histogram in my previous message?

Seurat V3 uses the counts in the gene body + 2 kb upstream as a proxy for gene activity. I want to compare their method with Cicero's.

Thanks very much.
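
To make the "gene body + 2 kb upstream" proxy mentioned above concrete, here is a rough base R sketch. It is not Seurat's implementation. The gene and peak coordinates are invented, everything is assumed to be on a single chromosome, and `extend_upstream` is a hypothetical helper name.

```r
# Hypothetical gene annotation (coordinates invented for illustration).
genes <- data.frame(
  gene   = c("geneA", "geneB"),
  start  = c(10000, 50000),
  end    = c(15000, 58000),
  strand = c("+", "-"),
  stringsAsFactors = FALSE
)

# Extend each gene body 2 kb upstream, respecting strand:
# upstream of a "+" gene is before its start, of a "-" gene after its end.
extend_upstream <- function(genes, upstream = 2000) {
  plus <- genes$strand == "+"
  genes$win_start <- ifelse(plus, pmax(genes$start - upstream, 1), genes$start)
  genes$win_end   <- ifelse(plus, genes$end, genes$end + upstream)
  genes
}

# Hypothetical peaks for one cell, with a read count per peak.
peaks <- data.frame(
  start = c(8500, 12000, 59000, 70000),
  end   = c(9000, 12500, 59500, 70500),
  count = c(2, 1, 4, 3)
)

windows <- extend_upstream(genes)

# Sum the counts of all peaks that overlap each gene's window.
windows$activity <- vapply(seq_len(nrow(windows)), function(k) {
  hit <- peaks$start <= windows$win_end[k] & peaks$end >= windows$win_start[k]
  sum(peaks$count[hit])
}, numeric(1))
```

Sums like these are raw counts and are not bounded by 1, which is one reason the two kinds of score are not directly comparable without rescaling.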

hpliner commented 5 years ago

Apologies for the very long delay in replying. The output gene activity scores from Cicero are normalized, so they will be quantitative values from 0 to 1. For the mouse atlas project, we did a post-processing step to convert the values to a more 'FPKM-like' scale, which is described in the methods of that paper (https://www.cell.com/cell/fulltext/S0092-8674(18)30855-9#secsectitle0085) in the section titled 'Computing Gene Activity Scores'.

crazyhottommy commented 5 years ago

No problem. Many thanks for the clarification.