Is there any way to retrieve / assign aggregated cluster IDs to original cell labels?

cole-trapnell-lab / cicero-release

https://cole-trapnell-lab.github.io/cicero-release/

MIT License


Is there any way to retrieve / assign aggregated cluster IDs to original cell labels? #77

Closed HaglundA closed 2 years ago

HaglundA commented 3 years ago

Hi,

My question is basically the title above. Is it possible, after running make_cicero_cds, to go back to the original cell IDs and match them to the clusters formed in the cicero command?

Thanks!

hpliner commented 3 years ago

Hi, the short answer is no: there isn't a way to do this in the current implementation. Because of the bagging procedure used, the mapping isn't one-to-one (a cell can be in multiple groups), so it can't simply be added as a column to the pData, for example. I will leave this issue open, though, as this is a request I've had before. Perhaps when I get some time I can add an option to output this.

HaglundA commented 3 years ago

Hi, thanks for your answer and for clarifying! I have a separate question that I didn't think was worth a new issue.

The publication states: "Accessibility counts are then summed across all cells in a group to create count matrix C"

I'm trying to get a better understanding of the aggregation procedure. Looking at the source code for make_cicero_cds, where is the aggregation/summing of counts within the regions done after bagging? Thanks in advance!

On a separate point, why does creating the cds object take the ATAC bam file and the cell barcode file rather than just rownames(indata) and colnames(indata)?

Sorry for the multiple questions and thank you in advance!! :)

hpliner commented 2 years ago

Hello, the aggregation happens here:

https://github.com/cole-trapnell-lab/cicero-release/blob/307441edc0a61d7037b7d346f10173b26783a845/R/runCicero.R#L137

Basically, it creates a T/F mask based on the cells to be included and uses matrix multiplication to do the summation (T becomes 1 and F becomes 0, so the cells not included in the mask row don't count towards the total).
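As a rough illustration of the mask-and-multiply idea described above, here is a minimal sketch on a toy matrix. It is not the Cicero source: the matrix, the `groups` list, and the orientation of the mask (cells as rows, aggregates as columns) are all assumptions made for the example.

```r
# Toy peak-by-cell count matrix: 3 peaks (rows) x 4 cells (columns).
counts <- matrix(c(1, 0, 2, 0,
                   0, 1, 1, 1,
                   3, 0, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("peak", 1:3), paste0("cell", 1:4)))

# Hypothetical group membership: each element lists the cells in one aggregate.
# cell2 appears in both groups, as can happen when groups overlap.
groups <- list(agg1 = c("cell1", "cell2"),
               agg2 = c("cell2", "cell3", "cell4"))

# Logical mask with one row per cell and one column per aggregate.
mask <- sapply(groups, function(g) colnames(counts) %in% g)
rownames(mask) <- colnames(counts)

# In the matrix product TRUE is treated as 1 and FALSE as 0, so each
# column of the result is the sum over the cells in that aggregate.
agg_counts <- counts %*% mask
print(agg_counts)
```

Because cell2 is TRUE in both mask columns, its counts contribute to both aggregates, which is the same reason the cell-to-group mapping is not one-to-one.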

I'm not sure what you mean by the bam file, as the cicero input doesn't require that... can you point me towards where you're seeing that?

HaglundA commented 2 years ago

Hi,

Ah I see, that's what I had imagined. Thanks for clarifying!!

Sorry, I misspoke; I meant the bed file! https://cole-trapnell-lab.github.io/cicero-release/docs/

peakinfo <- read.table("filtered_peak_bc_matrix/peaks.bed")
names(peakinfo) <- c("chr", "bp1", "bp2")
peakinfo$site_name <- paste(peakinfo$chr, peakinfo$bp1, peakinfo$bp2, sep="_")
row.names(peakinfo) <- peakinfo$site_name

saorisakaue commented 2 years ago

Hi, thank you very much for the great implementation of Cicero! I just wanted to jump into this thread: I was also wondering whether the feature of outputting the aggregated cell information from make_cicero_cds could be implemented soon. I think it would be very useful when we want to use the same set of aggregated cells in multi-ome data. If I understand correctly, the mask matrix holds that information?

Thanks in advance! Saori

hpliner commented 2 years ago

Sorry, replying to this very late. The bed file bit is only because with this example dataset I didn't have the position info saved anywhere else. If you have the positions in the chr1_202921_203481 format someplace else, you can certainly sub that in and forget about the bed!
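For anyone who wants to skip the bed file, one possible way to rebuild the same peakinfo table directly from site names is sketched below. The `site_names` vector is made up for illustration (in practice it might be something like rownames(indata)), and the regular expression assumes names of the form chr_start_end.

```r
# Hypothetical site names; in practice these might be rownames(indata).
site_names <- c("chr1_202921_203481", "chr1_565107_565550", "chrX_100_600")

# Split on the last two underscores so that chromosome names that
# themselves contain underscores are kept intact.
pattern <- "^(.+)_([0-9]+)_([0-9]+)$"

peakinfo <- data.frame(
  chr = sub(pattern, "\\1", site_names),
  bp1 = as.integer(sub(pattern, "\\2", site_names)),
  bp2 = as.integer(sub(pattern, "\\3", site_names)),
  site_name = site_names,
  row.names = site_names,
  stringsAsFactors = FALSE
)
print(peakinfo)
```

This yields the same four columns and row names as the bed-based snippet, without reading any file.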

hpliner commented 2 years ago

This is now implemented in the latest version. If you set return_agg_info to TRUE, make_cicero_cds returns a list: the first element is the cicero_cds and the second is a data.frame with the aggregation info:

cicero_cds_temp <- make_cicero_cds(input_cds,
                                   reduced_coordinates = tsne_coords,
                                   silent = TRUE,
                                   summary_stats = c("num_genes_expressed"),
                                   return_agg_info = TRUE,
                                   size_factor_normalize = FALSE)
cicero_cds2 <- cicero_cds_temp[[1]]
agg_info <- cicero_cds_temp[[2]]

Give it a try!
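Once the aggregation info is available, mapping back to the original cells could look roughly like the sketch below. It uses a toy stand-in for agg_info, and the column names `agg_cell` and `cell` are assumptions rather than something confirmed in this thread, so inspect names(agg_info) on the real object and adjust accordingly.

```r
# Toy stand-in for agg_info; the column names "agg_cell" and "cell" are
# assumptions, so check names(agg_info) on the real object first.
agg_info <- data.frame(
  agg_cell = c("agg1", "agg1", "agg2", "agg2", "agg2"),
  cell     = c("cellA", "cellB", "cellB", "cellC", "cellD"),
  stringsAsFactors = FALSE
)

# Original cell -> every aggregate it belongs to (can be more than one).
cell_to_aggs <- split(agg_info$agg_cell, agg_info$cell)

# Aggregate -> its member cells.
agg_to_cells <- split(agg_info$cell, agg_info$agg_cell)

print(cell_to_aggs[["cellB"]])
print(agg_to_cells[["agg2"]])
```

Keeping the result as a list rather than a single column reflects the point made earlier in the thread: a cell can fall into several aggregates, so the mapping does not fit into one pData column.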