TobiTekath / DTUrtle

Perform differential transcript usage (DTU) analysis of bulk or single-cell RNA-seq data. See documentation at:
https://tobitekath.github.io/DTUrtle
GNU General Public License v3.0
17 stars 3 forks source link

No common cellnames in Seurat and transcript level counts #14

Closed LacquerHed closed 1 year ago

LacquerHed commented 1 year ago

Hi, I am using the following code and consistently getting the same error:

library(DTUrtle)
biocpar <- BiocParallel::MulticoreParam(10)

tx2gene <- import_gtf(gtf_file = "/Users/rbronste/Downloads/gencode.vM24.annotation.gtf")

head(tx2gene, n=3)

tx2gene$gene_name <- one_to_one_mapping(name = tx2gene$gene_name, id = tx2gene$gene_id)

tx2gene$transcript_name <- one_to_one_mapping(name = tx2gene$transcript_name, id = tx2gene$transcript_id)

tx2gene <- move_columns_to_front(df = tx2gene, columns = c("transcript_name", "gene_name"))

head(tx2gene, n=5)

list.files("/Users/rbronste/Documents/DTUrtle/alevin")

files <- Sys.glob("/Users/rbronste/Documents/DTUrtle/alevin/alevin_output_*/alevin/quants_mat.gz")
names(files) <- basename(gsub("/alevin/quants_mat.gz", "", files))

cts_list <- import_counts(files = files, type = "alevin", tx2gene = tx2gene)

cts <- combine_to_matrix(tx_list = cts_list, cell_extension_side = "prepend")

dim(cts)

tiss <- combine_to_matrix(tx_list = cts_list, seurat_obj = Data.combined.RNA.sct,cell_extensions = c("WT_RNA", "KO_RNA"), tx2gene = tx2gene, cell_extension_side = "prepend")
Found overall duplicated cellnames. Trying cellname extension per sample.
Map extensions:
    alevin_output_sc151 --> 'WT_RNA_'
    alevin_output_sc161 --> 'KO_RNA_'

Merging matrices
Error: No common cellnames in Seurat and transcript level counts. Did you try specifyin a cellname extension?

Not quite sure why its not seeing common names. Thanks.

TobiTekath commented 1 year ago

Hi @LacquerHed,

thank you very much for using DTUrtle and reaching out.

DTUrtle appears to be unable to match the cellnames between the samples in the cts_list and your Seurat object.

To better undestand what is going on, could you provide the output of each of the following commands:

head(Cells(Data.combined.RNA.sct))
lapply(cts_list, function(x) head(rownames(x)))

Basically, DTUrtle expects that the CellIDs in the sample alevin_output_sc151 are of the format "WTRNABarcode", and for alevin_output_sc161 of the format "KORNABarcode".

Can you confirm that that is the case?

Best, Tobi

LacquerHed commented 1 year ago

Thanks for the reply!

Yes it seems when I run the commands you suggested I do get different outputs, not quite sure why this would be as I preprocessed the Cellranger output as the tutorial suggested. It seems that I can retrieve the barcodes from the alevin output (third example below) but as colnames.

> head(Cells(Data.combined.RNA.sct))
[1] "WT_RNA_AAACCCAAGCAAGTCG-1" "WT_RNA_AAACCCAAGCGAGAAA-1" "WT_RNA_AAACCCAAGCTGTCCG-1" "WT_RNA_AAACCCAAGGCGCTTC-1"
[5] "WT_RNA_AAACCCAAGGCTATCT-1" "WT_RNA_AAACCCAAGGTACCTT-1"
> lapply(cts_list, function(x) head(rownames(x)))
$alevin_output_sc151
[1] "4933401J01Rik-201" "Gm26206-201"       "Xkr4-203"          "Xkr4-202"          "Xkr4-201"         
[6] "Gm18956-201"      

$alevin_output_sc161
[1] "4933401J01Rik-201" "Gm26206-201"       "Xkr4-203"          "Xkr4-202"          "Xkr4-201"         
[6] "Gm18956-201"      
> lapply(cts_list, function(x) head(colnames(x)))
$alevin_output_sc151
[1] "AAGCAGTGGTATCAAC" "CCTAACCAGGTTCTTG" "CATGGTATCATCGGGC" "CTTCGGTTCGGTCGAC" "AAGACAAGTGTGTCCG" "GCTTTCGGTTCTCTCG"

$alevin_output_sc161
[1] "AAGCAGTGGTATCAAC" "GAGGGATGTTCGTTCC" "TAGGAGGAGCATAGGC" "GACAGCCAGACGCCAA" "GGTGGCTAGCGGGTAT" "TGCTGAACACTCATAG"
LacquerHed commented 1 year ago

Any help would be greatly appreciated, it seems the issue is that the rownames of the cts_list are the transcript IDs while the Cells(Data.combined.RNA.sct) are the barcodes, not sure how can fix this? I tried with multiple Seurat versions and get the same error. Thanks!

TobiTekath commented 1 year ago

Please excuse my response time, and thank you for providing me with the requested information. As you already guessed, I was actually interested in the colnames of your cts_list, thank you for spotting my mistake there.

You see, your Seurat cellbarcodes (e.g. "WT_RNA_AAACCCAAGCAAGTCG-1") have to match with your cts_list barcodes (e.g. AAGCAGTGGTATCAAC), which is not the case if "only" "WTRNA" or "KO_RNA" are prepended.

Somehow your cells have both an prepended and appended "uniqueness" string attached, which probably is a bit redundant. Just so we do not make a mistake here, could you quickly confirm that, by running `table(gsub(".*-","",Cells(Data.combined.RNA.sct))) only "1"s and "2"s are retourned (and with the abundace of cells in the two cts_lists, respectively)? If, and only if so, we can quickly fix the error with the following commands:

#we fix the cellbarcodes in the cts_list
colnames(cts_list[[1]]) <- paste0("WT_RNA_", Cells(cts_list[[1]]),  "-1")
colnames(cts_list[[2]]) <- paste0("KO_RNA_", Cells(cts_list[[2]]),  "-2")

#followed by a simpler variant of your previous command
tiss <- combine_to_matrix(tx_list = cts_list, seurat_obj = Data.combined.RNA.sct, tx2gene = tx2gene)

If the table() command did also return some other digits, there must be some slight error with the processing beforhand, as we do not expected duplicated barcodes from the same sample.

I hope you undestand why DTUrtle is not able to perform the requested command and hopefully your problem is solved now.

Best, Tobi

LacquerHed commented 1 year ago

Hi Toby,

Thanks for the tips. So the output of this command table(gsub(".*-","",Cells(Data.combined.RNA.sct))) is the following:

1 32961

I used your commands to fix the cell barcodes and now get this:

> head(Cells(Data.combined.RNA.sct)) [1] "WT_RNA_AAACCCAAGCAAGTCG-1" "WT_RNA_AAACCCAAGCGAGAAA-1" "WT_RNA_AAACCCAAGCTGTCCG-1" "WT_RNA_AAACCCAAGGCGCTTC-1" "WT_RNA_AAACCCAAGGCTATCT-1" [6] "WT_RNA_AAACCCAAGGTACCTT-1"

> lapply(cts_list, function(x) head(colnames(x))) $alevin_output_sc151 [1] "WT_RNA_AAGCAGTGGTATCAAC-1" "WT_RNA_CCTAACCAGGTTCTTG-1" "WT_RNA_GCTTTCGGTTCTCTCG-1" "WT_RNA_CATGGTATCATCGGGC-1" "WT_RNA_GAGTCATCAGGCACAA-1" [6] "WT_RNA_AAGACAAGTGTGTCCG-1" $alevin_output_sc161 [1] "KO_RNA_AAGCAGTGGTATCAAC-1" "KO_RNA_GAGGGATGTTCGTTCC-1" "KO_RNA_TAGGAGGAGCATAGGC-1" "KO_RNA_AGCTACACACTGAATC-1" "KO_RNA_GTGCGTGCAGGATGAC-1" [6] "KO_RNA_GGTGGCTAGCGGGTAT-1"

However when I run the final matrix command I still get the same error:

> tiss <- combine_to_matrix(tx_list = cts_list, seurat_obj = Data.combined.RNA.sct, tx2gene = tx2gene)
Trying to infer cell extensions from Seurat object
Map extensions:
    alevin_output_sc151 --> '_WT_RNA'
    alevin_output_sc161 --> '_KO_RNA'
    Merging matrices
Error: No common cellnames in Seurat and transcript level counts. Did you try specifyin a cellname extension?

Also Im not sure if I am supposed to do the metadata steps earlier in the tutorial with cts if Im using a Seurat object.

TobiTekath commented 1 year ago

Ah, we are slowly getting closer to a solution.

I forgot, that DTUrtle is really trying hard to guess the cellname extension from the Seurat object. So it still tries to estimate one, when not told otherwise. But the actual solution to your problem should now be quite easy:

Could you try the following commands?

# just add the appended cellname extension
cts_list <- lapply(cts_list, function(x){
colnames(x) <- paste0(colnames(x),  "-1")
return(x)
} 

# let DTUrtle do the rest
tiss <- combine_to_matrix(tx_list = cts_list, seurat_obj = Data.combined.RNA.sct,cell_extensions = c("WT_RNA", "KO_RNA"), tx2gene = tx2gene, cell_extension_side = "prepend")

I hope this finally fixes the problem.

What exactly do you mean with "metadata steps"? The tx2gene ID to Symbol mapping is recommended when using gene symbols in your analysis, regardless of using a Seurat object or not.

Best, Tobi

LacquerHed commented 6 months ago

Hi, had a really quick question. I am trying to figure out a way to integrate the matrix that contains both the transcript level counts and the total counts (regular matrix used in for instance Seurat analysis), wondering if there is a good way to do this?

Thanks! Rob.


Robert Bronstein, Ph.D. Postdoctoral Fellow Mallipattu Lab Stony Brook Medicine 101 Nicolls Road Stony Brook, NY 11794 (201)240-2534

On Jul 26, 2023, at 6:52 PM, Tobias Tekath @.***> wrote:

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.

Ah, we are slowly getting closer to a solution.

I forgot, that DTUrtle is really trying hard to guess the cellname extension from the Seurat object. So it still tries to estimate one, when not told otherwise. But the actual solution to your problem should now be quite easy:

Could you try the following commands?

just add the appended cellname extension

cts_list <- lapply(cts_list, function(x){ colnames(x) <- paste0(colnames(x), "-1") return(x) }

let DTUrtle do the rest

tiss <- combine_to_matrix(tx_list = cts_list, seurat_obj = Data.combined.RNA.sct,cell_extensions = c("WT_RNA", "KO_RNA"), tx2gene = tx2gene, cell_extension_side = "prepend")

I hope this finally fixes the problem.

What exactly do you mean with "metadata steps"? The tx2gene ID to Symbol mapping is recommended when using gene symbols in your analysis, regardless of using a Seurat object or not.

Best, Tobi

— Reply to this email directly, view it on GitHubhttps://github.com/TobiTekath/DTUrtle/issues/14#issuecomment-1652647158, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ATO5CXLM2VZOSAN34RN6JDLXSGNRXANCNFSM6AAAAAA2TDHAC4. You are receiving this because you were mentioned.Message ID: @.***>

This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by e-mail and destroy all copies of the original.