Closed njtourtillott closed 2 years ago
Hi,
thank you for using DTUrtle and reaching out.
Using Kallisto transcript-level counts is very close to the Salmon procedure described in the vignettes. As an example, we will follow the Hoffman et al. human bulk RNA-seq pre-processing vignette.
Analogous to the vignette, we can create a Kallisto index with:
mkdir kallisto
cd kallisto/
kallisto index -i index ../gencode.v34.transcripts.fa
We can then use our FASTQ files for the Kallisto quantification:
for i in bulk_*_1.fq.gz; do kallisto quant -i ../kallisto/index -o "../kallisto/${i/_1_val_1.fq.gz/}/" --threads 4 "$i" "${i/_1_val_1/_2_val_2}"; done
The quantification will create an abundance.h5 and an abundance.tsv file per paired-end sample. Both can be used by DTUrtle for data import, though the .h5 read-in is supposed to be faster.
Now we can import the quantification counts into R with DTUrtle, analogous to the Hoffman et al. human bulk RNA-seq analysis vignette:
# import transcript to gene mapping info
tx2gene <- import_gtf(gtf_file = "../gencode.v34.annotation.gtf")
# collect files - here we use the abundance.h5 files
files <- Sys.glob("../kallisto/bulk_*/abundance.h5")
names(files) <- gsub(".*/","",gsub("/abundance.h5","",files))
# read in the data. We supply ignoreAfterBar=TRUE as Kallisto keeps the full Gencode transcript IDs.
cts <- import_counts(files, type = "kallisto", tx2gene=tx2gene[,c("transcript_id", "gene_name")], ignoreAfterBar=TRUE)
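The sample names assigned to `files` above are simply the names of the directories that hold each abundance file. A minimal sketch with made-up paths (the `bulk_D1`/`bulk_D2` directories are hypothetical) shows an equivalent base R way to derive them and a quick uniqueness check:

```r
# Hypothetical example paths; in the real analysis they come from Sys.glob().
files_toy <- c("../kallisto/bulk_D1/abundance.h5",
               "../kallisto/bulk_D2/abundance.h5")

# Name each file after the directory that contains it.
# This gives the same result as the nested gsub() calls used above.
names(files_toy) <- basename(dirname(files_toy))

# Sample names should be unique, otherwise samples cannot be told apart.
stopifnot(!anyDuplicated(names(files_toy)))
print(names(files_toy))
```

Checking the names before the import can save a re-run if two samples accidentally share a directory name.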
When using a Gencode reference as in the example, you might be interested in changing the transcript names in the cts object, as Kallisto does not cut Gencode transcript names (unlike Salmon with the --gencode flag). This can be done with:
rownames(cts) <- gsub("\\|.*", "", rownames(cts))
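To see what this substitution does, here is a small sketch on a toy matrix. The identifiers are shortened, Gencode-style examples and the counts are made up; only the `gsub()` call is taken from above:

```r
# Toy count matrix with shortened Gencode-style rownames (illustrative only).
ids <- c("ENST00000456328.2|ENSG00000223972.5|DDX11L1-202|DDX11L1|",
         "ENST00000450305.2|ENSG00000223972.5|DDX11L1-201|DDX11L1|")
cts_toy <- matrix(1:4, nrow = 2,
                  dimnames = list(ids, c("bulk_D1", "bulk_D2")))

# Drop everything from the first "|" onwards, keeping only the transcript ID.
rownames(cts_toy) <- gsub("\\|.*", "", rownames(cts_toy))

# After trimming, the identifiers should still be unique.
stopifnot(!anyDuplicated(rownames(cts_toy)))
print(rownames(cts_toy))
```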
I hope that answers your question.
Best, Tobi
Hi Tobi,
First of all, thank you for your answer, it was really useful in setting things up!
However, we ran into another problem, and hope that you could help us again. Inputting the data works fine, and DTUrtle successfully reads all the files necessary, but an error happens when using DRIMSeq.
dturtle <- run_drimseq(counts = cts, tx2gene = tx2gene, pd=pd, id_col = "id", cond_col = "group", filtering_strategy = "bulk")
outputs
Error: The provided counts names and tx2gene names do not match.
Counts names: ENSMUST00000178537.2, ENSMUST00000178862.2, ENSMUST00000196221.2, ENSMUST00000179664.2, ENSMUST00000177564.2
Tx2gene names: Gm37671-201, Gm19087-201, Gm8941-201, Gm38212-201, Gm7449-201
After doing a little digging, it seems that our Kallisto output uses transcript IDs as row names, and these are carried over into the cts object. However, the DRIMSeq function seems to try to match these IDs against names from the GTF file instead of against transcript IDs.
We manually checked that the IDs in the Kallisto output are present in the GTF file, and they are. Is there a way for DTUrtle to match the counts names with the transcript IDs of the GTF file instead of the names? Or are we doing something wrong?
We attached a link to our files and script to this post, in case this was not informative enough.
https://drive.google.com/drive/folders/1X-mujsIee-4Iu-YLlx-4h6EPKkxbhx_2?usp=sharing
Thank you again for your help, it is greatly appreciated!
Hi,
I am glad I could already help you.
The run_drimseq function only takes the first two columns of the provided tx2gene data frame into account. As written in the documentation:
tx2gene: Data frame, where the first column consists of feature identifiers and the second column consists of corresponding gene identifiers. Feature identifiers must match with the rownames of the counts object. [...]
So, if you are fine with using transcript identifiers in your analysis, you just have to reorder the columns of the tx2gene data frame (you might want to use the move_columns_to_front() functionality for this).
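A quick way to find out which tx2gene column actually matches the counts, and then to reorder with plain base R, could look like the sketch below. The objects are toy stand-ins: the IDs and transcript names are copied from the error message, but their pairing and the gene names are assumed purely for illustration.

```r
# Toy stand-in for the tx2gene data frame (pairing and gene names are assumed).
tx2gene_toy <- data.frame(
  transcript_name = c("Gm37671-201", "Gm19087-201"),
  gene_name       = c("Gm37671", "Gm19087"),
  transcript_id   = c("ENSMUST00000178537.2", "ENSMUST00000178862.2"),
  stringsAsFactors = FALSE
)
# Toy stand-in for the count matrix, with transcript IDs as rownames.
cts_toy <- matrix(c(10, 0, 3, 7), nrow = 2,
                  dimnames = list(c("ENSMUST00000178537.2", "ENSMUST00000178862.2"),
                                  c("s1", "s2")))

# Fraction of count rownames found in each tx2gene column.
match_frac <- vapply(tx2gene_toy,
                     function(col) mean(rownames(cts_toy) %in% col),
                     numeric(1))
print(match_frac)

# Put the matching feature identifier first and the gene column second.
tx2gene_toy <- tx2gene_toy[, c("transcript_id", "gene_name", "transcript_name")]
stopifnot(all(rownames(cts_toy) %in% tx2gene_toy[[1]]))
```

The column with a match fraction of 1 is the one that belongs in the first position.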
If you want to switch to transcript names, you have to map the rownames of your cts object from the identifiers to the names. For example like this:
rownames(cts) <- tx2gene$transcript_name[match(rownames(cts), tx2gene$transcript_id)]
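One caveat with `match()` is that identifiers missing from tx2gene silently become `NA` rownames. The following sketch, again on toy data, checks for that before renaming. The unmatched ID `ENSMUST_NOT_IN_GTF.1` is hypothetical, and dropping unmapped rows is only one possible way to handle them:

```r
# Toy mapping table (pairing of IDs and names is assumed for illustration).
tx2gene_toy <- data.frame(
  transcript_id   = c("ENSMUST00000178537.2", "ENSMUST00000178862.2"),
  transcript_name = c("Gm37671-201", "Gm19087-201"),
  stringsAsFactors = FALSE
)
# Toy counts; the last rowname is a hypothetical ID absent from the mapping.
cts_toy <- matrix(1:6, nrow = 3,
                  dimnames = list(c("ENSMUST00000178537.2",
                                    "ENSMUST00000178862.2",
                                    "ENSMUST_NOT_IN_GTF.1"),
                                  c("s1", "s2")))

# Look up the transcript name for every count row.
new_names <- tx2gene_toy$transcript_name[match(rownames(cts_toy),
                                               tx2gene_toy$transcript_id)]

# Report rows without a mapping instead of letting them become NA.
unmapped <- is.na(new_names)
message(sum(unmapped), " transcript(s) without a name in tx2gene")

# Here we drop unmapped rows; keeping them under their ID would also be possible.
cts_toy <- cts_toy[!unmapped, , drop = FALSE]
rownames(cts_toy) <- new_names[!unmapped]
stopifnot(!anyDuplicated(rownames(cts_toy)))
```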
Hey Tobi,
I work with Nick and we have run into yet another error. With your previous advice, we got run_drimseq to run and start performing calculations. However, during its estimation of genewise precision we get the following error:
Would this simply be an issue with our data, or is there something we can change in our formatting or parameters that could remedy this error?
Thank you so much for all the help!
Hi,
Hmm, that looks like the precision estimation fails for some genes, which may indicate some extreme outliers in your data.
Looking at your provided data, I noticed two major points that you might want to look at again. I strongly believe those points will help us to better understand the error from above.
I hope this already helps.
Best, Tobi
Hello Tobi, first of all, thanks for your nice tool. When I use import_counts(files, type = "salmon") on transcript quantifications produced by Salmon, which column of quant.sf is used as the 'count' for the following analyses? TPM or the raw count number? Thanks again.
Hi,
My colleagues and I are having some trouble running DTUrtle with counts from Kallisto. We've tried adapting the presented Salmon code to work with Kallisto output, but we can't seem to get it working. Could you provide some sample code on how to import and format Kallisto data to work with this tool?
Thank you,