How to run using Kallisto outputs?

njtourtillott commented 2 years ago

Hi,

My colleagues and I are having some trouble running DTUrtle with counts from Kallisto. We've tried adapting the presented code to work with a kallisto output from the original salmon format, however we can't seem to get it. Could you provide some sample code on how to import and format Kallisto data to work with this tool?

Thank you,

TobiTekath commented 2 years ago

Hi,

thank you for using DTUrtle and reaching out.

The use of Kallisto transcript level counts is very close to the Salmon procedure described in the vignettes. As an example we will follow the Hoffman et al. human bulk RNA-seq pre-processing vignette

Analogous to the vignette, we can create a Kallisto index with

mkdir kallisto
cd kallisto/
kallisto index -i index ../gencode.v34.transcripts.fa

We can then use our FASTQ-files for the Kallisto quantification

for i in bulk_*_1.fq.gz; do kallisto quant -i ../kallisto/index -o ../kallisto/${i/_1_val_1.fq.gz/}/ --threads 4 $i ${i/_1_val_1/_2_val_2}; done

The quantification will create an abundance.h5 and an abundance.tsv file per paired-end sample. Both can be used by DTUrtle for data import, though the .h5 read-in is supposed to be faster.

Now we can import the quantification counts into R with DTUrtle, analogous to the Hoffman et al. human bulk RNA-seq analysis vignette

# import trancript to gene mapping info
tx2gene <- import_gtf(gtf_file = "../gencode.v34.annotation.gtf")

# collect files - here we use the abundance.h5 files
files <- Sys.glob("../kallisto/bulk_*/abundance.h5")
names(files) <- gsub(".*/","",gsub("/abundance.h5","",files))

# read-in the data. We supply ignoreAfterBar=TRUE as Kallisto uses whole gencode trancript ids.
cts <- import_counts(files, type = "kallisto", tx2gene=tx2gene[,c("transcript_id", "gene_name")], ignoreAfterBar=TRUE)

When using a Gencode reference as in the example, you might be interested in changing the transcript names in the cts object, as Kallisto does not cut Gencode transcript names (unlike Salmon with the --gencode flag). This can be done with:

rownames(cts) <- gsub("\\|.*", "", rownames(cts))

I hope that answers your question.

Best, Tobi

njtourtillott commented 2 years ago

Hi Tobi,

First of all, thank you for you answer, it was really useful in setting things up!

However, we ran into another problem, and hope that you could help us again. Inputting the data works fine, and DTUrtle successfully reads all the files necessary, but an error happens when using DRIMSeq.

dturtle <- run_drimseq(counts = cts, tx2gene = tx2gene, pd=pd, id_col = "id", cond_col = "group", filtering_strategy = "bulk")

outputs

Error: The provided counts names and tx2gene names do not match.

           Counts names: ENSMUST00000178537.2, ENSMUST00000178862.2, ENSMUST00000196221.2, ENSMUST00000179664.2, ENSMUST00000177564.2 

           Tx2gene names: Gm37671-201, Gm19087-201, Gm8941-201, Gm38212-201, Gm7449-201

After doing a little digging, it seems that our kallisto output uses transcript IDs as row/counts names, which then happens in the cts object as well. However, the DRIMSeq function seems to try and match these IDs with gene names from the GTF file, instead of matching them with transcript IDs.

We manually checked that the IDs in the kallisto output are present in the GTF file, and they are. Is there a way for DTurtle to try and match the counts names with the transcript IDs of the GTF file instead of the gene names? Or are we doing something wrong?

We attached a link to our files and script to this post, in case this was not informative enough.
https://drive.google.com/drive/folders/1X-mujsIee-4Iu-YLlx-4h6EPKkxbhx_2?usp=sharing

Thank you again for your help, it is greatly appreciated!

TobiTekath commented 2 years ago

Hi,

I am glad I could already help you.

The run_drimseq function only takes the first two columns of the provided tx2gene data frame into account. As written in the documentation:

tx2gene	Data frame, where the first column consists of feature identifiers and the second column consists of corresponding gene identifiers. Feature identifiers must match with the rownames of the counts object. [...]

So, if you are fine with using transcript identifiers in your analysis, you just have to reorder the columns of the tx2gene data frame (you might want to use the move_columns_to_front() functionality for this).

If you want to switch to transcript names, you have to map the rownames of your cts object from the identifiers to the names. For example like this:

rownames(cts) <- tx2gene$transcript_name[match(rownames(cts), tx2gene$transcript_id)]

Jcazaubiel commented 2 years ago

Hey Tobi,

I work with Nick and we have run into yet another error. With your previous advice we have gotten run_drimseq to run and start to perform calculations. However, during its estimation of genewise precision we get the following error: Error message Would this simply be an issue with our data or is there something we can change in our formatting or parameters that could remedy this error?

Thank you so much for all the help!

TobiTekath commented 2 years ago

Hi,

hmm, that looks like the precision estimation fails for some genes - maybe indicating some extreme outliers in your data.

Looking at your provided data, I noticed two major points, you might want to have a look at again. I strongly believe those points will help us to better understand the error from above.

It looks like you are not using the same reference annotation in kallisto and in DTUrtle. Your kallisto counts do contain some transcripts, which are not present in the provided gencode reference. DTUrtle is able to deal with that situation by subsetting the features, but I would strongly suggest to use the same reference in both steps. Otherwise you could experience some strange artifact results because of reference interference.
Am I seeing correctly, that you are planning to only compare two samples? Or did you pool your samples of the different groups prior to kallisto analysis? Please note, that results of a 1 vs 1 comparison are statistically not usable. The statistical background operations rely on comparing the intra-group variance to the inter-group variance - which is inherently impossible in your provided setting. If you truly only have one sample per group, please do not try to use (and interpret) results of statistical tests, but be content with a more descriptive analysis.

I hope this already helps.

Best, Tobi

yuchen345 commented 1 year ago

Hello Tobi, Thanks for your nice tool firstly. I am wondering that when i use

import_counts(files, type = "salmon")

,where the transcript quantification was counted with salmon, which column of quant.sf is considered as 'count' for following analyses? TPM or raw count number? Thanks again.

TobiTekath / DTUrtle

How to run using Kallisto outputs? #5