BD2KGenomics / toil-rnaseq

UC Santa Cruz Computational Genomics Lab's Toil-based RNA-seq pipeline
Apache License 2.0
40 stars 10 forks source link

Where is transcriptome_hg38_gencodev23.fasta from? #170

Closed wangshun1121 closed 5 years ago

wangshun1121 commented 5 years ago

Dear Dr. Vivian:

In your Method introduction, you mentioned Kallisto index was made using following command:

sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/kallisto index -i hg38.gencodeV23.transcripts.idx transcriptome_hg38_gencodev23.fasta

Now, I am trying to make similar index using salmon. But I wonder where istranscriptome_hg38_gencodev23.fasta from? Is it gencode.v23.transcripts.fa.gz , or gencode.v23.pc_transcripts.fa.gz , or otherwise you made by yourself with other tools such as gffread?

This issue is essential, since most RNA seq data sets on TCGA are from ployA libraries. The reference file should include sequences with polyA such as coding genes, immunoglobulins, and even some pseudogenes, while sequences without polyA such as lncRNAs should be removed. gencode.v23.pc_transcripts.fa, in my opinion, is suitable for reference, rather than full transcript sets.

There is no transcriptome_hg38_gencodev23.fasta on GENCODE's ftp, then I wonder where is this file from?

wangshun1121 commented 5 years ago

gencode.v23.transcripts.fa.gz: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.transcripts.fa.gz

gencode.v23.pc_transcripts.fa.gz: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/gencode.v23.pc_transcripts.fa.gz

wangshun1121 commented 5 years ago

You mentioned that:

Gencode v23 annotations were downloaded from Gencode. Comprehensive gene annotation (Regions=CHR) GTF was used to generate all additional reference input data.

However, using your Kallisto index, I tried to quantile transcript expression data. Here I'd like to share some transcripts are in Kallisto index, but not included in CHR comprehensive gene annotation:

ENST00000383212.4
ENST00000211372.9
ENST00000383108.6
and so on...

These transcripts can be seen in gencode.v23.chr_patch_hapl_scaff.annotation.gtf.gz

There are totally 15208 transcripts included in Kallisto index, but not included in gene annotation file.

jvivian commented 5 years ago

Dear @wangshun1121 ,

Thank you for your inquiry. I apologize this was not properly explained in the paper — I'll add a note in the documentation which reflects the following reply I made in an email a few months ago:

But there are also 15,208 extra transcripts in Xena set of identifiers while not in gencode v23 which I don’t understand how it is possible.

"When I was preparing for the large-scale compute I had a request to use the (comprehensive) superset of Gencode v23 which contains additional transcripts (second row) — in hindsight, I likely wouldn't have done this because it means the transcript results between RSEM and Kallisto are not 1:1 and it's confusing in general. To get the transcriptome to work with Kallisto I also had to remove the overlapping genes from the PAR locus (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415). The reference genome I used also had the alt alleles removed for mapping reasons.

If you check the transcripts that don't match the Gencode v23 CHR GTF file, they should almost all be alt alleles. "

There is no transcriptome_hg38_gencodev23.fasta on GENCODE's ftp, then I wonder where is this file from?

I made it myself with a BedToFasta tool that I've now replaced with TopHat2's "GTFtoFasta" tool, which you can see in the workflow's input generation here

I hope this answers your question and I apolgize for the confusion caused. Let me know if you have any other questions I can help with.