epigeneticstoocean / 2017OAExp_Oysters


RSEM index command created erroneous transcript IDs #2

Open adowneywall opened 5 years ago

adowneywall commented 5 years ago

I created an RSEM index from the eastern oyster genome GTF file, keeping only transcripts from the Gnomon source (this appears to exclude only <100 transcripts that came from an RNAseq source and were causing problems in downstream analysis because of how they were labelled). I used the rsem-prepare-reference command with the following code:

rsem-prepare-reference \
--gtf /shared_lab/20180226_RNAseq_2017OAExp/RNA/references/gene_annotation/KM_CV_genome_edit_Gnomon.gtf \
--star -p 8 \
/shared_lab/20180226_RNAseq_2017OAExp/RNA/references/genome/GCF_002022765.2_C_virginica-3.0_genomic.fna \
/shared_lab/20180226_RNAseq_2017OAExp/RNA/references/RSEM_gnomon/RSEM_gnomon
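The filtering step that produced the Gnomon-only GTF is not shown in this issue. As a rough sketch, assuming a standard tab-separated GTF with the source in column 2, a filter along these lines would keep only Gnomon records (the function name `filter_gtf_by_source` is hypothetical):

```shell
# Keep comment lines and records whose source (column 2) matches the given name.
# $1 = source name, e.g. Gnomon; $2 = path to the input GTF.
filter_gtf_by_source() {
    awk -F'\t' -v src="$1" '/^#/ || $2 == src' "$2"
}
```

Usage would look like `filter_gtf_by_source Gnomon input.gtf > output.gtf`, where both file names are placeholders.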

The issue appears to be that the generated *transcripts.fa file contains transcripts with names like gene9989_gene9989, when I would expect something like rna16911_gene9988. There appear to be a couple hundred of these erroneous transcripts in the .fa file, and they consistently have the duplicated gene ID pattern in the name (e.g. gene100_gene100).
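One way to list the affected entries is to scan the FASTA headers for names whose two underscore-separated parts are identical. This is a minimal sketch, assuming every header has the form `>ID_ID` optionally followed by a description; the function name `find_duplicated_ids` is hypothetical:

```shell
# Print header names of the form X_X (e.g. gene100_gene100) from a FASTA file.
# $1 = path to the FASTA file, e.g. the *transcripts.fa produced by RSEM.
find_duplicated_ids() {
    grep '^>' "$1" | awk '{
        name = substr($1, 2)            # drop the leading ">"
        n = split(name, parts, "_")     # split the name on underscores
        if (n == 2 && parts[1] == parts[2]) print name
    }'
}
```

Piping the output through `wc -l` gives the number of affected transcripts.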

This created some problems downstream in the salmon quant step, which mapped non-trivial numbers of reads to those transcripts. I am still checking to confirm that this issue did not also cause problems with the RSEM quantification.

I'll be looking into what is creating this issue and if it is a command problem or an issue with the reference files we used.

adowneywall commented 5 years ago

Update: This issue appears to be in the gene annotation (.gtf) file from NCBI. On initial examination, there are 6505 transcripts belonging to 677 genes where the transcript ID is labeled as a gene.

Top ten examples:

3196    NC_035780.1 Gnomon  transcript  2215606 2223571 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3197    NC_035780.1 Gnomon  exon    2215606 2215998 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3198    NC_035780.1 Gnomon  exon    2216590 2216764 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3199    NC_035780.1 Gnomon  exon    2216978 2217220 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3200    NC_035780.1 Gnomon  exon    2218775 2218835 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3201    NC_035780.1 Gnomon  exon    2221564 2221709 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3202    NC_035780.1 Gnomon  exon    2221868 2222168 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3203    NC_035780.1 Gnomon  exon    2222443 2222559 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3204    NC_035780.1 Gnomon  exon    2222660 2223211 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
3205    NC_035780.1 Gnomon  exon    2223565 2223571 .   -   .   transcript_id gene115; gene_id gene115; gene_name LOC111129349;
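A check like the one described above can be sketched in awk. It assumes a standard nine-column, tab-separated GTF (without the leading row number shown in the examples) and treats any `transcript_id` that starts with `gene` as mislabeled. The function name `count_gene_labelled_transcripts` is hypothetical, and this is not necessarily how the counts above were obtained:

```shell
# Print two tab-separated numbers: the count of GTF records whose transcript_id
# starts with "gene", and the count of distinct gene_id values among them.
# $1 = path to the GTF file.
count_gene_labelled_transcripts() {
    awk -F'\t' '
        $0 !~ /^#/ && NF >= 9 {
            tid = ""; gid = ""
            n = split($9, attrs, ";")          # attributes are ";"-separated
            for (i = 1; i <= n; i++) {
                a = attrs[i]
                gsub(/"/, "", a)               # tolerate quoted or unquoted values
                split(a, kv, " ")              # kv[1] = key, kv[2] = value
                if (kv[1] == "transcript_id") tid = kv[2]
                if (kv[1] == "gene_id") gid = kv[2]
            }
            if (tid ~ /^gene/) { records++; genes[gid] = 1 }
        }
        END {
            ngenes = 0
            for (g in genes) ngenes++
            printf "%d\t%d\n", records, ngenes
        }
    ' "$1"
}
```

Note that this counts every matching record, including exon lines. To count transcripts only, restrict the match to rows where column 3 is `transcript`.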