Matching Transcript ID to Gene ID - Lots of unmatched transcripts

aspencoyle commented 3 years ago

I'm looking to take the tables of transcript IDs produced by my DESeq2 analysis (example available here and match them with the accession IDs for version 3.0 of C. bairdii transcriptome.

I'm using this table to match genes to transcripts., found under BLASTx Annotation for cbai_hemat_transcriptome_v3.0 on the Genomic Resources page.

However, when I do, I get a lot of unmatched transcripts. Of the 1022 significantly different transcripts in my ambient vs. low treatments, 641 of them aren't matched to a particular gene. During our Science Hour chat, I thought I remembered hearing that it's unusual for a transcript to not be matched to a gene. Am I just misremembering? And if not, what's going on here?

Don't think my code is causing the issue, but posting it below to be safe:

# Filepath that leads to DESeq2 output file
deseq_output_filepath <- "C:/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/graphs/amb_v_low_day02/Amb_vsLow_DEGlist_wcols.txt"
# Filepath that leads to transcript ID/gene ID table
blast_filepath <- "C:/Users/acoyl/Documents/GitHub/hemat_bairdii_transcriptome/data/cbai_hemat_diamond_blastx_table_transcriptome_v3.0.txt"

# Read output file into R
transcript_data <- read.table(deseq_output_filepath, 
                              header = TRUE, sep = "\t")
# Transcript IDs are rownames - move them into first column
transcript_data <- tibble::rownames_to_column(transcript_data, 
                                              "Transcript_ID")
# Read BLAST data into R
blast_data <- read.table(blast_filepath, header = FALSE,
                         sep = "\t")
# Columns have no names - add names for first two columns
colnames(blast_data)[1:2] <- c("Transcript_ID", "Gene_ID")

# Turn the first two columns of BLAST data into a Transcript ID/Gene ID key
blastkey <- blast_data %>%
  select(Transcript_ID, Gene_ID)

# Add Gene ID column to transcript data, using Transcript ID column to match
transcript_data <- left_join(transcript_data, blastkey, by = "Transcript_ID")

# Select only the Transcript ID and Gene ID columns
transcript_key <- transcript_data[,c("Transcript_ID", "Gene_ID")]
length(transcript_key$Transcript_ID)
sum(is.na(transcript_key$Gene_ID))

EDITED: Added code block formatting.

sr320 commented 3 years ago

What percentage of the entire transcriptome is annotated based on the annotation file you are using--- and how does that percentage compare to your DEG list?

aspencoyle commented 3 years ago

For my 4 DEG lists combined, 1061/2930 differently-expressed genes are annotated (36.2%)

The annotation file I'm using here - labeled as BlastX annotation for the transcriptome has 66,596 transcripts, and all are annotated.

The whole transcriptome (linked on the Genomic Resources page as cbai_transcriptome_v3.0.fasta) has 344,944 sequences, so 19.4% are annotated.

I'm also using transcriptome v3.0 (pooled libraries only) rather than transcriptome v2.0 (pooled libraries + individual libraries), as my computer couldn't handle creating a Kallisto index for v2.0 - maybe that could be the source of some of the disparity?

sr320 commented 3 years ago

Seems like only 19% of transcriptome is annotated-- you got 36% so you are doing good. Did you annotate v3? If not you should to get aware of ins and outs.

aspencoyle commented 3 years ago

Alright, great! And nope, I didn't annotate v3 myself - I'll start on that using Trinotate!

RobertsLab / resources

Matching Transcript ID to Gene ID - Lots of unmatched transcripts #1057