AndersenLab / annotation-nf

Annotate VCF with snpeff and bcsq
1 stars 1 forks source link

Instead of common gene name, gene column of the annotation file is displaying one transcript name #3

Closed mckeowr1 closed 2 years ago

mckeowr1 commented 2 years ago

The process make_flat_file takes a modified gff for each species from the species reference dir. Ex) c_elegans.gff.

gene    chr     strand  txstart txend   wbgene  gene_name       biotype type
Y74C9A.3.1      I       -       4116    10230   WBGene00022277  homt-1  protein_coding  Transcript
Y74C9A.2a.1     I       +       11495   16837   WBGene00022276  nlp-40  protein_coding  Transcript
Y74C9A.2a.3     I       +       11495   16793   WBGene00022276  nlp-40  protein_coding  Transcript
Y74C9A.2a.2     I       +       11618   16833   WBGene00022276  nlp-40  protein_coding  Transcript
Y74C9A.2b.1     I       +       15103   16585   WBGene00022276  nlp-40  protein_coding  Transcript
Y74C9A.4a.1     I       -       17484   26781   WBGene00022278  rcor-1  protein_coding  Transcript
Y74C9A.4b.1     I       -       17487   26781   WBGene00022278  rcor-1  protein_coding  Transcript
Y74C9A.4d.1     I       -       17911   21127   WBGene00022278  rcor-1  protein_coding  Transcript
Y74C9A.4c.1     I       -       17911   26643   WBGene00022278  rcor-1  protein_coding  Transcript

So the gene col actually contains the transcript name.

This file is joined to an early version of the flat file

add_gene <- cleaned_flat_file %>%
  dplyr::left_join(name_key, by = c( "GENE" = "wbgene")) %>%
  dplyr::rename("WORMBASE_ID" = "GENE") %>% 
  dplyr::rename("GENE" = "gene_name")

All of this should be fine. The problem is that the gene column of this gff doesn't give the gene name if there is no common name ex (mex-3) it gives the transcript name.

My quick fix for this is to just edit the transcript name down to the gene name.

According to Wormbase nomenclature:

The CDS transcript name is "derived from the same Sequence Name as their parent Gene object, so the gene F38H4.7 has a CDS called F38H4.7." Isoforms of a transcript are denoted like this:

"The gene bli-4 has 10 known CDS isoforms, called K04F10.4a, K04F10.4b, K04F10.4c, K04F10.4d, K04F10.4e, K04F10.4f, K04F10.4g, K04F10.4h, K04F10.4i, and K04F10.4j."

Heres the fix and some test data

gene    chr strand  txstart txend   wbgene  gene_name   biotype type
Y48G1C.12.1 I   +   47461   49860   WBGene00044345  Y48G1C.12.1 protein_coding  Transcript
Y48G1C.4a.1 I   +   49921   54426   WBGene00021677  pgs-1   protein_coding  Transcript
Y48G1C.4b.1 I   +   52370   54360   WBGene00021677  pgs-1   protein_coding  Transcript
Y48G1C.5.1  I   -   55293   64066   WBGene00021678  Y48G1C.5.1  protein_coding  Transcript
W05F2.4f.1  I   -   3375099 3402811 WBGene00021036  W05F2.4f.1  protein_coding  Transcript
name_key <- data.table::fread("test.tsv") %>%
  dplyr::select(wbgene, gene_name)

name_key  %>% mutate(gene_name = ifelse(str_detect(gene_name, "\\."), str_extract(gene_name, pattern = "\\w+\\.\\d"), gene_name)) %>% unique()