Putnam-Lab / Lab_Management

13 stars 7 forks source link

Expression Comparison Meeting: Developing QC step for bioinformatic workflow when we encounter "name space" problems prior to downstream annotation analyses #28

Closed daniellembecker closed 2 years ago

daniellembecker commented 3 years ago

All work and notes will be updated in Danielle's GO Term Enrichment Script.

General statement of the problem:

Danielle encountered naming issues when she was at her final Kegg enrichment analysis when trying to combine her functional annotation file and her KofamScan output Kegg ontology file. Many of the gene_id names included novel or split nomenclature in the gene_id that was not matching across files.

After we discovered the issue we knew the main inconsistencies and were able to check them with code:

The naming structure and format for gene_ids were checked in the structural annotation file (.gff3), functional annotation file (.txt), and Kegg ontology file (.tsv).

#Check Gene name format and number

unique(GO.data$gene_id)
novel1 <- Annot %>%
  filter(str_detect(gene_id, "novel"))

split1 <- Annot %>%
  filter(str_detect(gene_id, "split"))

remain1 <- Annot %>%
  filter(!str_detect(gene_id, "novel")) %>% filter(!str_detect(gene_id, "split"))

nrow(novel1)+nrow(split1)+nrow(remain1) #should =27439

#There are two gene ID formats in the Pver_genome_annotation file
#Pver_gene_g#
#Pver_novel_gene_#_5de57afd
#Check Gene name format and number

novel <- GFF3 %>%
  filter(str_detect(gene_id, "novel"))

split <- GFF3 %>%
  filter(str_detect(gene_id, "split"))

remain <- GFF3 %>%
  filter(!str_detect(gene_id, "novel")) %>% filter(!str_detect(gene_id, "split"))

nrow(novel)+nrow(split)+nrow(remain) #should =27439

#There are 3 formats in the gff3
#Pver_gene_g#
#Pver_gene_novel_gene_#_5de57afd
#Pver_gene_split_gene_g1652-g72
#Check Gene name format and number

novel2 <- KFS.KO %>%
  filter(str_detect(gene_id, "novel"))
length(unique(novel2$gene_id))

split2 <- KFS.KO %>%
  filter(str_detect(gene_id, "split"))
length(unique(split2$gene_id))

remain2 <- KFS.KO %>%
  filter(!str_detect(gene_id, "novel")) %>% filter(!str_detect(gene_id, "split"))
length(unique(remain2$gene_id))

#There are 3 formats in the Pver_KO_annot.tsv.gz
#Pver_g1.t2
#Pver_novel_model_1_5de57afd
#Pver_split_gene_g408-g408.t1-m18

Since there was so much variation, we decided it would be best to create some sort of QC step for our bioinformatic workflows that can mediate these issues more effectively and help us identify the inconsistencies between gene_ids early on before downstream analyses.

Meeting notes from 20210917 (Jill, Hollie, Danielle, and Megan):

goal is to check functional annotation with gff3 and KOfamscan output nomenclature for gene IDs across all files

step 1: confirm names of columns that contain gene IDs (manually changed column names to gene_IDs)

step 2: find the inconsistencies between the files (ex: identify the total number of expected genes; row number in the gff) sanity check with NCBI or manuscript total number of genes

          - length (unique) genes; now are the function from all files 
          - somehow organize gene ids into categories (ex: Pver_gene; Pver_split; Pver_novel)
          - some sort of match or total table
          - look at raw data for the syntax of the names to find general inconsistencies 
          - take gff3 and functional annotation and full_join those; repeat with any subsequent files you want to join
#testing how to see inconsistencies in gene_ID formats across gff3 and func. annot file

test.two.files <- full_join(GFF3, Annot, by = "gene_id")

#conclusion: we have x more genes than 28,153 when we expect 27,439, some unique gene ID names in annot that do not exist in genes

#testing how to see inconsistencies in gene_ID formats across gff3, func. annot, against KEGG terms file

test.three.files <- full_join(test.two.files, Kegg.name, by = "gene_id")

#conclusion: if 28,153 + 27,439, does not equal 55,592 so all the names are different from test two files to test three files for KEGG terms 

Next steps:

goal is to find the most effective way to join three files and then identify what the offsets are for the three files to each other, Megan suggested python blocking/grouping with key identifiers (Pver_g[0-9]) then you will know how many of those kinds of expressions are found in each of your lists; will tell you if they are fully missing

can sort everything into 'buckets' so that we can know how many general patterns there are before moving forward