dib-lab / 2021-paper-metapangenomes

should i add a figure for pseudogenes? #14

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago

For genomes that had at least species-level representatives in GTDB, the largest source of error was non-coding reads being predicted as coding (Figure @fig:orpheum_fig A). We hypothesized that these reads originated from pseudogenes, as these sequences would likely not be annotated as coding in the genomes from which the reads were simulated but may retain some k-mers contained in the database. To assess this hypothesis, we used annotation files produced by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP), which annotates pseudogenes, for the 23 genomes for which these files were available [@doi:10.1093/nar/gkw569; @doi:10.1093/nar/gkaa1105]. On average, 12.4% (SD = 13.8%) of non-coding reads that were predicted to be coding fell within pseudogenes annotated by PGAP.
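For anyone wanting to reproduce or plot this number, here is a rough sketch of how the per-genome percentage could be computed. It is not the code used for the paper. It assumes that "fell within" means a read lies entirely inside a pseudogene, that coordinates are half-open `(start, end)` pairs on a single contig, and that the pseudogene intervals have already been pulled out of the PGAP annotation. The function names and the toy coordinates are made up for illustration.

```python
from bisect import bisect_right
from statistics import mean, stdev


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def percent_in_pseudogenes(reads, pseudogenes):
    """Percent of reads fully contained in the union of pseudogene intervals.

    Assumption: containment (not partial overlap) is the criterion.
    """
    if not reads:
        return 0.0
    merged = merge_intervals(pseudogenes)
    starts = [s for s, _ in merged]
    hits = 0
    for r_start, r_end in reads:
        # Rightmost merged interval that begins at or before the read start.
        i = bisect_right(starts, r_start) - 1
        if i >= 0 and r_end <= merged[i][1]:
            hits += 1
    return 100.0 * hits / len(reads)


# Toy data only: two hypothetical genomes, not values from the paper.
toy = {
    "genome_a": {
        "pseudogenes": [(100, 200), (150, 300)],
        "reads": [(110, 160), (290, 310), (500, 550), (190, 250)],
    },
    "genome_b": {
        "pseudogenes": [(0, 50)],
        "reads": [(10, 20), (60, 70), (70, 80), (80, 90)],
    },
}

per_genome = {
    name: percent_in_pseudogenes(d["reads"], d["pseudogenes"])
    for name, d in toy.items()
}
summary_mean = mean(per_genome.values())
summary_sd = stdev(per_genome.values())  # sample SD across genomes
```

The `per_genome` values are the natural input for a figure: one point per genome, with the mean and SD reported alongside. If partial overlap should also count, the containment test would need to be relaxed, which would raise the percentages.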

olga commented: Is there a figure for noncoding reads in pseudogenes?