Scripts and a framework for evaluating annotation errors in user-selected gene families within a chosen taxonomic group. RefSeq Genomes and RefSeq Proteins serve as a "Standard Mean Reference" against which outlying annotation parameters of orthologous non-RefSeq genes of interest are identified.
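The "Standard Mean Reference" idea can be illustrated with a small sketch. This is not the code used by the scripts in this repository. It assumes a hypothetical CSV with the columns accession,is_refseq,length and a hypothetical function name flag_outliers. It computes the mean and sample standard deviation of the RefSeq rows only, then lists non-RefSeq rows that fall more than K standard deviations from that mean.

```shell
#!/usr/bin/env bash
# Sketch only: the column layout (accession,is_refseq,length) is hypothetical
# and is not the documented format of any CSV produced by this repository.

# flag_outliers FILE [K]
# Pass 1 collects mean and sample SD of length over RefSeq rows (is_refseq == "yes").
# Pass 2 prints "accession,length" for non-RefSeq rows more than K SDs from that mean.
flag_outliers() {
  local file=$1 k=${2:-2}
  awk -F, -v k="$k" '
    FNR == 1 { next }                 # skip the header line in both passes
    NR == FNR {                       # first pass: accumulate reference sums
      if ($2 == "yes") { n++; sum += $3; sumsq += $3 * $3 }
      next
    }
    !ready {                          # runs once, on the first data row of pass 2
      if (n < 2) { print "need at least two reference rows" > "/dev/stderr"; exit 1 }
      mean = sum / n
      var = (sumsq - n * mean * mean) / (n - 1)
      sd = (var > 0) ? sqrt(var) : 0
      ready = 1
    }
    $2 != "yes" && ($3 < mean - k * sd || $3 > mean + k * sd) { print $1 "," $3 }
  ' "$file" "$file"
}
```

Calling flag_outliers on such a file with K set to 2 lists only the non-RefSeq entries whose length lies outside the reference mean plus or minus two standard deviations. The file is read twice so that the reference statistics are complete before any row is tested.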
bash AssemblyStatsFromTaxa.sh <NCBI tax id>
Example: 9443 (Primates)
Output Example: (AssemblyStats.csv)
Rscript AssemblyStatsCompare.R
Produces a viewable PDF named Rplots.pdf
Output Example: (AssemblyStatsGraphs.md)
bash ProtStatsFromGeneID.sh <NCBI Gene Ortholog Id> <NCBI tax id>
Example: gene 29102 (Drosha), taxon 9989 (rodents)
Output Example: (ProtStats.csv)
Rscript ProtStatsCompare.R <txid 1> <txid 2> <NCBI Gene Ortholog Id>
** Takes the output from two different taxa (assuming the same orthology group) and compares them.
Output Example (Rodents/Primates): graphs (ProtStatsResults.md) and a list of protein sequences outside the standard deviation ranges (Prot_Abnormals.csv).
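The rodent/primate comparison can be chained as in the sketch below. It assumes that each taxon's statistics come from a separate run of ProtStatsFromGeneID.sh, and it uses the argument order shown in the usage lines above. The helper names run_step and compare_taxa and the RUN variable are hypothetical and are not part of this repository. By default the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch only: run_step, compare_taxa and RUN are illustrative names.

# Print the command by default; execute it only when RUN=1 is set.
run_step() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# compare_taxa GENE_ORTHOLOG_ID TAXID_1 TAXID_2
compare_taxa() {
  local gene=$1 tax1=$2 tax2=$3 arg
  # Reject empty or non-numeric ids before calling anything.
  for arg in "$gene" "$tax1" "$tax2"; do
    case $arg in
      ''|*[!0-9]*) echo "error: '$arg' is not a numeric NCBI id" >&2; return 1 ;;
    esac
  done
  # Assumption: one statistics run per taxon, then the comparison.
  run_step bash ProtStatsFromGeneID.sh "$gene" "$tax1" &&
  run_step bash ProtStatsFromGeneID.sh "$gene" "$tax2" &&
  run_step Rscript ProtStatsCompare.R "$tax1" "$tax2" "$gene"
}

# Drosha (29102) in rodents (9989) versus primates (9443), dry run.
compare_taxa 29102 9989 9443
```

Without RUN=1 the last line prints the three commands in order, which makes it easy to check the ids before anything is downloaded. Setting RUN=1 in the environment executes them from the directory that contains the scripts.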
bash GeneFastaFromHomologene.sh <Family name> <NCBI Homologene uid>
Example: bash GeneFastaFromHomologene.sh Drosha 8293
** Note: NCBI Gene Orthologs covers only vertebrates. HomoloGene has some limited coverage of invertebrate model organisms.
bash ProtFastaFromGene.sh <NCBI Gene uid>
bash AssemblyRefseqFastasByTax.sh <NCBI taxid>