NCBI-Hackathons / GeneFamTaxScan

A framework and family of scripts to evaluate molecular evolution (and misannotation) of gene ortholog groups across higher taxa.
MIT License

Scripts and framework for evaluating annotation errors in user-selected, taxonomically delimited gene families. Uses RefSeq Genomes and RefSeq Proteins as a "Standard Mean Reference" to identify outlying annotation parameters in orthologous non-RefSeq genes of interest.

GeneFamTaxScan

Steps:

1. Retrieve table (.csv) of Assembly stats for a specified higher taxon: (AssemblyStatsFromTaxa.sh)

bash AssemblyStatsFromTaxa.sh <NCBI tax id>

Example: 9443 (Primates)

Output Example: (AssemblyStats.csv)

2. Assembly Stats analysis (AssemblyStatsCompare.R)

Rscript AssemblyStatsCompare.R

Produces a viewable .pdf called Rplots.pdf

Output Example: (AssemblyStatsGraphs.md)

3. Retrieve table (.csv) of Protein stats for a specified gene ortholog group: (ProtStatsFromGeneID.sh)

bash ProtStatsFromGeneID.sh <NCBI Gene Ortholog Id> <NCBI tax id>

Example: gene 29102 (Drosha), tax id 9989 (Rodents)

Output Example: (ProtStats.csv)

4. Protein Stats analysis (ProtStatsCompare.R, reads output from ProtStatsFromGeneID.sh)

Rscript ProtStatsCompare.R <txid 1> <txid 2> <NCBI Gene Ortholog Id>

** Takes the output of ProtStatsFromGeneID.sh for two different taxa (assuming the same ortholog group) and compares them.

Output Example (Rodents/Primates) - Graphs: (ProtStatsResults.md); list of protein sequences outside standard deviation ranges: (Prot_Abnormals.csv).
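The idea of listing sequences "outside standard deviation ranges" can be sketched in plain shell and awk. This is only an illustration, not the logic of ProtStatsCompare.R: the function name `flag_outliers`, the two-column `id,value` CSV layout, and the default cutoff of 2 standard deviations are all assumptions made for the example.

```shell
# Hypothetical helper (not part of the repository): print the rows of a
# two-column CSV (id,value with a header on line 1) whose value lies more
# than K sample standard deviations from the mean of the column.
# Usage: flag_outliers <file.csv> [K]     (K defaults to 2)
flag_outliers() {
  awk -F',' -v k="${2:-2}" '
    NR == 1 { next }                              # skip the header row
    { id[NR] = $1; v[NR] = $2; sum += $2; n++ }   # store every data row
    END {
      if (n < 2) exit                             # SD is undefined for < 2 rows
      mean = sum / n
      for (i = 2; i <= NR; i++) ss += (v[i] - mean) * (v[i] - mean)
      sd = sqrt(ss / (n - 1))                     # sample standard deviation
      for (i = 2; i <= NR; i++)
        if (v[i] < mean - k * sd || v[i] > mean + k * sd)
          print id[i] "," v[i]
    }
  ' "$1"
}
```

Called as `flag_outliers lengths.csv 2`, where `lengths.csv` is a hypothetical table of accessions and protein lengths, it prints only the rows that fall outside the mean ± 2 SD window.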

5. Retrieve Gene .fastas for a given Homologene uid; pulls the gene sequence from the Assembly using chr_start, chr_stop positions. (GeneFastaFromHomologene.sh)

bash GeneFastaFromHomologene.sh <Family name> <NCBI Homologene uid>

Example: bash GeneFastaFromHomologene.sh Drosha 8293

** Note: Gene Orthologs only covers vertebrates. Homologene has some limited coverage of invertebrate model organisms.
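Pulling a gene sequence out of a larger sequence by start and stop positions can be sketched locally as follows. This is not how GeneFastaFromHomologene.sh is implemented; the function name `extract_region` is hypothetical, and the sketch assumes a single-record FASTA file, forward-strand coordinates, and 1-based inclusive positions (shift the start by one if your coordinates are 0-based).

```shell
# Hypothetical helper (not part of the repository): print the bases from
# START to STOP (1-based, inclusive, forward strand) of a single-record
# FASTA file.
# Usage: extract_region <file.fasta> <start> <stop>
extract_region() {
  awk -v s="$2" -v e="$3" '
    /^>/ { next }                           # skip the header line
    { seq = seq $0 }                        # join the wrapped sequence lines
    END { print substr(seq, s, e - s + 1) } # awk substr() is 1-based
  ' "$1"
}
```

For example, `extract_region chr.fasta 5 8` prints the four bases at positions 5 through 8 of the hypothetical file `chr.fasta`. Genes on the reverse strand would additionally need reverse-complementing, which this sketch leaves out.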

6. Retrieve Protein .fastas of given GeneIDs with associated RefSeq genomes. (ProtFastaFromGene.sh)

bash ProtFastaFromGene.sh <NCBI Gene uid>

7. Retrieve RefSeq Assembly .gz files for taxa of interest. (AssemblyRefseqFastasByTax.sh)

bash AssemblyRefseqFastasByTax.sh <NCBI taxid>

8. Make BLAST databases from Gene .fastas, RefSeq Protein .fastas, RefSeq Assembly .gz.
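One possible way to approach this step, assuming the BLAST+ `makeblastdb` tool is installed, is sketched below. The helper `blastdb_cmd` is hypothetical and only prints the command it would run, so the commands can be reviewed first; naming each database after its input file is also an assumption. Compressed input is streamed through `gunzip -c` into `makeblastdb -in -`, which needs an explicit `-title` and `-out`.

```shell
# Hypothetical helper (not part of the repository): print, without running
# it, a makeblastdb command for one FASTA file. Files ending in .gz are
# decompressed on the fly and read from standard input.
# Usage: blastdb_cmd <fasta or fasta.gz> <nucl|prot>
blastdb_cmd() {
  local fasta="$1" dbtype="$2" name
  name=$(basename "$fasta")
  name="${name%.gz}"      # drop .gz if present
  name="${name%.*}"       # drop the FASTA extension (.fasta, .fna, .faa)
  case "$fasta" in
    *.gz) echo "gunzip -c $fasta | makeblastdb -in - -dbtype $dbtype -title $name -out $name" ;;
    *)    echo "makeblastdb -in $fasta -dbtype $dbtype -out $name" ;;
  esac
}
```

Gene .fastas and RefSeq Assembly files would use `nucl`, and RefSeq Protein .fastas would use `prot`, for example `blastdb_cmd Drosha.fasta nucl`.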

9. Retrieve non-RefSeq Genome and Protein accessions from the Taxonomy subset of interest. Compare their meta-stats to "Reference" sequence SD values to find sequences outside the Reference ranges or with divergent BLAST results.

10. Visualize sequence comparisons (NCBI Genome Workbench).