NCBI-Hackathons / GeneFamTaxScan

A framework and family of scripts to evaluate molecular evolution (and misannotation) of gene ortholog groups, between higher taxa.
MIT License

Summary of Genome Assembly Stats #3

Closed PhyloGrok closed 6 years ago

PhyloGrok commented 6 years ago

User summary of genome assembly stats:

$ esearch -db genome -query "txid9443[Organism:exp]" | elink -target assembly | esummary | xtract -pattern DocumentSummary -element Organism SpeciesTaxid RefSeq_category AssemblyStatus -block Meta -element Stat | cut -f1,2,3,4,7,8,9 > Primates

$ esearch -db genome -query "txid9443[Orgn]" | elink -target assembly | esummary | xtract -pattern DocumentSummary -element Id Organism SpeciesTaxid AssemblyAccession RefSeq_category AssemblyStatus -block Meta -element Stat | tr "\t" "," > Primates.csv

## TODO: find a way to insert column headers taken from the Meta/Stat tags
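One simple option for the header row, as a minimal sketch: write the column names by hand and prepend them while converting tabs to commas. The function name `add_header` and the sample row are made up for illustration; the names are typed in, not derived from the Meta/Stat tags, and the conversion assumes no field contains a comma.

```shell
#!/bin/sh
# Prepend a header line to a tab-delimited table and emit CSV.
# Usage: add_header "col1,col2,..." FILE
# Assumption: no field in FILE contains a comma.
add_header() {
    header=$1   # comma-separated column names, typed by hand
    infile=$2   # tab-delimited table, e.g. saved xtract output
    printf '%s\n' "$header"
    tr '\t' ',' < "$infile"
}

# Hand-made one-row sample standing in for real xtract output.
sample=$(mktemp)
printf 'Homo sapiens\t9606\n' > "$sample"
add_header "Organism,SpeciesTaxid" "$sample"
rm -f "$sample"
```

With real data, the tab-delimited xtract output would be saved to a file first and passed as the second argument.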

This writes out a list of primate assemblies ("Primates") with Assembly db fields plus the statistics from the 'Meta' block of each assembly record

(columns = Organism, SpeciesTaxid, GbUid, AssemblyAccession, RefSeq_category, AssemblyStatus, contig_count, contig_l50, contig_n50)

- Next: format the output as .csv with column headers, then use R to make a graphical output of RefSeq vs. non-RefSeq assemblies (e.g. ANOVA of contig parameters between taxon levels; R/ggplot?)
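Before moving to R, a quick sanity check on the CSV could be done in the shell. The sketch below only prints the row count and mean of one numeric column per group; it is not an ANOVA. The function name `group_mean` is hypothetical, and it assumes a header row is already present and that no field contains a comma.

```shell
#!/bin/sh
# Row count and mean of a numeric column per group, for a CSV with a header row.
# Usage: group_mean FILE GROUP_COLUMN VALUE_COLUMN
# Output: group,count,mean (sorted by group)
group_mean() {
    awk -F',' -v g="$2" -v v="$3" '
        NR == 1 {
            # Locate both columns by name in the header row.
            for (i = 1; i <= NF; i++) {
                if ($i == g) gi = i
                if ($i == v) vi = i
            }
            if (!gi || !vi) { print "column not found" > "/dev/stderr"; exit 1 }
            next
        }
        # Skip rows where the value is missing.
        $vi != "" { sum[$gi] += $vi; n[$gi]++ }
        END { for (k in n) printf "%s,%d,%.1f\n", k, n[k], sum[k] / n[k] }
    ' "$1" | sort
}
```

For example, `group_mean Primates.csv RefSeq_category contig_n50` would give one line per RefSeq category, provided those two names appear in the header row.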

PhyloGrok commented 6 years ago

Generate a statistical summary comparing RefSeq vs. non-RefSeq assemblies for "Primates (24)", "Rodents (20)", "Insects (92)", "Other Invertebrates (27)" (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/all/)

PhyloGrok commented 6 years ago

Could parse this from the "Genome Reports" files:

ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

PhyloGrok commented 6 years ago

Data for the following taxa:

  1. Primates: txid9443
  2. Rodents: txid9989
  3. Insects: txid6960
  4. Other Invertebrates: txid33208 (Metazoa), excluding txid6960 (Insects) and txid7742 (Vertebrates)
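The "Other Invertebrates" set can be written as a single Entrez query using the boolean NOT operator. A minimal sketch, with a hypothetical helper `taxon_query` that builds the query string from the taxids listed above:

```shell
#!/bin/sh
# Build an Entrez query: one included taxon minus any number of excluded taxa.
# Usage: taxon_query INCLUDE_TXID [EXCLUDE_TXID ...]
taxon_query() {
    query="txid${1}[Organism:exp]"
    shift
    for ex in "$@"; do
        query="$query NOT txid${ex}[Organism:exp]"
    done
    printf '%s\n' "$query"
}

taxon_query 33208 6960 7742
# prints: txid33208[Organism:exp] NOT txid6960[Organism:exp] NOT txid7742[Organism:exp]
```

The result can then be passed to the same pipeline as for Primates, e.g. `esearch -db genome -query "$(taxon_query 33208 6960 7742)"`, keeping the double quotes so the shell does not treat the brackets as a glob pattern.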