Arcadia-Science / prehgt

A pipeline for lightweight screening of Eukaryotic genomes and transcriptomes for recent HGT
MIT License

converting the `download/*_genome.csv` files into a species list #58

Open taylorreiter opened 5 months ago

taylorreiter commented 5 months ago

PreHGT runs at the genus level to pull in (pseudo) pangenome information, which is used to estimate contamination vs. real transfer events.

Right now, we don't do a good job of reporting how many or which species are represented for each genus. Below is some code I recently used to get the species (organism name) information from NCBI based on the genome accession (`GCA`/`GCF`).

I ran this on all files matching `download/*_genome.csv`.

Install tools

conda install -c conda-forge ncbi-datasets-cli jq

Collect genome accessions without CSV headers

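# assumes the current directory holds the *_genome.csv files;
# this pattern also keeps genomes.csv itself out of the loop on re-runs
for infile in *_genome.csv
do
  tail -n +2 "$infile" >> genomes.csv
done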

Get species (organism name)

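# genomes.csv already has its headers stripped, so no row is skipped here.
# The three-argument match() is a gawk extension, hence gawk rather than awk.
while IFS= read -r accession
do
    datasets summary genome accession "$accession" | jq -r '.reports[] | [.accession, .organism.organism_name] | @csv'
done < <(gawk -F, 'match($2, /(GCA|GCF)_[0-9]+\.[0-9]+/, m) {print m[0]}' genomes.csv)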
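
To get from the accession/organism pairs to the "how many/which species per genus" report, something like the following might work. This is only a rough sketch: `species_per_genus` is a hypothetical helper name, not part of PreHGT, and it assumes that the first word of the organism name is the genus and the first two words are the species. That assumption will not hold for names such as `Candidatus ...` or bracketed genera, so those would need special handling.

```shell
# Hypothetical helper (not part of PreHGT): summarize species per genus.
# Reads lines like  "GCA_...","Genus species strain ..."  on stdin,
# i.e. the @csv output that jq prints in the loop above.
species_per_genus() {
    awk -F'","' '
    {
        name = $2
        gsub(/"/, "", name)                 # drop the remaining CSV quote
        n = split(name, words, " ")
        if (n < 2) next                     # skip names without a species epithet
        genus = words[1]
        species = words[1] " " words[2]     # assumption: binomial = first two words
        if (!((genus, species) in seen)) {
            seen[genus, species] = 1
            count[genus]++
            if (genus in list)
                list[genus] = list[genus] "; " species
            else
                list[genus] = species
        }
    }
    END {
        # one row per genus: genus, number of distinct species, species list
        for (genus in count)
            printf "%s,%d,%s\n", genus, count[genus], list[genus]
    }' | LC_ALL=C sort
}
```

If the output of the loop is saved to a file (say `species.csv`, a placeholder name), then `species_per_genus < species.csv` prints one line per genus with the number of distinct species and a `; `-separated list of them.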