AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

bacterial marker set #35

Closed sarah872 closed 3 years ago

sarah872 commented 3 years ago

Hi! I was trying to re-do the Bacteria.hmm (there are over 24,000 genomes!) but I couldn't find the bit-simplify-fasta-headers that is needed in unzip_rename_cat_genes.sh in your bioinf_tools. Could you help?

AstrobioMike commented 3 years ago

Hey there, Sarah :)

Thanks for posting about this! I had changed the name of the bit script at some point to be more clear, and didn't realize it didn't match up here now. Things are updated appropriately on the SCG code page now.

Also, I have (semi-secretly) introduced a program into GToTree that automates this process, but it is still in beta as I haven't been able to test it much yet – which is why I haven't documented it or put up an example. If you want to try it, it's gtt-gen-SCG-HMMs. It follows the same conceptual steps laid out on the SCG set wiki page: it scans only for PFams with > 50% coverage of the underlying proteins, and retains those with exactly 1 hit in the specified percentage of input genomes (default 90%).
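For anyone who wants to see the single-copy filter spelled out, here is a minimal sketch of that last step only. It is not the gtt-gen-SCG-HMMs code. The input table (genome, Pfam ID, hit count, tab-separated, one row per genome/Pfam pair), the IDs PF_A/PF_B/PF_C, and the file names are all made up for illustration. It also assumes every genome appears in at least one row.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical toy table for 10 genomes (g1..g10):
#   PF_A: exactly 1 hit in all 10 genomes
#   PF_B: exactly 1 hit in 9 genomes, 2 hits in g10
#   PF_C: exactly 1 hit in 8 genomes, absent from g9 and g10
for i in 1 2 3 4 5 6 7 8 9 10; do
    printf 'g%s\tPF_A\t1\n' "$i"
    if [ "$i" -eq 10 ]; then n=2; else n=1; fi
    printf 'g%s\tPF_B\t%s\n' "$i" "$n"
    if [ "$i" -le 8 ]; then printf 'g%s\tPF_C\t1\n' "$i"; fi
done > "$workdir/hit_counts.tsv"

# Minimum percentage of genomes that must have exactly one hit.
min_pct=90

# Count distinct genomes, count genomes with exactly 1 hit per Pfam,
# then keep the Pfams that meet the threshold.
retained=$(awk -F'\t' -v min_pct="$min_pct" '
    !seen[$1]++ { n_genomes++ }
    $3 == 1     { single[$2]++ }
    END {
        for (pfam in single)
            if (100 * single[pfam] / n_genomes >= min_pct)
                print pfam
    }
' "$workdir/hit_counts.tsv" | sort)

echo "$retained"
```

With this toy table, PF_A and PF_B are kept and PF_C is dropped, since it is single-copy in only 80% of the genomes.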

Thanks again!

sarah872 commented 3 years ago

I'm shook! gtt-gen-SCG-HMMs is exactly what I was looking for! This will be so useful for so many users!

In case anyone needs it too, here's my code. These two lines took me a little less than 1 hour on 8 CPUs!

esearch -query '"txid135619"[Organism:exp] AND "Complete genome"[filter] AND latest[filter] NOT anomalous[filter] AND "has annotation"[Properties]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession >genomes.txt
gtt-gen-SCG-HMMs -a genomes.txt -n 8 -o outdir
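Because xtract is called with -def "NA", a record without an accession would show up as a literal NA line. A quick check of the list before the long run might look like the sketch below. I don't know whether this query actually produces NA lines or how gtt-gen-SCG-HMMs treats them, so take it as a precaution only. The accessions are placeholders, and genomes.clean.txt is a made-up file name.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Placeholder accession list with one NA, one duplicate, and one blank line.
printf '%s\n' GCF_000000001.1 GCF_000000002.1 NA GCF_000000001.1 '' \
    > "$workdir/genomes.txt"

# Drop NA and empty lines, then remove duplicates.
grep -v -e '^NA$' -e '^$' "$workdir/genomes.txt" | sort -u \
    > "$workdir/genomes.clean.txt"

# Count what is left.
n_clean=$(wc -l < "$workdir/genomes.clean.txt" | tr -d ' ')
echo "accessions after cleanup: $n_clean"
```

The cleaned file could then be passed to -a in place of genomes.txt.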

Thank you, Mike!

AstrobioMike commented 3 years ago

Awesome! Thanks for adding what you did and noting the runtime and cpus :) Out of curiosity, how many SCGs did it end up capturing?

sarah872 commented 3 years ago

The txid135619 (Oceanospirillales) query returned 126 genomes (minus 5 which were not found). It ran on 8 CPUs for 47 minutes and found 56 SCGs.