Closed by sarah872 3 years ago
Hey there, Sarah :)
Thanks for posting about this! I had changed the name of the bit script at some point to be more clear, and didn't realize it didn't match up here now. Things are updated appropriately on the SCG code page now.
Also, I have (semi-secretly) introduced a program into GToTree that automates this process, but it is still kind of in beta mode as I haven't been able to test it much yet, which is why I haven't documented it or put up an example. But if you wanted to try that, it's gtt-gen-SCG-HMMs. It uses the same conceptual steps as laid out on the SCG set wiki page (meaning scanning only for PFams with > 50% coverage of the underlying proteins, and retaining those with exactly 1 hit in the specified percentage of input genomes, default 90%).
Thanks again!
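The retention rule described above (keep a PFam only if it has exactly 1 hit in at least the specified percentage of input genomes) can be sketched roughly as below. This is only an illustration, not how gtt-gen-SCG-HMMs is actually implemented: the input table format (genome, PFam, hit count), the function name `select_scgs`, and the PFam names `PF_A` and `PF_B` are all made up for the example, and the > 50% coverage pre-filter is not covered.

```shell
# Hypothetical helper, NOT part of GToTree.
# $1 = tab-separated table: genome <TAB> pfam <TAB> number of hits (assumed format)
# $2 = total number of input genomes
# $3 = minimum percentage of genomes that must have exactly 1 hit
select_scgs() {
    awk -F'\t' -v total="$2" -v pct="$3" '
        $3 == 1 { single[$2]++ }   # genomes in which this PFam has exactly one hit
        END {
            for (p in single)
                if (single[p] * 100 >= pct * total) print p
        }' "$1" | sort
}

# Toy data for 10 made-up genomes
tmp=$(mktemp)
for i in 1 2 3 4 5 6 7 8 9 10; do
    # PF_A: one hit in genomes 1-9, none in genome 10 (single-copy in 90%)
    if [ "$i" -le 9 ]; then printf 'g%s\tPF_A\t1\n' "$i"; else printf 'g%s\tPF_A\t0\n' "$i"; fi
    # PF_B: two hits in genomes 1-3, one hit elsewhere (single-copy in 70%)
    if [ "$i" -le 3 ]; then printf 'g%s\tPF_B\t2\n' "$i"; else printf 'g%s\tPF_B\t1\n' "$i"; fi
done > "$tmp"

# With the default 90% threshold only PF_A is retained
select_scgs "$tmp" 10 90
```

Lowering the last argument (for example to 70) would also retain PF_B in this toy table, which is the trade-off the percentage setting controls.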
I'm shook! gtt-gen-SCG-HMMs is exactly what I was looking for! This will be so useful for so many users!
In case anyone needs it too, here's my code. These two lines took me a little less than 1 hour on 8 CPUs!
esearch -query '"txid135619"[Organism:exp] "Complete genome"[filter] AND latest[filter] NOT anomalous[filter] AND "has annotation"[Properties]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession >genomes.txt
gtt-gen-SCG-HMMs -a genomes.txt -n 8 -o outdir
Thank you, Mike!
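For anyone reusing the two commands above: since xtract is called with -def "NA", any record lacking an AssemblyAccession would show up as a literal NA line in genomes.txt. A small optional check like the one below could tidy the list before it is passed on. It is just a sketch; the function name `clean_accessions` and the demo accessions are made up, and whether gtt-gen-SCG-HMMs already tolerates such lines on its own is not something this thread establishes.

```shell
# Hypothetical helper: drop blank lines and "NA" placeholders, then de-duplicate
clean_accessions() {
    grep -v -e '^NA$' -e '^[[:space:]]*$' "$1" | sort -u
}

# Demo with made-up accessions (not real assemblies)
demo=$(mktemp)
printf 'GCF_000000001.1\nNA\nGCF_000000002.1\n\nGCF_000000001.1\n' > "$demo"

# Prints the two unique accessions, one per line
clean_accessions "$demo"
```

In practice this would be run as `clean_accessions genomes.txt > genomes.clean.txt`, with the cleaned file then given to the -a option.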
Awesome! Thanks for adding what you did and noting the runtime and cpus :) Out of curiosity, how many SCGs did it end up capturing?
The txid135619 (Oceanospirillales) query gave 126 genomes (minus 5 which were not found). It ran on 8 CPUs for 47 minutes and found 56 SCGs.
Hi! I was trying to redo the Bacteria.hmm (there are over 24,000 genomes!) but I couldn't find the bit-simplify-fasta-headers that is needed in unzip_rename_cat_genes.sh in your bioinf_tools. Could you help?