bluegenes / 2021-virus-exploration

find_founders cluster exploration #1

Open bluegenes opened 3 years ago

bluegenes commented 3 years ago

Using a random subset of 21,562 protein signatures from prodigal translation of pigeon1.0 --

  1. threshold = 0.1; protein k=7; batch size = 5000

time python find-founders.py -k 7 --moltype protein --siglist test-sigs.siglist.txt --batch-size 5000 --prefix test.prot7.mc0.1 --threshold 0.1

clusters:

  15317 test.prot7.mc0.1.founders.siglist.txt
   6245 test.prot7.mc0.1.members.siglist.txt
  21562 total

time: real 84m25.249s

  2. threshold = 0.05; protein k=7; batch size = 5000

time python find-founders.py -k 7 --moltype protein --siglist test-sigs.siglist.txt --batch-size 5000 --prefix test.prot7.mc0.05 --threshold 0.05

time: real 42m6.182s

clusters:

  12045 test.prot7.mc0.05.founders.siglist.txt
   9517 test.prot7.mc0.05.members.siglist.txt
  21562 total

  3. threshold = 0.05; protein k=7; batch size = 10000

time python find-founders.py -k 7 --moltype protein --siglist test-sigs.siglist.txt --batch-size 10000 --prefix test.prot7.mc0.05_bs10000 --threshold 0.05

time: real 42m50.027s

clusters:

  12049 test.prot7.mc0.05_bs10000.founders.siglist.txt
   9513 test.prot7.mc0.05_bs10000.members.siglist.txt
  21562 total

  4. threshold = 0.05; protein k=10; batch size = 5000

time python find-founders.py -k 10 --moltype protein --siglist test-sigs.siglist.txt --batch-size 5000 --prefix test.prot10.mc0.05 --threshold 0.05

time: real 55m14.529s

clusters:

  14115 test.prot10.mc0.05.founders.siglist.txt
   7447 test.prot10.mc0.05.members.siglist.txt
  21562 total
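
To make the founders/members split above easier to reason about, here is a minimal sketch of one plausible greedy scheme. It is not taken from find-founders.py: the function names (`containment`, `find_founders_sketch`), the toy hash sets, and the choice to take the maximum containment in either direction are all assumptions for illustration, and batching (`--batch-size`) is not modeled at all. Under this reading, a lower threshold makes it easier for a signature to join an existing founder, which is at least consistent with the drop from 15317 founders at threshold 0.1 to 12045 at 0.05.

```python
from typing import Dict, List, Set, Tuple


def containment(query: Set[int], target: Set[int]) -> float:
    """Fraction of the query's hashes that are also present in the target."""
    if not query:
        return 0.0
    return len(query & target) / len(query)


def find_founders_sketch(
    sketches: Dict[str, Set[int]], threshold: float
) -> Tuple[List[str], List[str]]:
    """Greedy founder/member split (hypothetical logic, not the real script)."""
    founders: List[str] = []
    members: List[str] = []
    for name, hashes in sketches.items():
        # Assumption: compare against every founder seen so far and keep the
        # largest containment in either direction.
        best = max(
            (
                max(
                    containment(hashes, sketches[f]),
                    containment(sketches[f], hashes),
                )
                for f in founders
            ),
            default=0.0,
        )
        if best >= threshold:
            members.append(name)   # close enough to an existing founder
        else:
            founders.append(name)  # nothing similar yet: start a new cluster
    return founders, members


# Toy hash sets standing in for real signatures.
toy = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5, 6},   # shares 2 of 4 hashes with "a" (containment 0.5)
    "c": {7, 8, 9, 10},  # shares nothing with "a" or "b"
}

for threshold in (0.6, 0.5):
    founders, members = find_founders_sketch(toy, threshold)
    print(threshold, founders, members)
```

With the toy input, all three sets are founders at threshold 0.6, while at 0.5 "b" becomes a member of "a". Because the result of a greedy pass depends on processing order, small differences between runs (such as 12045 vs 12049 founders for the two batch sizes) would not be surprising under this kind of scheme, though the actual cause is not stated here.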
bluegenes commented 3 years ago

Doing protein-level clustering of all 266k genomes --

time python find-founders.py -k 10 --moltype protein --siglist /home/ntpierce/2021-virus-exploration/output.protein-pigeon/compare/pigeon1.0.prodigal.siglist.txt --prefix pigeon1.0.protein-k10.mc0.05 --threshold 0.05

time: real 3411m14.916s (~2.4 days)

Found 92,280 founders

92280 pigeon1.0.protein-k10.mc0.05.founders.siglist.csv
174525 pigeon1.0.protein-k10.mc0.05.members.siglist.csv

batched rarefaction:

[image: batched rarefaction plot]

DNA-level clustering (k=21) was killed after 5 days (srun issue); dropping the mc (max containment) threshold to 0.01 and restarting.

bluegenes commented 3 years ago

DNA k21 with mc 0.01:

time python ../find-founders.py -k 21 --moltype DNA --siglist /group/ctbrowngrp/virus-references/pigeon/dna-input/pigeon1.0.signatures.txt --prefix pigeon1.0.dna-k21.mc0.01 --threshold 0.01

time: real 5030m2.497s (~3.5 days)

Found 108,355 founders

108355 pigeon1.0.dna-k21.mc0.01.founders.siglist.csv
158450 pigeon1.0.dna-k21.mc0.01.members.siglist.csv
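
As a quick check on the "~2.4 days" and "~3.5 days" figures, the `real` values printed by `time` can be converted directly. This is a small sketch; the helper name `real_time_to_seconds` is made up here, and it only handles the `<minutes>m<seconds>s` form that appears in this thread.

```python
import re


def real_time_to_seconds(text: str) -> float:
    """Convert a `time` value such as '3411m14.916s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock times for the two full-database runs, copied from the thread.
runs = {
    "protein k=10, threshold 0.05": "3411m14.916s",
    "DNA k=21, threshold 0.01": "5030m2.497s",
}

for label, real in runs.items():
    days = real_time_to_seconds(real) / 86400  # 86400 seconds per day
    print(f"{label}: {days:.2f} days")
```

The protein run comes out at about 2.37 days and the DNA run at about 3.49 days, matching the rounded figures.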
bluegenes commented 3 years ago

Whoops, forgot to add info on dayhoff clustering, full pigeon database:

Pigeon -- dayhoff clustering

dayhoff k=19, max containment 0.05

   85967 pigeon1.0.dayhoff-k19.mc0.05.founders.siglist.csv
  180838 pigeon1.0.dayhoff-k19.mc0.05.members.siglist.csv

GTDB

Not virus, but I ran this on the GTDB representative set as well; dropping the results here for now.

dayhoff k=19, max containment 0.1

     9236 gtdbr95rep.dayhoff-k19.mc0.1.founders.siglist.csv
    22674 gtdbr95rep.dayhoff-k19.mc0.1.members.siglist.csv

protein k=10, max containment 0.05

    6097 gtdbr95rep.protein-k10.mc0.05.founders.siglist.csv
   25813 gtdbr95rep.protein-k10.mc0.05.members.siglist.csv

protein k=10, max containment 0.1

   10814 gtdbr95rep.protein-k10.mc0.1.founders.siglist.csv
   21096 gtdbr95rep.protein-k10.mc0.1.members.siglist.csv
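
To compare the full-database runs side by side, the founder fraction can be computed from the counts reported in this thread. This is only a convenience sketch: the `runs` dictionary and `founder_fraction` helper are made up here, and it assumes each reported count corresponds to one signature per line.

```python
# (dataset, settings) -> (founders, members), copied from the counts above.
runs = {
    ("pigeon1.0", "protein k=10, mc 0.05"): (92280, 174525),
    ("pigeon1.0", "DNA k=21, mc 0.01"): (108355, 158450),
    ("pigeon1.0", "dayhoff k=19, mc 0.05"): (85967, 180838),
    ("gtdbr95rep", "dayhoff k=19, mc 0.1"): (9236, 22674),
    ("gtdbr95rep", "protein k=10, mc 0.05"): (6097, 25813),
    ("gtdbr95rep", "protein k=10, mc 0.1"): (10814, 21096),
}


def founder_fraction(founders: int, members: int) -> float:
    """Share of all signatures that ended up as cluster founders."""
    return founders / (founders + members)


for (dataset, settings), (founders, members) in runs.items():
    total = founders + members
    frac = founder_fraction(founders, members)
    print(f"{dataset:<11} {settings:<22} total={total:>6} founders={frac:.1%}")
```

Every pigeon1.0 run sums to 266,805 signatures and every gtdbr95rep run to 31,910, so the founder and member lists partition the same input set in each case.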