dnbaker / dashing

Fast and accurate genomic distances using HyperLogLog
GNU General Public License v3.0

dist: with a lot of data, binary output "-b" crashes but csv output "-T" does not #64

Open tsp-kucbd opened 3 years ago

tsp-kucbd commented 3 years ago

When running dashing dist with 400,000 genomes, the program succeeds when asked for csv "-T" output but crashes when asked for binary "-b" output. Both commands succeed without problems when using only 10,000 genomes.

This works:

cat list_of_genomes|wc -l
412656
./dashing_s512 dist -F list_of_genomes  -k 15 -S16 -p 39 -M -T --use-nthash --cache-sketches |pigz > NGOT.dashdist.tsv.gz

Memory usage 330Gb over ca. 3 days on a 1.5TB machine with 40 processors (most of the time is spent writing the matrix to disk)

This crashes after ca. 3 minutes

 ./dashing_s512 dist -F list_of_genomes  -k 15 -S16 -p 39 -M -b -o labelsB --use-nthash -O  NGOT.dashdists.bin

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[1]    40318 abort (core dumped)  ./dashing_s512 dist -F list_of_genomes -k 15 -S16 -p 39 -M -b -o labelsB  -O

Peak memory 30.6Gb

Update (genome sizes are around 30 kb): 300,000 genomes (8,762,264,277 nucleotides) do not seem to crash; 310,000 genomes (8,958,253,751 nucleotides) crash.
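For scale, a rough estimate of how large the pairwise result gets can help when reading the numbers above. The sketch below assumes the all-vs-all distances are held as a condensed upper triangle of 4-byte floats; both the layout and the element width are assumptions made for illustration and are not confirmed anywhere in this thread, and `condensed_matrix_bytes` is a hypothetical helper name.

```python
def condensed_matrix_bytes(n_genomes: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the n*(n-1)/2 pairwise distances of an all-vs-all run.

    Assumes (for illustration only) one fixed-width value per unordered pair.
    """
    n_pairs = n_genomes * (n_genomes - 1) // 2
    return n_pairs * bytes_per_value


# Genome counts mentioned in this report.
for n in (10_000, 300_000, 310_000, 412_656):
    size = condensed_matrix_bytes(n)
    print(f"{n:>7} genomes: {size / 1e9:8.1f} GB")
```

Under these assumptions the 412,656-genome run comes to about 340 GB (10^9 bytes), the same order of magnitude as the 330Gb reported for the "-T" run. The estimate says nothing about why the "-b" run aborts at 30.6Gb.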

dnbaker commented 3 years ago

Hi, @tsp-kucbd!

Thanks for making this issue. I'll investigate this and get you a fix + updated binaries as soon as I can.

Best,

Daniel

dnbaker commented 3 years ago

Hi again,

I have a potential fix on the dev branch, and I'll have working binaries for the v0.5.5 release tomorrow.

Thanks for pointing this out, and I hope this solves the problem.

tsp-kucbd commented 3 years ago

Thank you! I noticed that the binary option only works for symmetric "all vs all" runs, not for asymmetric -F and -Q distance calculations. As the resulting distance matrices are getting huge, it can be quite difficult to read the whole thing into memory. For a future release, would it be possible to get a pairwise distance reporting option, potentially with an upper distance threshold?
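Until such an option exists, one workaround is to stream the "-T" text matrix row by row and keep only the pairs under a cutoff, so the full matrix never has to sit in memory. The sketch below is not part of dashing. It assumes tab-separated output with a `#Names` header line followed by one row per genome in the same order as the columns (a symmetric all-vs-all run), and `pairs_below` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, Tuple


def pairs_below(lines: Iterable[str], max_dist: float) -> Iterator[Tuple[str, str, float]]:
    """Yield (row_name, column_name, distance) for each i < j entry <= max_dist."""
    names = None
    row_idx = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#Names"):
            names = line.split("\t")[1:]  # column labels
            continue
        if line.startswith("#") or names is None:
            continue  # size table and other header lines before the matrix
        fields = line.split("\t")
        row_name, values = fields[0], fields[1:]
        # Only look right of the diagonal so each unordered pair is reported once.
        for col_idx in range(row_idx + 1, min(len(values), len(names))):
            try:
                dist = float(values[col_idx])
            except ValueError:
                continue  # non-numeric placeholder
            if dist <= max_dist:
                yield row_name, names[col_idx], dist
        row_idx += 1


# Made-up three-genome matrix in the assumed layout.
sample = [
    "#Path\tSize (est.)\n",
    "a.fa\t100\n",
    "b.fa\t100\n",
    "c.fa\t100\n",
    "#Names\ta.fa\tb.fa\tc.fa\n",
    "a.fa\t0\t0.1\t0.9\n",
    "b.fa\t0.1\t0\t0.4\n",
    "c.fa\t0.9\t0.4\t0\n",
]
for a, b, d in pairs_below(sample, 0.5):
    print(a, b, d, sep="\t")
```

Because the function accepts any iterable of lines, it can read a compressed file directly, for example from `gzip.open(path, "rt")`. An asymmetric -F/-Q matrix would need the `i < j` restriction removed.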

dnbaker commented 3 years ago

Thanks for pointing this out for asymmetric comparisons. I've corrected this in the current dev branch. You can build it from source now or wait for the binaries in the next release.

In the coming weeks we are also planning a new interface for sparsified distance calculations, which would return a CSR-format sparse matrix that reports only hits above/below thresholds. I'll let you know when it's available.
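For readers unfamiliar with CSR (compressed sparse row), the sketch below shows the generic three-array layout built from a small dense matrix with a distance cutoff. It illustrates the format only and is not dashing's planned interface; `dense_to_csr` is a hypothetical helper, and skipping the diagonal is an assumption made here.

```python
from typing import List, Sequence, Tuple


def dense_to_csr(
    rows: Sequence[Sequence[float]], max_dist: float
) -> Tuple[List[float], List[int], List[int]]:
    """Keep off-diagonal entries <= max_dist and return (data, indices, indptr)."""
    data: List[float] = []    # kept distances, row after row
    indices: List[int] = []   # column index of each kept distance
    indptr: List[int] = [0]   # row i occupies data[indptr[i]:indptr[i + 1]]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if i != j and value <= max_dist:
                data.append(value)
                indices.append(j)
        indptr.append(len(data))
    return data, indices, indptr


dense = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.4],
    [0.9, 0.4, 0.0],
]
data, indices, indptr = dense_to_csr(dense, max_dist=0.5)
print(data)     # [0.1, 0.1, 0.4, 0.4]
print(indices)  # [1, 0, 2, 1]
print(indptr)   # [0, 1, 3, 4]
```

The same triple can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3))` if SciPy is available. Storage grows with the number of retained hits rather than with the square of the number of genomes.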

tsp-kucbd commented 3 years ago

It seems the new version has trouble with the -T flag. Instead of the expected tsv text file, one gets binary output:

./dashing_s512 cmp -p 10 -M -k15 -S16 -W -F test.list -T

Dashing version: v0.5.4-24-g10cf
#Path   Size (est.)
Microcystis_virus_Ma-LMM01.fasta.gz  1
Stx2-converting_phage_86.fasta.gz    1
Staphylococcus_prophage_phiPV83.fasta.gz     1
Staphylococcus_virus_phiSLT.fasta.gz 1
Enterobacteria_phage_ST104.fasta.gz  1
Staphylococcus_phage_PVL.fasta.gz    1
Bacillus_phage_phi105.fasta.gz       1
Pseudomonas_virus_phiCTX.fasta.gz    1
Vibrio_virus_Kappa.fasta.gz  1
Thermus_virus_IN93.fasta.gz  1

���#���#���#���#���#��L$��L$���#��������#�$����#������#�$����#�����#�$����#����#�$����#���#�$����#����#��L$�$��L$���#%

whereas the previous version generates text output as expected:

dashing_5.8 cmp -p 10 -M -k15 -S16 -W -F test.list -T
Dashing version: v0.5-8-g91e5
#Path   Size (est.)
Microcystis_virus_Ma-LMM01.fasta.gz  1
Stx2-converting_phage_86.fasta.gz    1
Staphylococcus_prophage_phiPV83.fasta.gz     1
Staphylococcus_virus_phiSLT.fasta.gz 1
Enterobacteria_phage_ST104.fasta.gz  1
Staphylococcus_phage_PVL.fasta.gz    1
Bacillus_phage_phi105.fasta.gz       1
Pseudomonas_virus_phiCTX.fasta.gz    1
Vibrio_virus_Kappa.fasta.gz  1
Thermus_virus_IN93.fasta.gz  1
#Names  Microcystis_virus_Ma-LMM01.fasta.gz  Stx2-converting_phage_86.fasta.gz      Staphylococcus_prophage_phiPV83.fasta.gz     Staphylococcus_virus_phiSLT.fasta.gz   Enterobacteria_phage_ST104.fasta.gz  Staphylococcus_phage_PVL.fasta.gz      Bacillus_phage_phi105.fasta.gz       Pseudomonas_virus_phiCTX.fasta.gz      Vibrio_virus_Kappa.fasta.gz  Thermus_virus_IN93.fasta.gz
Microcystis_virus_Ma-LMM01.fasta.gz  0.000000        0.000000        0.000000        0.000000        0.000000        0.000000  0.000000        0.000000        0.000000        -0.000000
Stx2-converting_phage_86.fasta.gz    0.000000        0.000000        -0.000000       -0.000000       -0.000000       -0.000000 0.000000        0.000000        -0.000000       0.000000
Staphylococcus_prophage_phiPV83.fasta.gz     0.000000        -0.000000       0.000000        -0.000000       -0.000000-0.000000        0.000000        0.000000        -0.000000       0.000000
Staphylococcus_virus_phiSLT.fasta.gz 0.000000        -0.000000       -0.000000       0.000000        -0.000000       -0.000000 0.000000        0.000000        -0.000000       0.000000
Enterobacteria_phage_ST104.fasta.gz  0.000000        -0.000000       -0.000000       -0.000000       0.000000        -0.000000 0.000000        0.000000        -0.000000       0.000000
Staphylococcus_phage_PVL.fasta.gz    0.000000        -0.000000       -0.000000       -0.000000       -0.000000       0.000000  0.000000        0.000000        -0.000000       0.000000
Bacillus_phage_phi105.fasta.gz       0.000000        0.000000        0.000000        0.000000        0.000000        0.000000  0.000000        -0.000000       0.000000        0.000000
Pseudomonas_virus_phiCTX.fasta.gz    0.000000        0.000000        0.000000        0.000000        0.000000        0.000000  -0.000000       0.000000        0.000000        0.000000
Vibrio_virus_Kappa.fasta.gz  0.000000        -0.000000       -0.000000       -0.000000       -0.000000       -0.0000000.000000 0.000000        0.000000        0.000000
Thermus_virus_IN93.fasta.gz  -0.000000       0.000000        0.000000        0.000000        0.000000        0.000000 0.000000 0.000000        0.000000        0.000000

dnbaker commented 3 years ago

You're right! I've uploaded another version overwriting the old v0.5.5, and it should produce some results that look like this:

./dashing_s256 dist -T bonsai/test/*.fna.gz
Dashing version: v0.5.5-4-gb11d
#Path   Size (est.)
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   2368528
#Namesbonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz    bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz  0   0   0   0
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz   0   0   0   0
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz  0   0   0   0.550403
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz   0   0   0.550403    0

I'll have to add more tests for differing output modes.

Thanks,

Daniel
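One way a regression check for the output modes could look: capture the bytes a command writes and assert that "-T" output parses as tab-separated text while "-b" output does not. The sketch below is an assumption about how such a test might be written, not code from the dashing repository; `looks_like_tsv` is a hypothetical helper and its heuristics (ASCII-only, no NUL bytes, at least two columns per row) are illustrative.

```python
def looks_like_tsv(payload: bytes, min_columns: int = 2) -> bool:
    """Heuristic: True if payload is ASCII text whose data rows are tab-separated."""
    try:
        text = payload.decode("ascii")
    except UnicodeDecodeError:
        return False  # raw float bytes usually contain values above 0x7F
    if "\x00" in text:
        return False  # NUL bytes do not occur in a text matrix
    # Ignore blank lines and '#'-prefixed header lines.
    rows = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return bool(rows) and all(len(ln.split("\t")) >= min_columns for ln in rows)


text_output = b"#Names\ta.fa\tb.fa\na.fa\t0\t0.550403\nb.fa\t0.550403\t0\n"
binary_output = b"\x00\x00\x00\x00\x3b\xe7\x0c\x3f"  # two little-endian float32 values

print(looks_like_tsv(text_output))    # True
print(looks_like_tsv(binary_output))  # False
```

In a real test the payload would come from running the binary, for example via `subprocess.run(..., capture_output=True).stdout`, once per output flag.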