dib-lab / genome-grist

map Illumina metagenomes to genomes!
https://dib-lab.github.io/genome-grist/

choose abundtrim memory parameters based on k-mer counting estimates #53

Open · ctb opened 3 years ago

ctb commented 3 years ago

ref #51

> Mo: so, khmer pre-allocates the memory even when there's no need for it?
>
> titus: Yes - after all, how do you know how many k-mers there are if you haven't read through the whole data set yet?
>
> titus: the MQF paper has to address the same problem ;). In khmer we use Bloom-filter-based data structures that let us preallocate memory.
>
> Mo: Got it now ^^ thanks
>
> titus: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0101271
>
> titus: :grin:
>
> titus: (there WILL be a quiz… :grimacing:)
>
> Mo: haha, I am thrilled
>
> Mo: If we were to develop an estimation function for the distinct k-mer count, would a tool like KmerStream or an alternative algorithm reduce the need for the fixed memory requirement? https://github.com/pmelsted/KmerStream
>
> titus: We have HLL implemented too. Still have to walk across the data once.
>
> titus: now that I see your previous comment - yeah, we've explored this (extremely) thoroughly in khmer. with grist, we could totally integrate a k-mer counting scheme without much extra effort, since we already do multiple passes across the data.
>
> titus: …please let me know when the pull request is ready :grin:
>
> titus: (more seriously, it's just not that big an issue for grist, so I don't worry about it. the high-water mark for grist's memory usage is the database search at the moment, and for genbank that's ~120 GB, so …)
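To make the sizing question above concrete: once a streaming pass (HLL, KmerStream, etc.) has estimated the number of distinct k-mers, standard Bloom filter math turns that estimate into a memory budget. A back-of-envelope sketch — this is not khmer or grist code, and `estimated_distinct_kmers` is a hypothetical input (e.g. the kind of figure khmer's unique-kmers.py reports):

```python
import math

def bloom_memory_bytes(estimated_distinct_kmers, fp_rate=0.1):
    """Optimal Bloom filter size for n items at false-positive rate p:
    m = -n * ln(p) / (ln 2)^2 bits."""
    n, p = estimated_distinct_kmers, fp_rate
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    return math.ceil(m_bits / 8)

# e.g. 5 billion distinct k-mers at a 10% false-positive rate:
print(bloom_memory_bytes(5_000_000_000) / 1e9)  # ~3.0 GB
```

(khmer's abundance trimming actually uses counting tables rather than a plain Bloom filter, so the real budget would be larger, but the estimate-then-allocate logic is the same.) That's the sense in which a single counting pass could choose metagenome_trim_memory automatically instead of hard-coding a value.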

ctb commented 3 years ago

hey @mr-eyes note that https://github.com/dib-lab/genome-grist/pull/56 will add k-mer count estimation that could be used for this.

ctb commented 3 years ago

whoops, #56 adds it post-trimming. So we'd need to adapt it to run pre-trimming.

bluegenes commented 3 years ago

Just re-upping that this would be really nice!

If not, can we explicitly suggest a number for metagenome_trim_memory for complex metagenomes in this section? I spy 20e9 in some configs on farm! Happy to make the PR if this option is preferred...

ctb commented 3 years ago

> If not, can we explicitly suggest a number for metagenome_trim_memory for complex metagenomes in this section? I spy 20e9 in some configs on farm! Happy to make the PR if this option is preferred...

20e9 is good for most metagenomes (and trim-low-abund should fail properly if it's too low), but I was leery of putting that in as a default. Now that we have system.conf and defaults.conf, it seems appropriate to set that as a default and bake it into the documentation (which I think is only the README, currently).

Until sourmash 4.1 was released, the prefetch step was the high-water mark for memory usage (~120 GB sometimes!). And until #83 is merged and someone checks it :), I'm not sure what the memory usage is - my theory is that it was a combination of sourmash's memory requirements for database search (now much reduced) and keeping all the signatures in memory (which #83 fixes).

Note that @bryshalm recently ran into what seems to be out-of-memory errors with all of genbank, which we resolved by switching to the gtdb guide genomes database. She was using sourmash 4.1, so it wasn't the database searching that was the problem... my guess was it was the collection of signatures in memory. Interested to see if #83 fixes it!

ctb commented 3 years ago

with PR #83, and the following config file,

```yaml
sample:
- SRR12324253
metagenome_trim_memory: 20e9
outdir: outputs.matt
sourmash_database_glob_pattern: /group/ctbrowngrp/gtdb/databases/gtdb-rs202.genomic.zip
prefetch_memory: 50e9
```

the max memory usage for doing the genbank prefetch against 280k genomes was 4670832K, or, by my calculations, 5 GB.
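As a quick sanity check on that conversion (assuming the 4670832K figure is max RSS in KiB, as e.g. Slurm reports it):

```python
max_rss_kib = 4_670_832
print(max_rss_kib * 1024 / 1e9)  # 4.78 -> ~4.8 GB (decimal)
print(max_rss_kib / 2**20)       # 4.45 GiB (binary)
```

so ~5 GB is the right ballpark either way.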

trying with genbank now.

ctb commented 3 years ago

against genbank, with the SBTs and --no-linear --prefetch, max memory usage was 62.8 GB, but this was probably because nearly 500k genomes were found (which seems strange, looking into that separately).

ctb commented 3 years ago

randomly recording this here...

so, using prefetch on SRR12324253 against genbank, @bryshalm found 405838 matching signatures. I wondered what this could possibly be, and so I did the following:

```
csvtk cut -f match_name *.prefetch.csv | cut -d\  -f2-3 | sort | uniq -c | sort -n | tail
```

and got the following:

```
...
  15964 Staphylococcus aureus
  30158 Listeria monocytogenes
  75036 Escherichia coli
 258360 Salmonella enterica
```

so it's all Salmonella!
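(For anyone repeating this without csvtk, here's a rough Python equivalent of that pipeline - it assumes, as the command above does, that match_name starts with an accession followed by genus and species:)

```python
import csv, glob
from collections import Counter

counts = Counter()
for path in glob.glob("*.prefetch.csv"):  # same glob as the csvtk command
    with open(path, newline="") as fp:
        for row in csv.DictReader(fp):
            # fields 2-3 of match_name = genus + species
            genus_species = " ".join(row["match_name"].split(" ")[1:3])
            counts[genus_species] += 1

# top ten, like `sort -n | tail`
for name, n in counts.most_common(10):
    print(f"{n:7d} {name}")
```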

Some good news, tho, is that with #83 and sourmash 4.1 we can handle 400,000 prefetch matches and do the whole gather and so on in ~5 hours, even for this many genomes!

Oh, and gather picks out only 18 genomes:

```
Cryptococcus neoformans AD hybrid strain=CBS 132, ASM699286v1
Saccharomyces cerevisiae strain=FDAARGOS_613, ASM1227524v1
Escherichia coli strain=blood-2011-0167, ASM78051v1
Salmonella enterica strain=FDAARGOS_609, ASM637049v1
Bacillus subtilis subsp. spizizenii ATCC 6633 strain=ATCC 6633, ASM609447v1
Escherichia coli, Hp_23-13_05
Listeria monocytogenes strain=FDAARGOS_607, ASM636403v1
Staphylococcus aureus strain=FDAARGOS_612, ASM637048v1
Cryptococcus neoformans var. neoformans JEC21 strain=JEC21, ASM9104v1
Lactobacillus fermentum strain=B1 28, ASM534142v1
Pseudomonas aeruginosa strain=FDAARGOS_610, ASM636479v1
Cryptococcus neoformans var. grubii H99 strain=H99, CNA3
Enterococcus faecalis strain=FDAARGOS_611, ASM636481v1
Escherichia coli strain=FDAARGOS_608, ASM637047v1
Saccharomyces pastorianus strain=Hybrid yeast 7, ASM148334v1
Bacillus paranthracis strain=DE0088, ASM768213v1
Saccharomyces cerevisiae strain=JT10.10, ASM327081v1
Saccharomyces cerevisiae strain=beer063, ASM173993v1
```

bluegenes commented 3 years ago

> the max memory usage for doing the genbank prefetch against 280k genomes was 4670832K, or, by my calculations, 5 GB.

How long did this take?

ctb commented 3 years ago

04:45:20 - that's just for the genbank steps (signature already existed, so just sourmash prefetch, gather, and split). The split method can be made faster tho, see issue #32.