I think the current default is a table size of 1e8 with 4 tables, which comes to 50MB.
At some point I calculated the number of protein k-mers in Ensembl human as ~4 x 10^7, so 10^8 is a pretty good upper bound, and those files are ~50M, which is reasonable. I think UniProt Opisthokonta was on a similar order of magnitude, but UniProt UniRef (everything!!) was more like 10^11, though that includes microbes, which have their own rules.
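For a rough sense of why 10^8 is comfortable for ~4 x 10^7 k-mers but would not be for ~10^11, here is a back-of-the-envelope sketch. It assumes the filter is laid out as `n_tables` separate bit tables of about `tablesize` bits each, with one bit set per table per k-mer; that layout and the function name are my assumptions, not something checked against the actual implementation.

```python
import math


def estimated_false_positive_rate(n_kmers, tablesize, n_tables):
    """Approximate false positive rate of a Bloom filter.

    Assumes n_tables independent bit tables of ~tablesize bits each,
    with every k-mer setting one bit per table (hypothetical layout).
    """
    # Expected fraction of bits set in a single table
    occupancy = 1.0 - math.exp(-n_kmers / tablesize)
    # A false positive needs a hit in every table
    return occupancy ** n_tables


# Ensembl human protein k-mers (~4e7) with tablesize 1e8 and 4 tables
human_fpr = estimated_false_positive_rate(4e7, 1e8, 4)
# Something UniRef-sized (~1e11) with the same settings
uniref_fpr = estimated_false_positive_rate(1e11, 1e8, 4)

print(f"human:  {human_fpr:.4f}")
print(f"uniref: {uniref_fpr:.4f}")
```

Under those assumptions the human case comes out at roughly 1% false positives, while the UniRef-sized case saturates the tables completely, so it would need a much larger table size.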
(base)
Tue 28 Apr - 19:01 ~/code/kmer-hashing/kh-tools origin ☊ master ✔ 29☀
ll *bloomfilter
Permissions Size User Date Modified Git Name
.rw-r--r-- 25M olgabot 4 Nov 2019 -N Homo_sapiens.GRCh38.pep.all.fa__tablesize1e8_ntables2.bloomfilter
.rw-r--r-- 50M olgabot 27 Oct 2019 -N Homo_sapiens.GRCh38.pep.all.fa__tablesize1e8_ntables4.bloomfilter
.rw-r--r-- 500M olgabot 27 Oct 2019 -N Homo_sapiens.GRCh38.pep.all.fa__tablesize1e9_ntables4.bloomfilter
.rw-r--r-- 5.0G olgabot 27 Oct 2019 -N Homo_sapiens.GRCh38.pep.all.fa__tablesize1e10_ntables4.bloomfilter
.rw-r--r-- 50M olgabot 27 Oct 2019 -N Homo_sapiens.GRCh38.pep.all.fa__tablesize100000000_ntables4.bloomfilter
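The file sizes in the listing above are consistent with each table entry taking one bit, so the size is roughly `tablesize * ntables / 8` bytes. A quick sketch of that arithmetic (the function name is made up for illustration, and any file header overhead is ignored):

```python
def bloomfilter_size_bytes(tablesize, n_tables):
    """Approximate on-disk size, assuming one bit per table entry
    and no header overhead (both are assumptions)."""
    return int(tablesize) * n_tables // 8


# (tablesize, n_tables) pairs taken from the filenames in the listing
settings = [(1e8, 2), (1e8, 4), (1e9, 4), (1e10, 4)]

for tablesize, n_tables in settings:
    size_mb = bloomfilter_size_bytes(tablesize, n_tables) / 1e6
    print(f"tablesize={tablesize:.0e} ntables={n_tables}: {size_mb:,.0f} MB")
```

This gives 25, 50, 500 and 5,000 MB, which lines up with the 25M, 50M, 500M and 5.0G files.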
cc @bluegenes