czbiohub-sf / orpheum

Orpheum (previously called, and published as, sencha) is a Python package for directly translating RNA-seq reads into coding protein sequence.
MIT License

How big of a bloom filter should one use? #62

Open olgabot opened 4 years ago

olgabot commented 4 years ago

I think the current default is a table size of 1e8 with 4 tables, which is 50 MB.

At some point I calculated the number of protein k-mers in ENSEMBL human as ~4 x 10^7, so 10^8 is a pretty good upper bound, and those files are ~50 MB, which is reasonable. I think UniProt Opisthokonta was on a similar order of magnitude, but UniProt UniRef (everything!!) was more like 10^11, I think. That includes microbes, though, which have their own rules.

(base) Tue 28 Apr - 19:01 ~/code/kmer-hashing/kh-tools origin ☊ master ✔ 29☀
$ ll *bloomfilter
Permissions Size User    Date Modified Git Name
.rw-r--r--   25M olgabot  4 Nov  2019   -N Homo_sapiens.GRCh38.pep.all.fa__tablesize1e8_ntables2.bloomfilter
.rw-r--r--   50M olgabot 27 Oct  2019   -N Homo_sapiens.GRCh38.pep.all.fa__tablesize1e8_ntables4.bloomfilter
.rw-r--r--  500M olgabot 27 Oct  2019   -N Homo_sapiens.GRCh38.pep.all.fa__tablesize1e9_ntables4.bloomfilter
.rw-r--r--  5.0G olgabot 27 Oct  2019   -N Homo_sapiens.GRCh38.pep.all.fa__tablesize1e10_ntables4.bloomfilter
.rw-r--r--   50M olgabot 27 Oct  2019   -N Homo_sapiens.GRCh38.pep.all.fa__tablesize100000000_ntables4.bloomfilter
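
A rough sketch of the sizing arithmetic behind the listing above. It assumes each of the `ntables` tables is a separate bit array of about `tablesize` bits with one hash function per table; that layout is an assumption, though it matches the file sizes shown (`tablesize * ntables / 8` bytes). The helper names `filter_size_bytes` and `false_positive_rate` are made up for illustration and are not part of orpheum.

```python
import math


def filter_size_bytes(table_size: int, n_tables: int) -> int:
    """Approximate on-disk size, assuming one bit per table slot."""
    return table_size * n_tables // 8


def false_positive_rate(n_kmers: float, table_size: float, n_tables: int) -> float:
    """Expected false positive rate, assuming one hash function per table."""
    # Probability that a given bit in a single table is set after n_kmers inserts
    occupancy = 1.0 - math.exp(-n_kmers / table_size)
    # A false positive needs a hit in every one of the tables
    return occupancy ** n_tables


n_kmers = 4e7  # ~number of human ENSEMBL protein k-mers quoted above
for table_size, n_tables in [(10**8, 2), (10**8, 4), (10**9, 4), (10**10, 4)]:
    size_mb = filter_size_bytes(table_size, n_tables) / 1e6
    fpr = false_positive_rate(n_kmers, table_size, n_tables)
    print(f"tablesize={table_size:.0e} ntables={n_tables} "
          f"size={size_mb:,.0f} MB fpr={fpr:.2e}")
```

Under these assumptions, ~4 x 10^7 k-mers in four tables of 10^8 bits give a false positive rate of roughly 1%, and each tenfold increase in table size costs ten times the disk space. For something on the scale of UniRef (~10^11 k-mers), the same formula suggests a table size of 1e8 would be close to saturated.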

cc @bluegenes

bluegenes commented 4 years ago

thanks @olgabot!