allenai / bff

Apache License 2.0
Added i) directory support, ii) FP rate args, iii) No-Save option #11

Open revbucket opened 6 months ago

revbucket commented 6 months ago

Several changes to main.rs:

  1. Replaced the per-filename printouts with a progress bar (formatting modeled on the one in wimbd).
  2. Added directory support for inputs: a directory can now be passed as an input, and only its .json*.gz files are selected, same as in wimbd.
  3. Added an option to skip saving the bloom filter at the end (the filter is still updated during the run).
  4. Added an option to compute the filter size dynamically from a desired false-positive rate.
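
To illustrate point 2, here is a minimal sketch of how directory inputs might be expanded. It is not the code from this PR: the names `is_json_gz` and `expand_inputs` are hypothetical, the `.json*.gz` pattern is read literally as "contains `.json` and ends with `.gz`", and the sketch assumes only direct children of a directory are considered (no recursion).

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Returns true for names such as `x.json.gz` or `x.jsonl.gz`.
/// Assumption: this is a literal reading of the `.json*.gz` pattern.
fn is_json_gz(name: &str) -> bool {
    match name.find(".json") {
        // Whatever follows the first ".json" must end with ".gz".
        Some(idx) => name[idx + ".json".len()..].ends_with(".gz"),
        None => false,
    }
}

/// Expands the inputs: plain paths pass through unchanged, while a directory
/// contributes its matching direct children (hypothetical helper).
fn expand_inputs(inputs: &[PathBuf]) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for input in inputs {
        if input.is_dir() {
            for entry in fs::read_dir(input)? {
                let path = entry?.path();
                let matches = path
                    .file_name()
                    .and_then(|n| n.to_str())
                    .map_or(false, is_json_gz);
                if path.is_file() && matches {
                    files.push(path);
                }
            }
        } else {
            files.push(input.clone());
        }
    }
    // read_dir order is platform-dependent, so sort for a stable result.
    files.sort();
    Ok(files)
}
```

Calling `expand_inputs` once up front also gives a total file count, which is what a progress bar (point 1) would need as its length.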

revbucket commented 6 months ago

Oh, and some more notes on point 4: given an accurate ngram count, we search for the smallest bloom filter size that achieves the requested fp_rate. We then return min(filter_size, 0.90 * systemRAM) so that we don't allocate too much memory (maybe I should warn when we hit this cap, though!).
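
The size search described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: it uses the standard Bloom filter estimate p = (1 - e^(-kn/m))^k for n items, m bits, and k hash functions, takes k and the system RAM as plain parameters, and the names `fp_rate` and `suggest_size_bytes` are hypothetical. It also returns a flag so the caller can warn when the RAM cap is hit.

```rust
/// Expected false-positive rate of a Bloom filter with `m_bits` bits and
/// `k` hash functions after inserting `n` items: (1 - e^(-k*n/m))^k.
fn fp_rate(m_bits: f64, n: f64, k: f64) -> f64 {
    (1.0 - (-k * n / m_bits).exp()).powf(k)
}

/// Smallest filter size in bytes whose expected FP rate is at most `target`,
/// capped at 90% of `system_ram_bytes`. Returns `(bytes, was_capped)`.
/// Hypothetical helper; `k` (number of hashes) is an assumed input.
fn suggest_size_bytes(n: u64, k: u32, target: f64, system_ram_bytes: u64) -> (u64, bool) {
    // Integer arithmetic keeps the 90% cap exact.
    let cap = system_ram_bytes / 10 * 9;
    let (n, k) = (n as f64, k as f64);

    // If even the largest allowed filter misses the target, return the cap
    // and flag it so the caller can print a warning.
    let (mut lo, mut hi) = (1u64, cap.max(1));
    if fp_rate(hi as f64 * 8.0, n, k) > target {
        return (cap, true);
    }

    // The FP rate shrinks as the filter grows, so binary search finds the
    // smallest size that satisfies the target.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if fp_rate(mid as f64 * 8.0, n, k) <= target {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    (lo, false)
}
```

For a fixed k the same answer is available in closed form, m = -k n / ln(1 - p^(1/k)), so the search is mainly useful if k is also varied or the rate is estimated some other way.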