allenai / bff

Apache License 2.0

add option to always hash whole paragraphs #6

Closed IanMagnusson closed 1 year ago

IanMagnusson commented 1 year ago

Adds a simple argument to always hash and match whole paragraphs as a single unit (as long as they are longer than min_ngram_size). We frequently use this for decontamination, but previously had to approximate it by setting a very large max_ngram_size.
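To illustrate the behavior described above, here is a minimal Python sketch, not the actual bff implementation. The function name `ngrams_to_hash` and the `whole_paragraphs` flag are hypothetical. It assumes that n-grams are sliding windows of max_ngram_size tokens, that paragraphs no longer than max_ngram_size are hashed whole, and that paragraphs with fewer than min_ngram_size tokens are skipped (the exact boundary is a guess).

```python
from typing import List, Sequence, Tuple


def ngrams_to_hash(
    tokens: Sequence[str],
    min_ngram_size: int,
    max_ngram_size: int,
    whole_paragraphs: bool = False,  # hypothetical name for the new option
) -> List[Tuple[str, ...]]:
    """Return the token spans of one paragraph that would be hashed."""
    # Assumption: paragraphs shorter than min_ngram_size are never hashed.
    if len(tokens) < min_ngram_size:
        return []
    # With the option on, or when the paragraph already fits in one window,
    # the whole paragraph is hashed as a single unit.
    if whole_paragraphs or len(tokens) <= max_ngram_size:
        return [tuple(tokens)]
    # Otherwise, slide a window of max_ngram_size tokens over the paragraph.
    last_start = len(tokens) - max_ngram_size
    return [tuple(tokens[i : i + max_ngram_size]) for i in range(last_start + 1)]


paragraph = "the quick brown fox jumps over".split()
windows = ngrams_to_hash(paragraph, min_ngram_size=2, max_ngram_size=3)
whole = ngrams_to_hash(paragraph, min_ngram_size=2, max_ngram_size=3, whole_paragraphs=True)
```

Under these assumptions, `windows` holds four overlapping 3-token spans while `whole` holds a single 6-token span. Setting a very large max_ngram_size reaches the same result through the second branch, which is the workaround the option replaces.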