Closed rodneykinney closed 6 months ago
Add optional by_ngram section to the dedupe.paragraph config object.
Constructs ngrams of a specified length at a given stride. Falls back to whole-content deduping if the paragraph has fewer tokens than the ngram length.
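As a rough illustration of the behavior described above, here is a minimal Python sketch of ngram construction with a stride and the short-paragraph fallback. It is not the code from this PR: the function name `ngram_windows`, its parameter names, and the use of whitespace tokenization are assumptions made for the example.

```python
from typing import List


def ngram_windows(tokens: List[str], length: int, stride: int) -> List[List[str]]:
    """Return ngrams of `length` tokens, starting every `stride` tokens.

    Hypothetical helper for illustration only; names and tokenization
    are assumptions, not the PR's implementation.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    # Fallback: a paragraph shorter than the ngram length is treated
    # as a single unit (whole-content deduping).
    if len(tokens) < length:
        return [tokens]
    # Slide a window of `length` tokens forward by `stride` each step.
    return [
        tokens[start:start + length]
        for start in range(0, len(tokens) - length + 1, stride)
    ]


# Whitespace splitting stands in for whatever tokenizer is actually used.
paragraph = "the quick brown fox jumps"
windows = ngram_windows(paragraph.split(), length=3, stride=2)
```

With a stride smaller than the length, consecutive ngrams overlap; with a stride equal to the length, they tile the paragraph without overlap, and any trailing tokens that do not fill a full window are not covered in this sketch.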
Still need to fix test data
Made an independent copy of the 000.json.gz input documents file, modified to accommodate the added test. Everything is passing now.