Closed by dirkgr 1 year ago
To be sure, I don't know what the point of this is. It would be insane to run deduplication this way. It would be borderline insane to even count ngrams this way (i.e., with `--annotate-only`). And it's not implementing the full GPT2 spec from #3 either, which is already insane all by itself.
So I'm just writing the code and I trust that you know what you're doing with it.
Other changes:
- `"bff_duplicate_spans"` and `"bff_contained_ngram_count"`. Those names have changed.
- `bff` will always create the `"bff_contained_ngram_count"` field, even when running deduplication.