Closed ArneBinder closed 10 months ago
This reverts some of the changes of #366, i.e. it re-adds the following statistics (and simplifies their tests):
- `TokenCountCollector`: Collects the token count of a field when tokenizing its content with a Huggingface tokenizer.
- `FieldLengthCollector`: Collects the length of a field, e.g. to collect the number of characters in the input text.
- `SubFieldLengthCollector`: Collects the length of a subfield in a field, e.g. to collect the number of arguments of N-ary relations.
- `DummyCollector`: A dummy collector that always returns 1, e.g. to count the number of documents.

Note that the `SpanLengthCollector` stays in pie-datasets, because it uses `pie_datasets.document.conversion.tokenize_document`.
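For readers unfamiliar with these statistics, the following is a minimal, self-contained sketch of the kind of values a field-length, subfield-length, and dummy collector gather. It does not use the actual collector classes or their signatures: `Doc`, `Relation`, `aggregate`, and the three collect functions are hypothetical names introduced only for illustration, and the aggregation into sum, max, and mean is an assumption.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List, Union


@dataclass
class Relation:
    # Hypothetical N-ary relation: a label plus its arguments.
    label: str
    arguments: List[str]


@dataclass
class Doc:
    # Hypothetical document type, for illustration only.
    text: str
    relations: List[Relation] = field(default_factory=list)


def field_length(doc: Doc) -> int:
    # Length of a field, here the number of characters in the text.
    return len(doc.text)


def subfield_lengths(doc: Doc) -> List[int]:
    # Length of a subfield per entry, here the argument count per relation.
    return [len(rel.arguments) for rel in doc.relations]


def dummy(doc: Doc) -> int:
    # Always 1, so the sum over all documents is the document count.
    return 1


def aggregate(
    docs: List[Doc], collect: Callable[[Doc], Union[int, List[int]]]
) -> Dict[str, float]:
    # Gather one or more values per document, then summarize them.
    values: List[int] = []
    for doc in docs:
        result = collect(doc)
        values.extend(result if isinstance(result, list) else [result])
    return {
        "sum": sum(values),
        "max": max(values, default=0),
        "mean": mean(values) if values else 0.0,
    }


docs = [
    Doc("abc", [Relation("r", ["a", "b"])]),
    Doc("hello", [Relation("r", ["a", "b", "c"]), Relation("q", ["x"])]),
]
text_stats = aggregate(docs, field_length)
argument_stats = aggregate(docs, subfield_lengths)
document_count = aggregate(docs, dummy)["sum"]
```

A token-count collector would follow the same pattern, returning the number of tokens a tokenizer produces for the field instead of its character length.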