Open soldni opened 1 year ago
Dolma still depends on a broken version of jsonpath-rust (0.3.0 or older). The bugfix mentioned above is included in the latest releases; I think the oldest version that includes the fix is 0.3.3. I would recommend bumping the version in Cargo.toml. The latest version is 0.4.0.
This is nice; I will bump in the next version @peterbjorgensen! In the meantime, I recently added support for specifying rules using jq syntax (not the default, but it can be enabled by specifying syntax: jq), e.g.:
streams:
  - name: falcon
    documents:
      - s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/*
    attributes:
      - dedupe_para_ngrams_13_1
      - pii_regex_with_counts_fast_v2
      - tokenizer_repetitions_v2r2
    output:
      max_size_in_bytes: 3_814_697_265
      path: s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v1/documents
      min_text_length: 25
      discard_fields:
        - attributes
    filter:
      include:
        # compute the average duplication factor and only keep docs with at most 30% duplication
        - >-
          (.attributes.dedupe_para_ngrams_13_1 | length == 0) or
          ((.attributes.dedupe_para_ngrams_13_1 | map(.[2] * (.[1] - .[0])) | add) / (.text | length) <= 0.3)
      exclude:
        # Remove documents with more than 10 repeated ngrams
        - >-
          (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition != null) and
          (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition[0][-1] > 10)
        # PII filter
        - .attributes.pii_regex_with_counts_fast_v2__pii_regex_with_counts_fast_v2__doc_count[0][-1] > 5
      syntax: jq
processes: 188
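For anyone reading the include rule above, here is a rough Python sketch of what that jq expression appears to compute. It assumes each entry under dedupe_para_ngrams_13_1 is a [start, end, score] triple, which is inferred from the .[0], .[1], .[2] indexing in the rule. The function names duplication_fraction and keep are hypothetical and are not part of Dolma.

```python
from typing import Any


def duplication_fraction(doc: dict[str, Any],
                         key: str = "dedupe_para_ngrams_13_1") -> float:
    """Score-weighted length of duplicate spans divided by text length.

    Assumes each span is [start, end, score] (an assumption read off the
    jq rule) and that the text is non-empty whenever spans are present.
    """
    spans = (doc.get("attributes") or {}).get(key) or []
    if not spans:
        # Mirrors the `length == 0` branch of the jq rule.
        return 0.0
    weighted = sum(score * (end - start) for start, end, score in spans)
    # Like jq's `length` on a string, len() counts Unicode code points.
    return weighted / len(doc["text"])


def keep(doc: dict[str, Any], threshold: float = 0.3) -> bool:
    """True when the document passes the include rule (<= threshold)."""
    return duplication_fraction(doc) <= threshold
```

For a 100-character document with a single span [0, 40, 1.0], the fraction is 0.4, so the document would be dropped. A document with no spans is always kept.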
Cool. Isn't there a .attributes missing in the example for the exclude filters, i.e. it should be .attributes.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition?
Is it also possible to filter on document metadata, such as .metadata.sub-source == "mygoodsource"?
Because of the strange requirement on how to specify logical filter rules for fields that do not exist in all documents, I looked into this behaviour, and it turned out to be a bug in jsonpath-rust, which is now fixed. You may want to update to a newer version of jsonpath-rust to include this fix: https://github.com/besok/jsonpath-rust/pull/47