allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Simplify how rules in the mixer are provided #50

Open soldni opened 1 year ago

peterbjorgensen commented 1 year ago

Because of the strange requirement on how to specify logical filter rules for fields that do not exist in all documents, I looked into this behaviour. It turned out to be a bug in jsonpath-rust, which is now fixed. You may want to update to a newer version of jsonpath-rust to include this fix: https://github.com/besok/jsonpath-rust/pull/47

peterbjorgensen commented 8 months ago

Dolma still depends on a broken version of jsonpath-rust (0.3.0 or older). The bugfix mentioned above is included in the latest releases; I think the oldest version that includes the fix is 0.3.3. I would recommend bumping the version in the Cargo.toml. The latest version is 0.4.0.

soldni commented 7 months ago

This is nice; I will bump it in the next version, @peterbjorgensen! In the meantime, I recently added support for specifying rules using jq syntax. It is not the default, but it can be enabled by specifying syntax: jq, e.g.:

streams:
  - name: falcon
    documents:
      - s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/*
    attributes:
      - dedupe_para_ngrams_13_1
      - pii_regex_with_counts_fast_v2
      - tokenizer_repetitions_v2r2
    output:
      max_size_in_bytes: 3_814_697_265
      path: s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v1/documents
      min_text_length: 25
      discard_fields:
        - attributes
    filter:
      include:
        # computes average duplication factor and only keeps docs with at most 30% duplication
        - >-
          (.attributes.dedupe_para_ngrams_13_1 | length == 0) or
          ((.attributes.dedupe_para_ngrams_13_1 | map(.[2] * (.[1] - .[0])) | add) / (.text | length) <= 0.3)
      exclude:
        # Remove documents with more than 10 repeated ngrams
        - >-
          (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition != null) and
          (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition[0][-1] > 10)

        # PII filter
        - .attributes.pii_regex_with_counts_fast_v2__pii_regex_with_counts_fast_v2__doc_count[0][-1] > 5
      syntax: jq

processes: 188
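
To make the include rule above easier to follow, here is a rough Python sketch of what the jq expression appears to compute. It assumes each entry of dedupe_para_ngrams_13_1 is a [start, end, score] span, which is inferred from the expression `.[2] * (.[1] - .[0])` and not confirmed elsewhere in this thread. The function name `keep_document` and its parameters are hypothetical and not part of Dolma.

```python
from typing import Any, Dict


def keep_document(
    doc: Dict[str, Any],
    attribute: str = "dedupe_para_ngrams_13_1",
    max_fraction: float = 0.3,
) -> bool:
    """Hypothetical re-statement of the include rule from the config above.

    Assumes each span is [start, end, score]; this layout is an assumption.
    """
    # A missing or null attribute is treated like an empty list,
    # mirroring `length == 0` in the jq expression.
    spans = (doc.get("attributes") or {}).get(attribute) or []
    if not spans:
        return True

    # Sum of score * span length, i.e. map(.[2] * (.[1] - .[0])) | add
    weighted = sum(score * (end - start) for start, end, score in spans)

    # Normalise by document length in characters, i.e. (.text | length).
    # This sketch assumes the text is non-empty whenever spans are present.
    return weighted / len(doc["text"]) <= max_fraction
```

With this reading, a 100-character document with a single fully duplicated span covering 50 characters would be dropped, while one with a 20-character span would be kept.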

peterbjorgensen commented 6 months ago

Cool. Isn't there a .attributes missing in the example for the exclude filters? I.e., shouldn't it be .attributes.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition?

Is it also possible to filter on document metadata, such as .metadata.sub-source == "mygoodsource"?