allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

dedupe_paragraphs not working #218

Closed wannaphong closed 1 week ago

wannaphong commented 1 week ago

I can't get dedupe_paragraphs to run on my dataset. Could you help me troubleshoot this?

My config json:

{
    "documents": [
      "./dataset/documents/data.jsonl.gz"
    ],
    "dedupe": {
      "name": "dedupe_paragraphs",
      "documents": {
        "attribute_name": "bff_duplicate_paragraph_spans",
        "key": "text"
      },
      "skip_empty": true
    },
    "bloom_filter": {
      "file": "./dedupe_paragraph_spans_bloom_filter.bin",
      "read_only": false,
      "estimated_doc_count": 200,
      "desired_false_positive_rate": 0.0001
    },
    "processes": 1
}

Log:

bloom_filter:
  desired_false_positive_rate: 0.0001
  estimated_doc_count: 200
  file: dedupe_paragraph_spans_bloom_filter.bin
  read_only: false
  size_in_bytes: 0
compression:
  input: null
  output: null
dedupe:
  documents:
    attribute_name: bff_duplicate_paragraph_spans
    key: text
  min_length: 0
  min_words: 0
  name: dedupe_paragraphs
  num_partitions: 1
  partition_index: 0
  skip_empty: true
documents:
- ./dataset/documents/data.jsonl.gz
is_s3_volume: false
processes: 1
work_dir:
  input: /tmp/dolma-input-7ckvonki
  output: /tmp/dolma-output-drzxhv8f
[2024-10-23T10:07:17Z INFO  dolma::bloom_filter] Loading bloom filter from "dedupe_paragraph_spans_bloom_filter.bin"...
[2024-10-23T10:07:17Z INFO  dolma::deduper] Writing attributes for dataset/documents/data.jsonl.gz to dataset/attributes/dedupe_paragraphs/data.jsonl.gz.tmp
[2024-10-23T10:07:17Z INFO  dolma::deduper] Writing attributes for dataset/documents/data.jsonl.gz to dataset/attributes/dedupe_paragraphs/data.jsonl.gz.tmp
thread '<unnamed>' panicked at src/deduper.rs:208:26:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: "Failed to parse rule:  --> 1:1\n  |\n1 | text\n  | ^---\n  |\n  = expected chain" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2024-10-23T10:07:17Z INFO  dolma::deduper] Writing bloom filter to "dedupe_paragraph_spans_bloom_filter.bin"...
[2024-10-23T10:07:17Z INFO  dolma::deduper] Bloom filter written.
[2024-10-23T10:07:17Z INFO  dolma::deduper] Done!

wannaphong commented 1 week ago

I can run bff_duplicate_docs, but dedupe_paragraphs and pii don't work.

wannaphong commented 1 week ago

Closing. I changed "documents" to "paragraphs" in the "dedupe" section of the config and it worked.
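
For anyone hitting the same panic, here is a sketch of what the "dedupe" section looks like after that rename. It assumes nothing else in the config above changed. The thread does not show the final file, so whether "key" is still needed (or is simply ignored) under "paragraphs" is not confirmed here.

```json
{
  "dedupe": {
    "name": "dedupe_paragraphs",
    "paragraphs": {
      "attribute_name": "bff_duplicate_paragraph_spans",
      "key": "text"
    },
    "skip_empty": true
  }
}
```

The top-level "documents" list (the input paths) stays as it was. Only the "documents" object nested under "dedupe" is renamed.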