Closed wannaphong closed 1 week ago
I can't run dedupe_paragraphs on my dataset. Could you help me troubleshoot this?
My config json:
{
  "documents": [
    "./dataset/documents/data.jsonl.gz"
  ],
  "dedupe": {
    "name": "dedupe_paragraphs",
    "documents": {
      "attribute_name": "bff_duplicate_paragraph_spans",
      "key": "text"
    },
    "skip_empty": true
  },
  "bloom_filter": {
    "file": "./dedupe_paragraph_spans_bloom_filter.bin",
    "read_only": false,
    "estimated_doc_count": 200,
    "desired_false_positive_rate": 0.0001
  },
  "processes": 1
}
Log:
bloom_filter:
  desired_false_positive_rate: 0.0001
  estimated_doc_count: 200
  file: dedupe_paragraph_spans_bloom_filter.bin
  read_only: false
  size_in_bytes: 0
compression:
  input: null
  output: null
dedupe:
  documents:
    attribute_name: bff_duplicate_paragraph_spans
    key: text
  min_length: 0
  min_words: 0
  name: dedupe_paragraphs
  num_partitions: 1
  partition_index: 0
  skip_empty: true
documents:
- ./dataset/documents/data.jsonl.gz
is_s3_volume: false
processes: 1
work_dir:
  input: /tmp/dolma-input-7ckvonki
  output: /tmp/dolma-output-drzxhv8f
[2024-10-23T10:07:17Z INFO dolma::bloom_filter] Loading bloom filter from "dedupe_paragraph_spans_bloom_filter.bin"...
[2024-10-23T10:07:17Z INFO dolma::deduper] Writing attributes for dataset/documents/data.jsonl.gz to dataset/attributes/dedupe_paragraphs/data.jsonl.gz.tmp
[2024-10-23T10:07:17Z INFO dolma::deduper] Writing attributes for dataset/documents/data.jsonl.gz to dataset/attributes/dedupe_paragraphs/data.jsonl.gz.tmp
thread '<unnamed>' panicked at src/deduper.rs:208:26:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: "Failed to parse rule: --> 1:1\n |\n1 | text\n | ^---\n |\n = expected chain" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2024-10-23T10:07:17Z INFO dolma::deduper] Writing bloom filter to "dedupe_paragraph_spans_bloom_filter.bin"...
[2024-10-23T10:07:17Z INFO dolma::deduper] Bloom filter written.
[2024-10-23T10:07:17Z INFO dolma::deduper] Done!
I can run bff_duplicate_docs, but dedupe_paragraphs and pii don't work.
Closed. I changed the "documents" key inside "dedupe" to "paragraphs" and it worked.
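For anyone hitting the same panic, here is a minimal sketch of that change applied to the config above. It assumes the key being renamed is the one nested under "dedupe" (the top-level "documents" list of input files stays as it is). The helper name `switch_to_paragraph_dedupe` is made up for this illustration and is not part of dolma. The thread does not say whether "key" is still needed under "paragraphs", so the sketch carries it over unchanged.

```python
import copy
import json

# Config exactly as posted in the question.
original_config = {
    "documents": ["./dataset/documents/data.jsonl.gz"],
    "dedupe": {
        "name": "dedupe_paragraphs",
        "documents": {
            "attribute_name": "bff_duplicate_paragraph_spans",
            "key": "text",
        },
        "skip_empty": True,
    },
    "bloom_filter": {
        "file": "./dedupe_paragraph_spans_bloom_filter.bin",
        "read_only": False,
        "estimated_doc_count": 200,
        "desired_false_positive_rate": 0.0001,
    },
    "processes": 1,
}


def switch_to_paragraph_dedupe(config):
    """Hypothetical helper: return a copy of the config in which the
    "documents" block under "dedupe" is renamed to "paragraphs"."""
    fixed = copy.deepcopy(config)  # leave the caller's dict untouched
    dedupe = fixed["dedupe"]
    if "documents" in dedupe:
        # Move the nested block as-is; only the key name changes.
        dedupe["paragraphs"] = dedupe.pop("documents")
    return fixed


fixed_config = switch_to_paragraph_dedupe(original_config)
# Print the edited "dedupe" section so it can be pasted back into the JSON file.
print(json.dumps(fixed_config["dedupe"], indent=2))
```

Only the nested key is renamed; the input file list, bloom filter settings, and process count are left alone.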