allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Deduplication / Decontamination #174

Open chschroeder opened 2 months ago

chschroeder commented 2 months ago

Hi,

dolma is a wonderful tool, and I'm successfully using it for many steps of my pipeline.

I can get it working for (paragraph-level) deduplication. Strangely, when I apply it in a similar setting for decontamination, it never assigns any attributes.

What is the problem?

Compared to the "normal" paragraph deduplication, when I just apply an existing bloom filter, the resulting attribute files contain no dedupe attributes. I have already experimented with the desired_false_positive_rate and overlap_threshold parameters, but without any success.

{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL1"}
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL2"}
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL3"}

Infos about my setup:

I am using the latest dolma 1.0.3 release. My latest minimum working example is based on configs/dolma-v1_5/decontamination.

Here are my config files.

create-bloomfilter.yaml:
```
documents:
  - benchmarks.jsonl.gz  # these are the files I want to filter with the decontamination step

dedupe:
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
  skip_empty: true

bloom_filter:
  read_only: false
  estimated_doc_count: 73543
  #size_in_bytes: 104857  # 100 MB; smaller causes too many FPs
  desired_false_positive_rate: 1e-3  # TOD: 1e-15
  file: decontamination_bloom_filter.bin

processes: 4
```

decontaminate.yaml:
```
documents:
  - tmp/v0/documents/*.gz

work_dir:
  input: work/para/input
  output: work/para/output

dedupe:
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
  skip_empty: true

bloom_filter:
  read_only: true
  estimated_doc_count: 288347
  desired_false_positive_rate: 1e-3
  file: decontamination_bloom_filter.bin

processes: 3
```
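For context on how `estimated_doc_count` and `desired_false_positive_rate` interact, the textbook bloom filter sizing formula gives a rough idea of the filter size these settings imply. This is a sketch of the standard formula only; it is an assumption that dolma's own sizing behaves similarly, and the function name `bloom_filter_size` is hypothetical. Note also that with paragraph-level deduplication the number of inserted items is the number of paragraphs, which can be much larger than the number of documents.

```python
import math


def bloom_filter_size(expected_items, false_positive_rate):
    """Textbook bloom filter sizing: returns (size_in_bytes, num_hash_functions).

    Not dolma's implementation; it only illustrates how the two
    config values relate to the size of the filter.
    """
    if expected_items <= 0 or not 0.0 < false_positive_rate < 1.0:
        raise ValueError("expected_items must be > 0 and 0 < false_positive_rate < 1")
    # Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
    # Optimal number of hash functions: k = (m / n) * ln 2
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return math.ceil(bits / 8), hashes


size_in_bytes, num_hashes = bloom_filter_size(73543, 1e-3)
print(size_in_bytes, num_hashes)
```

Under this formula, a filter that is too small for the real number of inserted items would produce too many matches, not zero, so an undersized filter alone is unlikely to explain completely empty attributes.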
Here is the output.

dolma -c create-bloomfilter.yaml dedupe
```
bloom_filter:
  desired_false_positive_rate: 0.001
  estimated_doc_count: 73543
  file: decontamination_bloom_filter.bin
  read_only: false
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: ' '
  skip_empty: true
documents:
- benchmarks.jsonl.gz
processes: 4
work_dir:
  input: /tmp/dolma-input-1rmq0gbx
  output: /tmp/dolma-output-ky8van2k
[2024-06-27T12:34:26Z INFO dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO dolma::deduper] Skipping "/disk/cschroeder/workspaces/dolma/benchmarks.jsonl.gz" because it already exists
[2024-06-27T12:34:26Z INFO dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO dolma::deduper] Bloom filter written.
[2024-06-27T12:34:26Z INFO dolma::deduper] Done!
```

dolma -c decontaminate.yaml dedupe
```
bloom_filter:
  desired_false_positive_rate: 0.1
  estimated_doc_count: 288347
  file: decontamination_bloom_filter.bin
  read_only: true
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: ' '
  skip_empty: true
documents:
- tmp/v0/documents/*.gz
processes: 3
work_dir:
  input: work/para/input
  output: work/para/output
[2024-06-27T12:38:17Z INFO dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0000.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0001.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0002.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0003.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0004.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:22Z INFO dolma::deduper] Bloom filter written.
[2024-06-27T12:38:22Z INFO dolma::deduper] Done!
```
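One detail in the first log may be worth checking: the bloom filter creation run reports `Skipping ... benchmarks.jsonl.gz ... because it already exists` and shows no `Writing attributes for ...` line. It is an assumption, not something confirmed in this thread, that a skipped input adds nothing to the filter, which would leave the saved filter empty. The sketch below separates skipped from processed inputs in captured log text; the patterns are taken only from the log lines quoted here and may not match other dolma versions, and `summarize_dedupe_log` is a hypothetical helper.

```python
import re

# Patterns derived from the log lines quoted in this thread (assumed format).
SKIP_PATTERN = re.compile(r'Skipping "([^"]+)" because it already exists')
WRITE_PATTERN = re.compile(r"Writing attributes for (\S+) to")


def summarize_dedupe_log(log_text):
    """Return which input files the log reports as skipped or processed."""
    skipped = sorted(set(SKIP_PATTERN.findall(log_text)))
    processed = sorted(set(WRITE_PATTERN.findall(log_text)))
    return {"skipped": skipped, "processed": processed}


create_log = (
    '[2024-06-27T12:34:26Z INFO dolma::deduper] Skipping '
    '"/disk/cschroeder/workspaces/dolma/benchmarks.jsonl.gz" because it already exists\n'
    '[2024-06-27T12:34:26Z INFO dolma::deduper] Writing bloom filter to '
    '"decontamination_bloom_filter.bin"...\n'
)
summary = summarize_dedupe_log(create_log)
print(summary)
```

If every input of the creation run lands in `skipped`, removing the previously written attribute output for those inputs before re-running would be a reasonable experiment.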

Am I missing something?

Reventh-Sharma commented 3 weeks ago

I'm facing the same issue. Did you find anything?

chschroeder commented 3 weeks ago

Unfortunately not, I stopped trying after this.

ahmeda14960 commented 2 weeks ago

Hi all, wanted to bring up this thread because I am running into the same issue as well. If anyone has tips it would be greatly appreciated!