allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

deduplication example does not work #96

Closed TTTTao725 closed 5 months ago

TTTTao725 commented 9 months ago

I tested the document deduplication command on a sample dataset and got the following error.

To replicate the problem I encountered, I've generated a dummy dataset consisting of six documents:

import gzip
import json

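# Five identical documents plus one with different text, to exercise deduplication.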
documents = [
    {"id": str(i), "text": 'dummy', "source": "peS2o"} for i in range(5)
]

documents.append({"id": str(5), "text": 'dummy_test', "source": "peS2o"})

file_path = '/foo/test_on_dummy_dataset/dummy_data.jsonl.gz'

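# Write the documents as gzipped JSON Lines, one JSON object per line.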
with gzip.open(file_path, 'wt', encoding='UTF-8') as f:
    for document in documents:
        f.write(json.dumps(document) + '\n')

Then I tried to run:

RUST_BACKTRACE=full dolma dedupe \
    --documents "/foo/test_on_dummy_dataset/documents/dummy_data.jsonl.gz" \
    --dedupe.documents.attribute_name 'duplicate_documents' \
    --dedupe.skip_empty \
    --bloom_filter.file /tmp/deduper_bloom_filter_test.bin \
    --no-bloom_filter.read_only \
    --bloom_filter.estimated_doc_count '5' \
    --bloom_filter.desired_false_positive_rate '0.0001' \
    --processes 16

(P.S. I have also tried reducing the number of processes to 1; that didn't work either.)

which produces the following output and error:

bloom_filter:
  desired_false_positive_rate: 0.0001
  estimated_doc_count: 5
  file: deduper_bloom_filter_test.bin
  read_only: false
  size_in_bytes: 0
dedupe:
  documents:
    attribute_name: duplicate_documents
    key: ???
  name: duplicate_documents
  skip_empty: true
documents:
- /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz
processes: 16
work_dir:
  input: /tmp/dolma-input-4uglco2b
  output: /tmp/dolma-output-wrrv_wfd
[2023-12-18T15:23:01Z INFO  dolma::bloom_filter] Loading bloom filter from "deduper_bloom_filter_test.bin"...
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing attributes for /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz to /work/github/test_on_dummy_dataset/attributes/duplicate_documents/dummy_data.jsonl.gz.tmp
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing attributes for /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz to /work/github/test_on_dummy_dataset/attributes/duplicate_documents/dummy_data.jsonl.gz.tmp
thread '<unnamed>' panicked at src/deduper.rs:152:26:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: " --> 1:1\n  |\n1 | ???\n  | ^---\n  |\n  = expected chain" }
stack backtrace:
   0:     0x7f01e9d139ec - std::backtrace_rs::backtrace::libunwind::trace::he43a6a3949163f8c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x7f01e9d139ec - std::backtrace_rs::backtrace::trace_unsynchronized::h50db52ca99f692e7
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7f01e9d139ec - std::sys_common::backtrace::_print_fmt::hd37d595f2ceb2d3c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x7f01e9d139ec - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h678bbcf9da6d7d75
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7f01e9d4297c - core::fmt::rt::Argument::fmt::h3a159adc080a6fc9
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/rt.rs:138:9
   5:     0x7f01e9d4297c - core::fmt::write::hb8eaf5a8e45a738e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/mod.rs:1094:21
   6:     0x7f01e9d0fd8e - std::io::Write::write_fmt::h9663fe36b2ee08f9
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/io/mod.rs:1714:15
   7:     0x7f01e9d137d4 - std::sys_common::backtrace::_print::hcd4834796ee88ad2
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7f01e9d137d4 - std::sys_common::backtrace::print::h1360e9450e4f922a
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7f01e9d14f03 - std::panicking::default_hook::{{closure}}::h2609fa95cd5ab1f4
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:270:22
  10:     0x7f01e9d14c1c - std::panicking::default_hook::h6d75f5747cab6e8d
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:290:9
  11:     0x7f01e9d15489 - std::panicking::rust_panic_with_hook::h57e78470c47c84de
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:707:13
  12:     0x7f01e9d15387 - std::panicking::begin_panic_handler::{{closure}}::h3dfd2453cf356ecb
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:599:13
  13:     0x7f01e9d13f16 - std::sys_common::backtrace::__rust_end_short_backtrace::hdb177d43678e4d7e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:170:18
  14:     0x7f01e9d150d2 - rust_begin_unwind
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
  15:     0x7f01e9628673 - core::panicking::panic_fmt::hd1e971d8d7c78e0e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
  16:     0x7f01e9628a7a - core::result::unwrap_failed::hccb456d39e9c31fc
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
  17:     0x7f01e9709c58 - <F as threadpool::FnBox>::call_box::h26117aa625de9352
  18:     0x7f01e9cc82b0 - std::sys_common::backtrace::__rust_begin_short_backtrace::he93b09d651d5d863
  19:     0x7f01e9cc403a - core::ops::function::FnOnce::call_once{{vtable.shim}}::hcf05def3d47db391
  20:     0x7f01e9d1a955 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::haadd4e5af2ab0d62
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  21:     0x7f01e9d1a955 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::he4ba1fb09c16d807
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  22:     0x7f01e9d1a955 - std::sys::unix::thread::Thread::new::thread_start::he524ecf4b47bee95
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  23:     0x7f01ea51eac3 - <unknown>
  24:     0x7f01ea5b0a40 - <unknown>
  25:                0x0 - <unknown>
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing bloom filter to "deduper_bloom_filter_test.bin"...
[2023-12-18T15:23:02Z INFO  dolma::deduper] Bloom filter written.
[2023-12-18T15:23:02Z INFO  dolma::deduper] Done!
soldni commented 8 months ago

Hi!

Thank you for the bug report. To clarify, is this using the version currently available on PyPI, or is it running on main branch?

peterbjorgensen commented 8 months ago

I think the problem is that the document-level deduplication option is meant to be used with document metadata, so that you can, for example, filter out documents that share the same URL. So if you set --dedupe.documents.attribute_name 'duplicate_documents', you must also set this option:

--dedupe.documents.key [DEDUPE.DOCUMENTS.KEY]
                        Name of the input field to use for deduplication, e.g. `$.metadata.url`

The actual content-based deduplication runs on paragraphs, but you can filter out entire documents if too many of their paragraphs are duplicates. This is not super clear in the documentation. If you set --dedupe.paragraphs.attribute_name 'bff_duplicate_paragraph_spans' instead of --dedupe.documents.attribute_name, it should work; see the sketch below.
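
Roughly, the two working invocations would look like this (just a sketch reusing the paths and bloom filter settings from the command above; the $.metadata.url key is only the example from the help text, so the document-level variant assumes your documents actually carry such a metadata field):

# Paragraph-level (content-based) dedupe, as suggested above:
dolma dedupe \
    --documents "/foo/test_on_dummy_dataset/documents/dummy_data.jsonl.gz" \
    --dedupe.paragraphs.attribute_name 'bff_duplicate_paragraph_spans' \
    --dedupe.skip_empty \
    --bloom_filter.file /tmp/deduper_bloom_filter_test.bin \
    --no-bloom_filter.read_only \
    --bloom_filter.estimated_doc_count '5' \
    --bloom_filter.desired_false_positive_rate '0.0001' \
    --processes 1

# Document-level dedupe on a metadata field (attribute_name plus key):
dolma dedupe \
    --documents "/foo/test_on_dummy_dataset/documents/dummy_data.jsonl.gz" \
    --dedupe.documents.attribute_name 'duplicate_documents' \
    --dedupe.documents.key '$.metadata.url' \
    --dedupe.skip_empty \
    --bloom_filter.file /tmp/deduper_bloom_filter_test.bin \
    --no-bloom_filter.read_only \
    --bloom_filter.estimated_doc_count '5' \
    --bloom_filter.desired_false_positive_rate '0.0001' \
    --processes 1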

TTTTao725 commented 8 months ago

Thanks Peter, I had used --dedupe.paragraphs.attribute_name 'bff_duplicate_paragraph_spans' to filter documents before, but it was quite aggressive: 5000 documents were reduced to 44 after filtering. So I wondered whether I had used the wrong command, and tried --dedupe.documents.attribute_name 'duplicate_documents' instead.

Good to know the right way of using dedupe.documents :)

Happy new year! /Tao

TTTTao725 commented 8 months ago

> Hi!
>
> Thank you for the bug report. To clarify, is this using the version currently available on PyPI, or is it running on main branch?

The one I'm using: pip install git+https://github.com/allenai/dolma.git@28727302a34225f4ce70ce97940cb18ad9a91583

soldni commented 5 months ago

Closing since this should now be fixed in main / the latest release. LMK if that's not the case!