allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

[Issue: `dolma.core.errors.DolmaFatalError` in Step 1: Run Taggers] #194

Closed: yushengsu-thu closed this issue 2 months ago

yushengsu-thu commented 2 months ago

Hello @soldni, I have one more question. When I execute Step 1: Run Taggers,

# --experiment is optional; assigning a name groups taggers in a single directory
dolma tag \
    --documents "wikipedia/v0/documents/*" \
    --experiment exp \
    --taggers random_number_v1 \
              cld2_en_paragraph_with_doc_score_v2 \
              ft_lang_id_en_paragraph_with_doc_score_v2 \
              char_length_with_paragraphs_v1 \
              whitespace_tokenizer_with_paragraphs_v1 \
    --processes 16   # run on 96 cores

I encounter the following issue:

Traceback (most recent call last):
  File "/home/yushensu/miniconda3/envs/data/bin/dolma", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/main.py", line 93, in main
    return cli.run_from_args(args=args, config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/__init__.py", line 192, in run_from_args
    return cls.run(parsed_config)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/tagger.py", line 129, in run
    create_and_run_tagger(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/runtime.py", line 483, in create_and_run_tagger
    tagger_processor(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/parallel.py", line 516, in __call__
    fn(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/parallel.py", line 439, in _multiprocessing_run_all
    result.get()
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
dolma.core.errors.DolmaFatalError: Failed to process wikipedia/v0/documents/wiki_00.gz due to ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword. 

My env:

Python 3.12.4
numpy 2.1.0
dolma 1.0.11

Is this issue caused by the processed data, wikipedia/v0/documents/wiki_00.gz (which I generated with scripts/make_wikipedia.py), or by the dolma codebase? Do you have any suggestions for mitigating or solving it?
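
One way to separate the two possibilities is to check the data file on its own, without dolma. The sketch below is only an illustration: it assumes the file is gzipped JSON Lines and that each document should carry `id`, `text`, and `source` fields. The function name `check_documents` and the field list are hypothetical, so adjust them to your actual schema.

```python
import gzip
import json

# Assumed required fields of a document; this list is an assumption,
# not taken from the thread. Adjust it if your schema differs.
REQUIRED_FIELDS = ("id", "text", "source")


def check_documents(path, limit=100):
    """Return (line_number, problem) pairs found in the first `limit` lines."""
    problems = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if line_number > limit:
                break
            try:
                doc = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(doc, dict):
                problems.append((line_number, "line is not a JSON object"))
                continue
            missing = [key for key in REQUIRED_FIELDS if key not in doc]
            if missing:
                problems.append((line_number, f"missing fields: {missing}"))
    return problems
```

Calling `check_documents("wikipedia/v0/documents/wiki_00.gz")` and getting an empty list would suggest the file is well formed, which points toward the library code rather than the data.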

soldni commented 2 months ago

Oh, I think it's because numpy 2.x is not backward compatible with some numpy 1.x APIs. I'm cutting a quick fix and a new release (dolma 1.0.12) momentarily.
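
For context, here is a minimal sketch of the behavior change named in the error message. It does not reproduce the actual call site, which is not shown in this thread; the helper names `copy_false_raises` and `to_array` are hypothetical. In NumPy 1.x, `np.array(obj, copy=False)` copies silently when it has to; in NumPy 2.x it raises `ValueError` if a copy cannot be avoided, and `np.asarray(obj)` is the suggested replacement.

```python
import numpy as np


def copy_false_raises(obj):
    """Return True if np.array(obj, copy=False) raises ValueError."""
    try:
        np.array(obj, copy=False)
    except ValueError:
        return True
    return False


def to_array(obj):
    """Convert to an array on NumPy 1.x and 2.x, copying only when needed."""
    return np.asarray(obj)


# A Python list always needs a copy, so this is where the versions differ.
print(np.__version__, copy_false_raises([1, 2, 3]))
```

An existing `ndarray` passes through `to_array` without a copy, so switching to `np.asarray` keeps the no-copy fast path while avoiding the error on inputs that need one.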

yushengsu-thu commented 2 months ago

@soldni thanks for your reply. I found that this issue originates in Step 0: Obtain Wikipedia, because of the wikiextractor package that step uses.

I have found a temporary workaround: downgrade Python (3.12 --> 3.11) and pin the packages to the following versions:

Python 3.11.9
numpy 1.26.3
wikiextractor 3.0.6
dolma 1.0.11

Then re-run Step 0: Obtain Wikipedia

python scripts/make_wikipedia.py \
  --output wikipedia \
  --date 20231001 \
  --lang simple \
  --processes 16

and use the resulting processed data for Step 1: Run Taggers. This mitigates the issue.
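
Before re-running the steps, it may help to confirm that the environment really has a 1.x numpy installed. The snippet below is a small sketch using only the standard library; the helper names `major_version` and `numpy_is_1x` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '1.26.3'."""
    return int(version_string.split(".")[0])


def numpy_is_1x():
    """True if the installed numpy is 1.x, False otherwise, None if absent."""
    try:
        return major_version(version("numpy")) == 1
    except PackageNotFoundError:
        return None


print("numpy is 1.x:", numpy_is_1x())
```

The same pattern works for other pinned packages by passing their distribution names to `version`.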