NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
327 stars · 32 forks

exact_deduplication.py out_of_memory #112

Closed simplew2011 closed 2 weeks ago

simplew2011 commented 2 weeks ago
Torch is using RMM memory pool
Reading 2 files
Traceback (most recent call last):
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/examples/exact_deduplication.py", line 88, in <module>
    main(attach_args().parse_args())
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/examples/exact_deduplication.py", line 43, in main
    input_dataset = DocumentDataset.read_json(dataset_dir, backend=backend)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/datasets/doc_dataset.py", line 45, in read_json
    _read_json_or_parquet(
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/datasets/doc_dataset.py", line 197, in _read_json_or_parquet
    raw_data = read_data(
               ^^^^^^^^^^
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 296, in read_data
    return dd.from_map(
           ^^^^^^^^^^^^
  File "/home/wzp/miniconda3/envs/rapids/lib/python3.11/site-packages/dask_expr/_collection.py", line 5859, in from_map
    result = new_collection(
             ^^^^^^^^^^^^^^^
  File "/home/wzp/miniconda3/envs/rapids/lib/python3.11/site-packages/dask_expr/_collection.py", line 4764, in new_collection
    meta = expr._meta
           ^^^^^^^^^^
  File "/home/wzp/miniconda3/envs/rapids/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/wzp/miniconda3/envs/rapids/lib/python3.11/site-packages/dask_expr/io/io.py", line 241, in _meta
    meta = self.func(*vals, *self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 228, in read_single_partition
    df = read_f(file, **read_kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wzp/miniconda3/envs/rapids/lib/python3.11/site-packages/cudf/io/json.py", line 96, in read_json
    df = libjson.read_json(
         ^^^^^^^^^^^^^^^^^^
  File "json.pyx", line 45, in cudf._lib.json.read_json
  File "json.pyx", line 137, in cudf._lib.json.read_json
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /home/wzp/miniconda3/envs/rapids/include/rmm/mr/device/cuda_memory_resource.hpp
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
| 32%   29C    P8               30W / 350W|    300MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:25:00.0 Off |                  N/A |
| 32%   30C    P8               30W / 350W|      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090         On | 00000000:41:00.0 Off |                  N/A |
| 30%   33C    P8               35W / 350W|      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090         On | 00000000:61:00.0 Off |                  N/A |
| 30%   30C    P8               34W / 350W|      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 3090         On | 00000000:81:00.0 Off |                  N/A |
| 41%   27C    P8               17W / 350W|      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 3090         On | 00000000:A1:00.0 Off |                  N/A |
| 32%   26C    P8               22W / 350W|      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 3090         On | 00000000:C1:00.0 Off |                  N/A |
| 30%   29C    P8               24W / 350W|      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 3090         On | 00000000:E1:00.0 Off |                  N/A |
| 30%   29C    P8               24W / 350W|      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
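Since the cards above each have 24 GiB and cuDF's JSON reader must materialize a whole file per partition, a simple pre-flight size check can catch files likely to trigger this std::bad_alloc before any GPU work starts. A minimal sketch (oversized_jsonl_files is a hypothetical helper, not part of NeMo-Curator):

```python
import os

def oversized_jsonl_files(paths, limit_bytes=2 * 1024**3):
    """Return the subset of paths whose on-disk size exceeds limit_bytes.

    Hypothetical helper: files beyond ~2 GiB on disk tend to blow past a
    24 GiB card's memory once decoded, so flagging them up front avoids
    an out_of_memory crash mid-pipeline.
    """
    return [p for p in paths if os.path.getsize(p) > limit_bytes]
```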
simplew2011 commented 2 weeks ago

nemo_curator 0.3.0
nemo_toolkit 1.23.0

simplew2011 commented 2 weeks ago
python examples/fuzzy_deduplication.py --device gpu

Torch is using RMM memory pool
<Client: 'tcp://127.0.0.1:40947' processes=8 threads=8, memory=251.80 GiB>
http://127.0.0.1:8787/status
Reading 20 files
Stage1: Starting Minhash + LSH computation
Traceback (most recent call last):
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/examples/fuzzy_deduplication.py", line 119, in <module>
    main(attach_args().parse_args())
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/examples/fuzzy_deduplication.py", line 89, in main
    duplicates = fuzzy_dup(dataset=input_dataset)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 480, in __call__
    buckets_df = minhashLSH(dataset)
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
              ^^^^^^^^^^^^^^^
  File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 162, in __call__
    result["_minhash_signature"] = dataset.df[self.text_field].map_partitions(
TypeError: 'Scalar' object does not support item assignment
ayushdg commented 2 weeks ago

Thanks for raising. Answering some of your questions inline:

Do the input files need to be very small? I split the 20GB file into 1GB files, and the program ran normally.

Yes. Since we cannot split jsonl files into smaller subsets during reading, it's recommended to work with jsonl files smaller than 2GB. Anywhere between 256MB and 1GB is typically a good size for a single jsonl file. nemo_curator has a make_data_shards CLI tool to help split larger jsonl files into smaller ones: https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/make_data_shards.py
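The make_data_shards tool linked above is the supported way to do this. As a rough illustration of what sharding a jsonl file involves, here is a minimal pure-Python sketch (shard_jsonl is a hypothetical name, not NeMo-Curator's implementation; it splits on line boundaries so no JSON record is ever cut in half):

```python
import os

def shard_jsonl(in_path, out_dir, max_bytes=256 * 1024**2):
    """Split one large .jsonl file into shards of roughly max_bytes each.

    Records (lines) are never split across shards, so a shard may exceed
    max_bytes by at most one record. Returns the list of shard paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    shards, out, written, shard_idx = [], None, 0, 0
    with open(in_path, "rb") as f:
        for line in f:
            if out is None or written >= max_bytes:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"shard_{shard_idx:05d}.jsonl")
                out = open(path, "wb")
                shards.append(path)
                shard_idx, written = shard_idx + 1, 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return shards
```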

File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 162, in __call__
    result["_minhash_signature"] = dataset.df[self.text_field].map_partitions(
TypeError: 'Scalar' object does not support item assignment

Can you try running export DASK_DATAFRAME__QUERY_PLANNING=False before running the fuzzy_deduplication script to see if that fixes the issue? Can you also share the dask version in your environment? Importing dask before nemo_curator enables query planning, and our checks don't effectively detect when it is enabled. I'm attempting to improve this behavior in #107, and there's some discussion about better detection in dask/dask#11175 as well.
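The same variable can also be set from inside Python, provided it happens before dask (and therefore nemo_curator) is imported, since dask reads it into its config at import time. A minimal sketch:

```python
import os

# Must run before any 'import dask' / 'import nemo_curator' statement:
# dask reads DASK_DATAFRAME__QUERY_PLANNING into its config when
# dask.dataframe is first imported, so setting it afterwards has no effect.
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"

# import dask.dataframe as dd   # safe to import only after the line above
# import nemo_curator
```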

simplew2011 commented 2 weeks ago

dask 2024.5.1
dask-cuda 24.6.0
dask-cudf 24.6.0
dask-expr 1.1.1
dask-mpi 2022.4.0
datasets 2.19.2
datashader 0.16.2
defusedxml 0.7.1
dill 0.3.8
distributed 2024.5.1
distributed-ucxx 0.38.0