Closed simplew2011 closed 2 weeks ago
nemo_curator 0.3.0 nemo_toolkit 1.23.0
python examples/fuzzy_deduplication.py --device gpu
Torch is using RMM memory pool
<Client: 'tcp://127.0.0.1:40947' processes=8 threads=8, memory=251.80 GiB> http://127.0.0.1:8787/status
Reading 20 files
Stage1: Starting Minhash + LSH computation
Traceback (most recent call last):
File "/home/wzp/code/LLMData/open_source/NeMo-Curator/examples/fuzzy_deduplication.py", line 119, in
TypeError: 'Scalar' object does not support item assignment
Thanks for raising. Answering some of your questions inline:
Do I need the input files to be very small? I divided the 20GB files into 1GB small files each, and the program ran normally
Yes. Since we cannot split jsonl files into smaller subsets during reading, it's recommended to work with jsonl files smaller than 2 GB; anywhere between 256 MB and 1 GB is typically a good size for a single jsonl file.
nemo_curator has a make_data_shards CLI tool to help split larger jsonl files into smaller ones: https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/make_data_shards.py
File "/home/wzp/code/LLMData/open_source/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 162, in __call__
    result["_minhash_signature"] = dataset.df[self.text_field].map_partitions(
TypeError: 'Scalar' object does not support item assignment
Can you try running export DASK_DATAFRAME__QUERY_PLANNING=False before running the fuzzy_deduplication script to see if that fixes the issue?
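If setting the variable in the shell is inconvenient, the same workaround can be applied from Python, as long as it runs before dask is imported anywhere in the process (the setting is read at import time):

```python
import os

# Workaround sketch: disable dask-expr query planning before any
# dask import. The double underscore in the env var name maps to
# config nesting, i.e. the "dataframe.query-planning" dask setting.
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"

# Only after this point should dask / nemo_curator be imported, e.g.:
# import dask.dataframe as dd
# import nemo_curator
```

If dask was already imported earlier (directly or via another library), setting the variable has no effect for the current process.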
Can you also share the dask version in your environment? Importing dask before nemo_curator enables query planning, and our checks don't effectively detect that planning is enabled. I'm attempting to improve this behavior in #107, and there's some discussion about better detection in dask/dask#11175 as well.
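A simple way to catch the import-order pitfall described above is to check sys.modules before loading nemo_curator. This is a hedged diagnostic sketch, not part of the nemo_curator API:

```python
import sys

# If dask (or dask_expr) is already loaded, the query-planning setting
# was fixed at its import time, and setting the env var now is too late.
already_imported = [m for m in ("dask", "dask_expr") if m in sys.modules]
if already_imported:
    print(
        f"Warning: {already_imported} imported before "
        "DASK_DATAFRAME__QUERY_PLANNING was set; the env var may have no effect."
    )
```

Running this check at the top of a script, right after setting the environment variable, makes the failure mode explicit instead of surfacing later as the Scalar item-assignment error.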
dask 2024.5.1
dask-cuda 24.6.0
dask-cudf 24.6.0
dask-expr 1.1.1
dask-mpi 2022.4.0
datasets 2.19.2
datashader 0.16.2
defusedxml 0.7.1
dill 0.3.8
distributed 2024.5.1
distributed-ucxx 0.38.0