VibhuJawa closed 1 month ago
Does it make sense to generalize this and move it to a module similar to DistributedDataClassifier or do you feel it's better as a standalone example?
I think we can start with a stand-alone example, just to show folks how to do generation models (like translation) with NeMo-Curator.
I think a module like DistributedDataClassifier abstracts away too much logic. That abstraction is useful for the models we release, but I'm unsure if we should do the same for other models. As a first step we can always start with an example and then expand from there.
cc: @sarahyurick if you want to take a look as well.
There is a quick change I want to make before merging, so please hold off on merging.
@ayushdg / @sarahyurick, ready for review again. Made the minor change.
With the changes required for passing the translation config to CustomModel, I am getting the following warning followed by an error:
Warning
2024-06-14 02:42:03,937 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.86 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,006 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.89 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,027 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.85 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,177 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.84 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,031 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,093 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 6.71 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,110 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,129 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.88 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,251 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.92 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,359 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 6.76 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,364 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43691 (pid=1215242) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,377 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:35729 (pid=1215237) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,429 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,430 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:38207 (pid=1215234) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,513 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:42121 (pid=1215230) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,545 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 6.71 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,733 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:45741 (pid=1215246) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,816 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,864 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:40093 (pid=1215249) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,868 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,926 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,979 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:07,062 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:07,135 - distributed.nanny - WARNING - Restarting worker
Error
2024-06-14 02:42:28,219 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/core.py", line 175, in loads
return msgpack.loads(
File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/core.py", line 172, in _decode_default
return pickle.loads(sub_header["pickled-obj"], buffers=sub_frames)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 94, in loads
return pickle.loads(x, buffers=buffers)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/abc.py", line 182, in host_deserialize
obj = cls.device_deserialize(header, frames)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/abc.py", line 136, in device_deserialize
return typ.deserialize(header, frames)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/dataframe.py", line 1178, in deserialize
obj = super().deserialize(
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
result = func(*args, **kwargs)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/frame.py", line 113, in deserialize
columns = deserialize_columns(header["columns"], frames)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 2418, in deserialize_columns
colobj = col_typ.deserialize(meta, frames[:col_frame_count])
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 1209, in deserialize
data, frames = unpack(header["data"], frames)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 1197, in unpack
obj = klass.deserialize(header, frames[:count])
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/buffer.py", line 444, in deserialize
owner = owner_type._from_host_memory(frame)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/spillable_buffer.py", line 178, in _from_host_memory
ret._finalize_init(ptr_desc={"type": "cpu", "memoryview": data})
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/spillable_buffer.py", line 113, in _finalize_init
raise ValueError(
ValueError: cannot create <class 'cudf.core.buffer.spillable_buffer.SpillableBufferOwner'> without a global spill manager
System Info
nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-b65d5e9d-eeaa-f149-71e9-86895ba5d11d)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-70a9c47b-e350-a7f4-4d67-b90c3b8cf39c)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-22501371-c67b-1183-65bd-cb03bc220f3b)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-1025102a-cde3-bb5a-792d-941ef232cf23)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-ac9a8547-e8ad-de83-7fe1-0e44ce1b2375)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-c403abb7-f38b-548d-e8bf-e1f890b2309f)
Why am I getting this, and how can I resolve it?
After adding
os.environ["CUDF_SPILL"] = "on"
there is no change in the results.
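One likely reason setting the variable has no effect: cuDF reads CUDF_SPILL only once, when the library is first imported, so setting it after cudf (or dask_cudf / NeMo-Curator, which import it indirectly) is already loaded does nothing. With a distributed cluster, the variable also has to be set in the worker processes, not just the client script. A minimal sketch of the ordering (the commented alternatives are assumptions about recent cuDF versions):

```python
import os

# Must run before ANY import of cudf, including indirect imports
# via dask_cudf or nemo_curator; the variable is read exactly once
# at cudf import time.
os.environ["CUDF_SPILL"] = "on"

# import cudf  # safe: env var is already set at this point

# Alternatively (assumption: recent cuDF), spilling can be toggled
# after import with:
# cudf.set_option("spill", True)
```

If workers are launched separately (e.g. via dask-cuda), the variable needs to reach their environment as well.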
Hi @VibhuJawa, I am getting an error regarding a mismatch in output tensor sizes when I provide the following translation config:
translation_config = TranslationConfig(
pretrained_model_name_or_path=args.pretrained_model_name_or_path,
max_length=256,
num_beams=5,
autocast=args.autocast,
)
It fails with the following error:
2024-06-18 23:40:52,501 - distributed.worker - WARNING - Compute Failed
Key: ('single_partition_write_with_filename-8438c8f8f730c2b1d33a17630e343c07', 6)
Function: subgraph_callable-02f8b235-601a-4d78-be44-a70bf33d
args: ('outputs/', 'combine_text-60b511141364d07a62bf9fcf20d113b9', 'translate_tokens-fd51c23e5eefc82c0252f1a8e993a01e', '<crossfit.backend.torch.op.base.Predictor object a-cad1b684c98fcb3af6e5fe49db90bdc9', {'number': 6, 'division': None}, '<crossfit.op.tokenize.Tokenizer object at 0x1552ad-aede8748151e916cba92a7e6d2baedd2', {'number': 6, 'division': None}, 'preprocess_df-ee75cb0cd0a9eb07b9cb8e3dabef3b9e', 'process_input_text-7e46f561991596c150a5386e9b2fc247', 'read_single_partition-0103a9e25a6103736ebfd21f8468db77', ['inputs/text_ag.jsonl'])
kwargs: {}
Exception: "RuntimeError('Sizes of tensors must match except in dimension 0. Expected size 80 but got size 202 for tensor number 1 in the list.')"
Traceback (most recent call last):
File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 380, in <module>
main()
File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 322, in main
main_func(args)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 309, in main_func
write_to_disk(
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 469, in write_to_disk
output = output.compute()
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/dask/base.py", line 379, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/dask/base.py", line 665, in compute
results = schedule(dsk, keys, **kwargs)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/crossfit/crossfit/op/base.py", line 94, in __call__
output = self.call(data, *args, partition_info=partition_info, **kwargs)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/lustre/fsw/portfolios/llmservice/users/uahmed/crossfit/crossfit/backend/torch/op/base.py", line 90, in call
outputs = cp.asarray(torch.cat(all_outputs_ls, dim=0))
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 80 but got size 202 for tensor number 1 in the list.
It seems to me this error is coming from crossfit, from here.
Moreover, this type of error persists if we change max_length=20 in the TranslationConfig above, which gives:
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 20 but got size 18 for tensor number 11 in the list.
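The underlying issue is that torch.cat along dim 0 requires every other dimension to match, but generation models emit a different sequence length per batch, so the per-batch output tensors disagree in dim 1. The usual workaround (a sketch of the general fix, not crossfit's actual code; `pad_batches` and `pad_id` are hypothetical names) is to pad every batch's generated token IDs to a common length before concatenating:

```python
def pad_batches(batches, pad_id=0):
    """Pad variable-length token-id sequences so all batches share one width.

    `batches` is a list of batches; each batch is a list of token-id lists,
    standing in for the per-batch output tensors collected before torch.cat.
    """
    # Longest sequence across every batch decides the common width.
    max_len = max(len(seq) for batch in batches for seq in batch)
    return [
        [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
        for batch in batches
    ]

# Two batches whose generations have different lengths (3 vs 1-2 tokens):
batches = [[[5, 6, 7]], [[8, 9], [10]]]
padded = pad_batches(batches)
# Every sequence now has the same length, so concatenation along dim 0 works.
assert all(len(seq) == 3 for batch in padded for seq in batch)
```

With real tensors the same idea is typically done with torch.nn.functional.pad on each batch output (using the tokenizer's pad token ID) before the torch.cat call.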
cc @ayushdg
This should be fixed after https://github.com/NVIDIA/NeMo-Curator/pull/96/commits/2b7c794276918a854bdea127972cbd41ddbf94c7
Closing in favor of #189
Description
This PR adds a translation module based on Umair Ahmed's initial work. This PR adds:
Checklist
Example Command: