NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

[REVIEW] Add Translation Module Example #96

Closed VibhuJawa closed 1 month ago

VibhuJawa commented 3 months ago

Description

This PR adds a translation module example based on Umair Ahmed's initial work.

Checklist

Example Command:


python3 translation_example.py \
  --input-data-dir /raid/vjawa/subset_CC-MAIN-2023-14_english/ \
  --input-file-type jsonl \
  --output-data-dir /raid/vjawa/translation_CC-MAIN-2023-14_english \
  --output-file-type parquet \
  --autocast \
  --pretrained-model-name-or-path /raid/vjawa/indictrans2-en-indic-1B/
VibhuJawa commented 3 months ago

Does it make sense to generalize this and move it to a module similar to DistributedDataClassifier or do you feel it's better as a standalone example?

I think we can start with a stand-alone example, just to show folks how to run generation models (like translation) with NeMo-Curator.

I think a module like DistributedDataClassifier abstracts away too much logic. It is useful for the models we release, but I am unsure whether we should apply the same abstraction to other models. As a first step, we can start with an example and expand from there.
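
For reference, here is a generic seq2seq translation call with Hugging Face transformers, as an illustration of what a "generation model" step looks like rather than the code in this PR. "Helsinki-NLP/opus-mt-en-hi" is only a stand-in checkpoint; IndicTrans2 itself ships custom modeling code and typically needs trust_remote_code=True plus its own pre- and post-processing.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Stand-in checkpoint for illustration; the PR itself targets IndicTrans2.
model_name = "Helsinki-NLP/opus-mt-en-hi"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device).eval()

texts = ["Data curation improves model quality."]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)

# Beam-search generation under mixed precision, mirroring the --autocast flag
# in the example command above.
with torch.no_grad(), torch.autocast(device_type=device):
    output_ids = model.generate(**inputs, max_length=256, num_beams=5)

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))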

ayushdg commented 3 months ago

cc: @sarahyurick if you want to take a look as well.

VibhuJawa commented 3 months ago

There is a quick change I want to make before merging; please hold off on merging.

VibhuJawa commented 3 months ago

@ayushdg / @sarahyurick, ready for review again. I made the minor change.

uahmed93 commented 3 months ago

With the changes required for passing the translation config to CustomModel, I am getting the following warnings followed by an error:

Warning

2024-06-14 02:42:03,937 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.86 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,006 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.89 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,027 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.85 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:04,177 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.84 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,031 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,093 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.71 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,110 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,129 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.88 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,251 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.92 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:05,359 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.76 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,364 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43691 (pid=1215242) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,377 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:35729 (pid=1215237) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,429 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.68 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,430 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:38207 (pid=1215234) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,513 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:42121 (pid=1215230) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,545 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.71 GiB -- Worker memory limit: 8.33 GiB
2024-06-14 02:42:06,733 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:45741 (pid=1215246) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,816 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,864 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:40093 (pid=1215249) exceeded 95% memory budget. Restarting...
2024-06-14 02:42:06,868 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,926 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:06,979 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:07,062 - distributed.nanny - WARNING - Restarting worker
2024-06-14 02:42:07,135 - distributed.nanny - WARNING - Restarting worker

Error

2024-06-14 02:42:28,219 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/core.py", line 175, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/core.py", line 172, in _decode_default
    return pickle.loads(sub_header["pickled-obj"], buffers=sub_frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 94, in loads
    return pickle.loads(x, buffers=buffers)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/abc.py", line 182, in host_deserialize
    obj = cls.device_deserialize(header, frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/abc.py", line 136, in device_deserialize
    return typ.deserialize(header, frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/dataframe.py", line 1178, in deserialize
    obj = super().deserialize(
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/frame.py", line 113, in deserialize
    columns = deserialize_columns(header["columns"], frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 2418, in deserialize_columns
    colobj = col_typ.deserialize(meta, frames[:col_frame_count])
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 1209, in deserialize
    data, frames = unpack(header["data"], frames)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/column/column.py", line 1197, in unpack
    obj = klass.deserialize(header, frames[:count])
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/buffer.py", line 444, in deserialize
    owner = owner_type._from_host_memory(frame)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/spillable_buffer.py", line 178, in _from_host_memory
    ret._finalize_init(ptr_desc={"type": "cpu", "memoryview": data})
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/cudf/core/buffer/spillable_buffer.py", line 113, in _finalize_init
    raise ValueError(
ValueError: cannot create <class 'cudf.core.buffer.spillable_buffer.SpillableBufferOwner'> without a global spill manager

System Info

nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-b65d5e9d-eeaa-f149-71e9-86895ba5d11d)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-70a9c47b-e350-a7f4-4d67-b90c3b8cf39c)
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-22501371-c67b-1183-65bd-cb03bc220f3b)
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-1025102a-cde3-bb5a-792d-941ef232cf23)
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-ac9a8547-e8ad-de83-7fe1-0e44ce1b2375)
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-c403abb7-f38b-548d-e8bf-e1f890b2309f)

Why am I getting this, and how can I resolve it?

After adding

os.environ["CUDF_SPILL"] = "on"

there is no change in the results.
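
For reference, one possible explanation (an assumption, not something confirmed in this thread): cudf decides whether to create its global spill manager when it is first imported, so setting CUDF_SPILL after that import, or only on the client process, has no effect. A minimal sketch of the required ordering, using dask_cuda.LocalCUDACluster as a stand-in for however the cluster is actually started here:

import os

# Hypothesis: CUDF_SPILL is only read at import time, so it must be set
# before cudf is imported anywhere in this process.
os.environ["CUDF_SPILL"] = "on"

import cudf  # imported only after the environment variable is set

from dask_cuda import LocalCUDACluster
from distributed import Client

# Worker processes inherit the parent's environment, so creating the cluster
# after setting CUDF_SPILL propagates the setting to every worker as well.
cluster = LocalCUDACluster()
client = Client(cluster)

If the installed dask-cuda version supports it, enabling cuDF spilling directly through LocalCUDACluster is an alternative way to turn spilling on for the workers.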

uahmed93 commented 3 months ago

Hi @VibhuJawa, I am getting an error about a mismatch in output tensor sizes when I provide the following translation config:

translation_config = TranslationConfig(
    pretrained_model_name_or_path=args.pretrained_model_name_or_path,
    max_length=256,
    num_beams=5,
    autocast=args.autocast,
)

it fails with the following error:

2024-06-18 23:40:52,501 - distributed.worker - WARNING - Compute Failed
Key:       ('single_partition_write_with_filename-8438c8f8f730c2b1d33a17630e343c07', 6)
Function:  subgraph_callable-02f8b235-601a-4d78-be44-a70bf33d
args:      ('outputs/', 'combine_text-60b511141364d07a62bf9fcf20d113b9', 'translate_tokens-fd51c23e5eefc82c0252f1a8e993a01e', '<crossfit.backend.torch.op.base.Predictor object a-cad1b684c98fcb3af6e5fe49db90bdc9', {'number': 6, 'division': None}, '<crossfit.op.tokenize.Tokenizer object at 0x1552ad-aede8748151e916cba92a7e6d2baedd2', {'number': 6, 'division': None}, 'preprocess_df-ee75cb0cd0a9eb07b9cb8e3dabef3b9e', 'process_input_text-7e46f561991596c150a5386e9b2fc247', 'read_single_partition-0103a9e25a6103736ebfd21f8468db77', ['inputs/text_ag.jsonl'])
kwargs:    {}
Exception: "RuntimeError('Sizes of tensors must match except in dimension 0. Expected size 80 but got size 202 for tensor number 1 in the list.')"

Traceback (most recent call last):
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 380, in <module>
    main()
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 322, in main
    main_func(args)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/new_tr_ex.py", line 309, in main_func
    write_to_disk(
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 469, in write_to_disk
    output = output.compute()
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/dask/base.py", line 379, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/dask/base.py", line 665, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/crossfit/crossfit/op/base.py", line 94, in __call__
    output = self.call(data, *args, partition_info=partition_info, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/test_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lustre/fsw/portfolios/llmservice/users/uahmed/crossfit/crossfit/backend/torch/op/base.py", line 90, in call
    outputs = cp.asarray(torch.cat(all_outputs_ls, dim=0))
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 80 but got size 202 for tensor number 1 in the list.

It seems to me this error is coming from crossfit, from here.

Moreover, the same type of error persists if we change max_length to 20 in the TranslationConfig above; it gives:

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 20 but got size 18 for tensor number 11 in the list.

cc @ayushdg
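
For reference, torch.cat only works when every dimension other than the concatenation dimension matches, and generated token tensors vary in length from batch to batch unless each batch is padded to a common length. A minimal illustration (a hypothetical helper, not the crossfit code) of padding per-batch outputs before concatenating:

import torch
import torch.nn.functional as F

def cat_padded(batches, pad_token_id=0):
    # Right-pad every (batch, seq_len) tensor to the longest seq_len so that
    # torch.cat along dim=0 no longer sees mismatched sizes in dimension 1.
    max_len = max(b.shape[1] for b in batches)
    padded = [F.pad(b, (0, max_len - b.shape[1]), value=pad_token_id) for b in batches]
    return torch.cat(padded, dim=0)

# Two batches of generated token ids with different lengths, as in the error above.
a = torch.randint(0, 100, (4, 80))
b = torch.randint(0, 100, (4, 202))
print(cat_padded([a, b]).shape)  # torch.Size([8, 202])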

VibhuJawa commented 2 months ago

cc @ayushdg

This should be fixed after https://github.com/NVIDIA/NeMo-Curator/pull/96/commits/2b7c794276918a854bdea127972cbd41ddbf94c7

ryantwolf commented 1 month ago

Closing in favor of #189