NVIDIA / NeMo-Curator

Scalable toolkit for data curation

find_pii_and_deidentify example fails #85

Open randerzander opened 1 month ago

randerzander commented 1 month ago

I'm trying to run the PII example here.

# for gpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py --device gpu

# for cpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py

On CPU, I get memory warnings and eventual worker deaths without producing output:

2024-05-28 14:41:18,511 - distributed.nanny - WARNING - Restarting worker
2024-05-28 14:41:19 INFO:Loaded recognizer: EmailRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: PhoneRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: SpacyRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: UsSsnRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: CreditCardRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: IpRecognizer
2024-05-28 14:41:19 WARNING:model_to_presidio_entity_mapping is missing from configuration, using default
2024-05-28 14:41:19 WARNING:low_score_entity_names is missing from configuration, using default
2024-05-28 14:41:22,407 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.68 GiB -- Worker memory limit: 5.25 GiB
2024-05-28 14:41:23,165 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 4.27 GiB -- Worker memory limit: 5.25 GiB
2024-05-28 14:41:24,134 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:33953 (pid=14243) exceeded 95% memory budget. Restarting...
2024-05-28 14:41:24,471 - distributed.scheduler - ERROR - Task ('getitem-modify_document-assign-64f0e480e2b64dd94f34c05c2de0918e', 0) marked as failed because 4 workers died while trying to run it
2024-05-28 14:41:24,472 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:33953' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('frompandas-f7a591031e0ada9d2c8cba1c8468dd66', 0)} (stimulus_id='handle-worker-cleanup-1716907284.4715889')
Traceback (most recent call last):
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 52, in <module>
    console_script()
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 48, in console_script
    modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_expr/_collection.py", line 2380, in to_json
    return to_json(self, filename, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/dataframe/io/json.py", line 96, in to_json
    return list(dask_compute(*parts, **compute_kwargs))
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/client.py", line 2232, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('getitem-modify_document-assign-64f0e480e2b64dd94f34c05c2de0918e', 0) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:33953. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
2024-05-28 14:41:24,778 - distributed.nanny - WARNING - Restarting worker
2024-05-28 14:41:24,959 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.

There's a longer trace, but it's just more workers restarting before the cluster shuts down.
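
For reference, each worker here is capped at 5.25 GiB, so the PII step seems to need more memory per worker than the default local cluster provides. A minimal workaround sketch (my own, not something the example documents) would be to start a Dask cluster by hand with fewer, larger workers and attach the example to it; whether the script accepts an external scheduler address is an assumption to verify against parse_client_args:

# hypothetical workaround: fewer workers, larger per-worker memory limit
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=2,            # fewer workers so each gets more memory
    threads_per_worker=1,
    memory_limit="16GiB",   # raise the 5.25 GiB per-worker cap seen in the log
)
client = Client(cluster)
print(client.scheduler.address)  # address to hand to the example, if it supports one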

In GPU mode, it takes some time before failing with a PyTorch error:

python examples/find_pii_and_deidentify.py --device gpu
Traceback (most recent call last):
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 52, in <module>
    console_script()
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 30, in console_script
    _ = get_client(**parse_client_args(arguments))
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 150, in get_client
    return start_dask_gpu_local_cluster(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 75, in start_dask_gpu_local_cluster
    _set_torch_to_use_rmm()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 175, in _set_torch_to_use_rmm
    torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/torch/cuda/memory.py", line 905, in change_current_allocator
    torch._C._cuda_changeCurrentAllocator(allocator.allocator())
AttributeError: module 'torch._C' has no attribute '_cuda_changeCurrentAllocator'

ayushdg commented 1 month ago

For the GPU case, the error seems to indicate that torch could not change the CUDA allocator. One reason this can happen is that only the CPU flavor of Torch is installed, without GPU support. Is it possible to check whether the following works in the environment:

import torch
torch.cuda.is_available()  # should return True if a CUDA-enabled PyTorch build can see a GPU
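
If that returns False, a slightly fuller check (just a diagnostic sketch, nothing specific to NeMo-Curator) can confirm whether the installed wheel is the CPU-only flavor and whether the allocator hook from the traceback is present:

import torch

print(torch.__version__)    # a "+cpu" suffix usually indicates a CPU-only wheel
print(torch.version.cuda)   # None for CPU-only builds
print(hasattr(torch._C, "_cuda_changeCurrentAllocator"))  # the attribute the traceback reports missing

If the CUDA build is missing, reinstalling a GPU-enabled torch (2.0+) should restore torch._C._cuda_changeCurrentAllocator.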