NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

Disable PyTorch Compile Multiprocessing #34

Closed ryantwolf closed 2 months ago

ryantwolf commented 2 months ago

Addresses #31 by disabling multiprocessing for PyTorch compilation, following advice gathered from this issue. Essentially, we need to set os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1" before PyTorch is imported, because that is when this variable is initialized. I also moved the nemo import statement inside the filter so that any performance impact is minimized (I observed no speed degradation in my tests).
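
For illustration, here is a minimal sketch of the two ideas described above: setting the environment variable before PyTorch is first imported, and deferring a heavy import until the filter actually needs it. This is not the code from the PR. The names `disable_inductor_compile_workers`, `LazyImportFilter`, and `keep_document` are hypothetical, and the filter logic is a placeholder.

```python
import importlib
import os
import sys
import warnings


def disable_inductor_compile_workers():
    """Set TORCHINDUCTOR_COMPILE_THREADS=1 (hypothetical helper name)."""
    # The variable has to be set before torch is first imported, so warn
    # if torch is already present in this process.
    if "torch" in sys.modules:
        warnings.warn("torch is already imported; the setting may be ignored")
    os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "1"


# Call this at the top of the entry module, ahead of any torch/nemo import.
disable_inductor_compile_workers()


class LazyImportFilter:
    """Hypothetical filter that imports its heavy dependency on first use."""

    def __init__(self, module_name="nemo"):
        # module_name is a parameter only so the sketch can be exercised
        # without nemo installed; this is an assumption, not the PR's design.
        self.module_name = module_name
        self._module = None

    def _get_module(self):
        # The import happens here rather than at file import time, so
        # merely importing this file never pulls in the heavy package.
        if self._module is None:
            self._module = importlib.import_module(self.module_name)
        return self._module

    def keep_document(self, text):
        self._get_module()
        # Placeholder rule: keep any non-empty document.
        return len(text.strip()) > 0
```

With this layout, the import cost is paid once, on the first call to `keep_document`, and later calls reuse the cached module.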