NVIDIA / NeMo-Curator

Scalable data pre processing and curation toolkit for LLMs
Apache License 2.0

Fix nemo_curator import in CPU only environment when GPU packages are installed. #123

Closed ayushdg closed 3 months ago

ayushdg commented 3 months ago

Fixes #109

Description

#27 added support for installing a CPU-only build of nemo_curator that works in CPU environments. However, it didn't account for the case where the GPU version of curator is installed but the package is used in a CPU-only environment.

One case where this happens is when using the NeMo-FW container in CPU-only environments. This PR extends the safe_import mechanism to catch ImportError rather than only ModuleNotFoundError, which covers cases where the module is present but the import fails because no GPU is available.
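The idea can be illustrated with a minimal sketch. The names `safe_import_sketch` and `UnavailableModule` are hypothetical and are not the actual nemo_curator helpers; the sketch only assumes that a failed import is replaced by a placeholder that raises when it is used. Because ModuleNotFoundError is a subclass of ImportError, catching ImportError still covers the missing-package case.

```python
import importlib


class UnavailableModule:
    """Hypothetical placeholder returned when an optional module cannot be imported."""

    def __init__(self, name, reason):
        self._name = name
        self._reason = reason

    def __getattr__(self, attr):
        # Called only for attributes that are not found normally, so the
        # error is raised lazily, when the missing module is actually used.
        raise RuntimeError(f"{self._name}.{attr} is unavailable: {self._reason}")


def safe_import_sketch(name):
    """Return the module, or a placeholder if importing it fails.

    Catching ImportError (not just ModuleNotFoundError) also covers modules
    that are installed but fail at import time, e.g. because no GPU is present.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        return UnavailableModule(name, str(exc))


# Usage: a module that exists is returned as-is; a missing one becomes a placeholder.
json_mod = safe_import_sketch("json")
placeholder = safe_import_sketch("package_that_is_not_installed")
```

With this shape, importing the top-level package succeeds everywhere, and the failure is deferred to the first attribute access on the placeholder.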

One downside to this approach is the implicit assumption that any ImportError from a GPU package stems from a missing GPU environment or missing packages, so we may wrongly classify other import-time failures in these packages into the same bucket.
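The following sketch shows why the two situations are hard to tell apart by exception type alone. The module name `broken_gpu_pkg` and the helper `try_import` are made up for illustration; the module simply raises ImportError for a reason unrelated to GPUs.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import(name):
    """Return (module, None) on success or (None, error) on ImportError."""
    try:
        return importlib.import_module(name), None
    except ImportError as exc:
        return None, exc


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical package that is installed but fails for an unrelated reason.
    Path(tmp, "broken_gpu_pkg.py").write_text(
        "raise ImportError('unrelated bug inside the package')\n"
    )
    sys.path.insert(0, tmp)
    try:
        _, broken_err = try_import("broken_gpu_pkg")
        _, missing_err = try_import("package_that_is_not_installed")
    finally:
        sys.path.remove(tmp)

# Both failures are caught by the same `except ImportError` clause.
print(type(broken_err).__name__, "|", broken_err)
print(type(missing_err).__name__, "|", missing_err)
```

Keeping the original exception message in whatever error is eventually shown to the user is one way to make a misclassified failure easier to diagnose.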

Usage

pip install nemo-curator[cuda12x]
import nemo_curator # succeeds in a CPU-only environment

nemo_curator.FuzzyDuplicates() # raises an error explaining that a GPU environment is required

Checklist