NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

Only import PII constants during Curator import #61

Closed ayushdg closed 1 month ago

ayushdg commented 1 month ago

One approach to fixing #59. The root cause appears to be that importing spacy (which imports and calls a cupy CUDA function) somehow alters the state of the system and prevents a cluster from starting up with all available GPUs on the machine.

Currently, the only import in Curator that leads to this situation is DEFAULT_LANGUAGE from the pii modules, which transitively ends up importing presidio->spacy.

This PR moves these constants to a separate file so that we don't end up importing all the other dependencies during Curator import.
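
For illustration, here is a minimal, self-contained sketch of the pattern, not the actual diff. The package name `demo_pii`, the module `heavy_dep` (a stand-in for presidio/spacy), and the value `"en"` are all hypothetical; the sketch writes a throwaway package to a temp directory and shows that importing the constants module alone does not pull in the heavy dependency.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical layout, written to a temp dir so the sketch runs anywhere.
# "heavy_dep" stands in for presidio/spacy; none of these names are Curator's.
FILES = {
    "demo_pii/__init__.py": "",
    "demo_pii/heavy_dep.py": "INITIALIZED = True  # stand-in for an import with side effects\n",
    # Constants live in their own module with no other imports.
    "demo_pii/constants.py": 'DEFAULT_LANGUAGE = "en"  # placeholder value\n',
    # The module that needs the heavy dependency re-uses the constants.
    "demo_pii/algorithm.py": textwrap.dedent('''
        from demo_pii import heavy_dep  # the side effect happens here
        from demo_pii.constants import DEFAULT_LANGUAGE
    '''),
}


def build_demo_package() -> str:
    """Write the demo package to a fresh temp directory and return its root."""
    root = Path(tempfile.mkdtemp())
    for rel, body in FILES.items():
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(body)
    return str(root)


sys.path.insert(0, build_demo_package())

# Importing only the constants leaves the heavy dependency unloaded.
constants = importlib.import_module("demo_pii.constants")
heavy_loaded_after_constants = "demo_pii.heavy_dep" in sys.modules

# Importing the algorithm module is what triggers the heavy import.
importlib.import_module("demo_pii.algorithm")
heavy_loaded_after_algorithm = "demo_pii.heavy_dep" in sys.modules
```

The same `"<module>" in sys.modules` check can be used after a top-level package import to confirm that a heavy dependency is no longer loaded eagerly.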

VibhuJawa commented 1 month ago

Tested locally, works, please go ahead and merge.