awslabs / sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
Apache License 2.0

Cannot run a custom container using smdistributed/dataparallel unless USE_SMDEBUG is turned off #609

Open plamb-viso opened 2 years ago

plamb-viso commented 2 years ago

After countless hours of trying to get an Estimator() to run on a custom image_uri in smdistributed/dataparallel mode (it kept failing when trying to import any library that is not part of the SageMaker DLC), I finally discovered, buried in the sagemaker.huggingface.HuggingFace estimator, that its API request to SageMaker adds the env var:

"USE_SMDEBUG": "0"

I added this to my custom Docker container and suddenly everything worked. Imports from custom libraries worked with no problem.
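
For anyone who would rather set the variable from the launch script than bake it into the image, here is a minimal sketch. It assumes that passing the variable through the `environment` argument of the SageMaker Python SDK estimator has the same effect as setting it in the container, which this issue does not confirm. The helper `with_smdebug_disabled`, the `MY_FLAG` variable, and all values in `estimator_kwargs` are hypothetical placeholders.

```python
def with_smdebug_disabled(environment=None):
    """Return a copy of `environment` with USE_SMDEBUG set to "0".

    Hypothetical helper: it only builds a dict and does not call SageMaker.
    """
    env = dict(environment or {})
    env["USE_SMDEBUG"] = "0"  # the value reported as working in this issue
    return env


# Placeholder arguments for a custom-image training job (all values are assumptions).
estimator_kwargs = {
    "image_uri": "<account>.dkr.ecr.<region>.amazonaws.com/<custom-image>:latest",
    "role": "<execution-role-arn>",
    "instance_count": 2,
    "instance_type": "ml.p3.16xlarge",
    "environment": with_smdebug_disabled({"MY_FLAG": "1"}),
}

# With the SageMaker Python SDK installed, the dict could then be passed on, e.g.:
# from sagemaker.estimator import Estimator
# estimator = Estimator(**estimator_kwargs)
```

The helper copies the input dict rather than mutating it, so any existing environment settings are kept and only `USE_SMDEBUG` is added or overridden.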

Is this documented anywhere?