AI-Hypercomputer / maxtext

A simple, performant and scalable Jax LLM!
Apache License 2.0
1.47k stars 275 forks source link

move gsutil copy to condtional to avoid breakages #909

Closed kocchop closed 16 hours ago

kocchop commented 1 day ago

gsutil is not in the requirements list, hence not installed by default. The copy command is breaking for the containers which don't have it by default. Moved it under a conditonal

gobbleturk commented 1 day ago

Isn't it better to error out when a user requests profiling but the profiling fails? User can set profiler='' (default, no profiling) to avoid profiling instead of this check?

This error points to a real bug in maxtext - we should either ensure gsutil is installed or perhaps can use a different gcs API (upload_blob)

kocchop commented 1 day ago

Hi @gobbleturk the profiling works fine as is. This line just copies the generated .nsys-rep file to some directory. Without gsutil this copy fails and the run breaks. If for some reason the profiling breaks, it should break even with this PR. In sum, this PR does not ignore profiling failures, rather it avoids the breakages caused by gsutil file copy.

gobbleturk commented 1 day ago

Hi @gobbleturk the profiling works fine as is. This line just copies the generated .nsys-rep file to some directory. Without gsutil this copy fails and the run breaks. If for some reason the profiling breaks, it should break even with this PR. In sum, this PR does not ignore profiling failures, rather it avoids the breakages caused by gsutil file copy.

Don't we need to move the profile to GCS? How are we able to use the profile otherwise?

kocchop commented 1 day ago

@gobbleturk yes, on clusters using gcs, it is needed indeed. Hence, the nightly maxtext container made through this script installs gsutil etc. However, for other clusters it may not be necessary as there could be different file system mounts. Without any checks, it leads to breakages.

Another way could be to enforce the gsutil to be on the requirements list. Even with that, we might need to change the behavior for non-gcs scenarios.

gobbleturk commented 1 day ago

script

Ya this makes sense. For TPUs the profile will be sent to base_output_directory which is configurable (can be GCS or local), I think we should change the GPU path to also be configurable, as it won't always be GCS.

gobbleturk commented 1 day ago

Will approve for now to unblock nsys profiling use cases, as we work on longer term fix

Created https://github.com/AI-Hypercomputer/maxtext/issues/911 to track longer term fix