AI-Hypercomputer / maxtext

A simple, performant and scalable Jax LLM!
Apache License 2.0

Support nsys profiler upload in all cases #911

Open gobbleturk opened 1 month ago


For both jax.profiler (profiler=xplane in maxtext) and the GPU nsys profiler (profiler=nsys in maxtext), we upload the profile to the base_output_directory (source).

Typically this directory is on GCS, but it can also be local. However, for the nsys profiler we hardcode the uploader to use gsutil (source), which has two problems:

  1. Output directory may not be GCS, so gsutil is not applicable
  2. Hosts may not have gsutil installed, since gsutil is not in requirements.txt

We should modify the nsys profile upload to work in all cases.

Additional context: https://github.com/AI-Hypercomputer/maxtext/pull/909 was added as a temporary fix for problem 2. When gsutil is missing we skip the profile upload, so training can continue.
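
A rough sketch of an uploader that handles both problems might look like the following. This is not the existing maxtext code: `is_gcs_path` and `upload_profile` are hypothetical names, and the sketch assumes that a GCS destination is any path starting with `gs://` and that everything else is a local directory. It still shells out to gsutil for GCS and skips the upload when gsutil is absent; swapping in a Python GCS client would remove that dependency but is not shown here.

```python
import os
import shutil
import subprocess
from typing import Optional


def is_gcs_path(path: str) -> bool:
    """Assumption: GCS destinations are identified by the gs:// prefix."""
    return path.startswith("gs://")


def upload_profile(local_file: str, output_dir: str) -> Optional[str]:
    """Hypothetical helper: copy a profile to a local or GCS output directory.

    Returns the destination path, or None if the upload was skipped.
    """
    if not is_gcs_path(output_dir):
        # Problem 1: the output directory is local, so a plain copy is enough.
        os.makedirs(output_dir, exist_ok=True)
        return shutil.copy(local_file, output_dir)

    if shutil.which("gsutil") is None:
        # Problem 2: gsutil is not installed. Skip the upload rather than
        # crash, so training can continue.
        print(f"gsutil not found, skipping upload of {local_file}")
        return None

    dest = output_dir.rstrip("/") + "/" + os.path.basename(local_file)
    # check=True raises CalledProcessError if gsutil exits with an error.
    subprocess.run(["gsutil", "cp", local_file, dest], check=True)
    return dest
```

Keeping the local branch free of any gsutil call means a local base_output_directory works on hosts with no Google Cloud tooling installed.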