Neuron Profiler not working on basic training workload

aws-neuron / aws-neuron-sdk

Neuron Profiler not working on basic training workload #746

Open czmrand opened 9 months ago

czmrand commented 9 months ago

We are attempting to run the Neuron Profiler on an EC2 trn1 instance according to the documentation. We can run published tutorials 1 and 2, but profiling fails on our own script, trn_vit.txt (requires timm to be installed).

In 'trace' mode (torchrun run.py -p trace), the run crashes with a RESOURCE_EXHAUSTED error: trace_error.txt

In 'operator' mode (torchrun run.py -p operator), the script runs and the model trains successfully, but no profiling output is produced, even though profiling-related messages are printed: operator_output.txt

aws-rhsoln commented 9 months ago

Thank you for reporting the issue. We are trying to replicate it on our end.

aws-rhsoln commented 9 months ago

Update: We have managed to reproduce the issue. We are looking into it and should have an update once we have a fix. Thank you!

renos commented 1 month ago

@aws-rhsoln Has this been fixed? I'm seeing the same issue when trying to compile a Falcon 2 model.

aws-rhsoln commented 5 days ago

The issue here is that the script profiles a multi-worker training job in trace mode. We do not currently support multi-worker trace. It is on our roadmap, so please monitor the release docs for updates.
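
Given the limitation described above, a training script could refuse trace profiling when more than one worker is running instead of crashing later. The following is a minimal sketch, not an official workaround. It assumes the job is launched with torchrun, which exports WORLD_SIZE to each worker. The function name `can_use_trace` is hypothetical and is not part of the Neuron SDK.

```python
import os


def can_use_trace(environ=None):
    """Return True when trace profiling is plausible, i.e. only one worker is running.

    Assumption: the job is started by torchrun, which sets WORLD_SIZE for every
    worker. When the variable is absent (script launched directly), we treat the
    run as single-worker.
    """
    env = os.environ if environ is None else environ
    world_size = int(env.get("WORLD_SIZE", "1"))
    return world_size == 1


if __name__ == "__main__":
    if can_use_trace():
        print("Single worker detected: trace profiling can be attempted.")
    else:
        print("Multiple workers detected: multi-worker trace is not supported.")
```

Launching with a single worker, for example `torchrun --nproc_per_node=1 run.py -p trace`, is one way to stay within that limit. Whether this avoids the RESOURCE_EXHAUSTED crash for the script in this issue is not confirmed in the thread.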