Open czmrand opened 9 months ago
Thank you for reporting the issue. We are trying to replicate it on our end.
Update: We have managed to reproduce the issue. We are looking into it and should have an update once we have a fix. Thank you!
@aws-rhsoln has this been fixed? I'm having the same issue when trying to compile a falcon 2 model
The issue here is that the script is profiling a multi-worker training job in trace mode, and we do not currently support multi-worker trace. This is on our roadmap. Please monitor the release docs for updates.
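Until multi-worker trace is supported, a training script could fail fast with a readable message instead of hitting the crash. The following is only a minimal sketch, not part of the Neuron tooling: `check_profile_mode` and `world_size` are hypothetical helper names, and the sketch assumes the job is launched with torchrun, which exports the `WORLD_SIZE` environment variable to each worker.

```python
import os


def world_size(env=None) -> int:
    """Number of workers in the job, read from WORLD_SIZE (set by torchrun)."""
    env = os.environ if env is None else env
    return int(env.get("WORLD_SIZE", "1"))


def check_profile_mode(mode: str, env=None) -> str:
    """Hypothetical guard: reject 'trace' profiling when more than one worker runs.

    Returns the mode unchanged when it is safe to use.
    """
    if mode == "trace" and world_size(env) > 1:
        raise RuntimeError(
            "trace profiling requested with WORLD_SIZE="
            f"{world_size(env)}; multi-worker trace is not supported, "
            "relaunch with a single worker"
        )
    return mode
```

Calling `check_profile_mode(args.profile)` right after argument parsing would stop a multi-worker trace run before training starts. A single-worker launch such as `torchrun --nproc_per_node=1 run.py -p trace` would pass the guard; whether trace then succeeds for this particular script has not been verified here.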
We are attempting to run the Neuron profiler on an EC2 trn1 instance according to the documentation. We are able to run the published tutorials 1 and 2, but profiling fails when we run it on our own script, trn_vit.txt. (Requires timm installation.)
In 'trace' mode (torchrun run.py -p trace), we get a RESOURCE_EXHAUSTED crash: trace_error.txt
In 'operator' mode (torchrun run.py -p operator), the script runs and the model trains successfully, but no profiling output is produced, even though we do see profiling-related prints: operator_output.txt