awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0
556 stars, 185 forks

Persistent bug during dp-bert-large-pretrain example #403

Open Gall-oDrone opened 5 months ago

Gall-oDrone commented 5 months ago

Description

I'm unable to run the trainium-inferentia BERT pretrain example. The following error shows up during the build:

Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/main.py", line 116, in main
    run_main(get_sub_cmds(), argv)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/main.py", line 112, in run_main
    args.func(args)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/cmd_run.py", line 248, in run
    self._run(runner, args)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/cmd_run.py", line 208, in _run
    app_handle = runner.run_component(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/runner/api.py", line 186, in run_component
    return self.schedule(dryrun_info)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/runner/api.py", line 278, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/schedulers/kubernetes_scheduler.py", line 593, in schedule
    resp = self._custom_objects_api().create_namespaced_custom_object(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 231, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 354, in create_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '9ea0bf3e-2327-45ae-aefa-0965f38155ff', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '344539cc-94a8-443b-82c3-e6ffd6feb173', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'bb6f7b9d-abeb-49b1-bcda-ff2bc8c180bf', 'Date': 'Tue, 23 Jan 2024 22:43:20 GMT', 'Content-Length': '232'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validatejob.volcano.sh\" denied the request: unable to find job queue: queues.scheduling.volcano.sh \"test\" not found;","code":400}

The EKS Data blueprint was taken from https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/trainium

I re-initialized the project several times, both in Cloud9 and on my local system, with the same result each time. I also re-ran the Terraform ./install.sh script.

Versions

Reproduction Code [Required]

cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
chmod +x 2-bert-pretrain-precompile.sh
./2-bert-pretrain-precompile.sh

Workspace used: Cloud9, following this workshop: https://www.eksworkshop.com/docs/introduction/setup/your-account/

List steps in order that led up to the issue you encountered

cd data-on-eks/ai-ml/trainium/ && chmod +x install.sh
./install.sh

cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
chmod +x 1-bert-pretrain-build-image.sh
./1-bert-pretrain-build-image.sh

kubectl exec -i -t -n default aws-cli-cmd-shell -c app -- sh -c "clear; (bash || ash || sh)"

yum install tar
cd /data
aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar . --no-sign-request
chmod 744 bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
tar xvf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar

Expected behavior

The BERT pretrain model builds successfully.

Actual behavior

The same traceback as in the Description, ending with:

kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validatejob.volcano.sh\" denied the request: unable to find job queue: queues.scheduling.volcano.sh \"test\" not found;","code":400}

Terminal Output Screenshot(s)

[Four terminal screenshots, 2024-01-23, 17:06 to 17:07]

Additional context

Trainium on EKS blueprint

vara-bonthu commented 5 months ago

Thanks for raising the issue. I will try this blueprint and post an update here.

vara-bonthu commented 5 months ago

"Failure","message":"admission webhook \"validatejob.volcano.sh\" denied the request: unable to find job queue: queues.scheduling.volcano.sh \"test\" not found;","code":400}

I just noticed that the error above indicates the Volcano job queue is missing. Try running kubectl apply on the YAML manifest below, which creates the namespace and the Volcano queue, and then run the shell script (2-bert-pretrain-precompile.sh) again.

---
apiVersion: v1
kind: Namespace
metadata:
  name: test

# Volcano queue named "test", which the job submission expects
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: false
  weight: 1

We can update the blueprint if this works
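
For anyone hitting a similar denial with a different queue name, here is a minimal sketch (not part of the blueprint) of how the missing queue can be read out of the response body and turned into a matching manifest. The helper names `missing_queue` and `queue_manifest` are made up for illustration; the body string is the one quoted in this issue, and the manifest fields mirror the YAML above.

```python
import json
import re

# Response body copied from the ApiException quoted in this issue.
BODY = r'''{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validatejob.volcano.sh\" denied the request: unable to find job queue: queues.scheduling.volcano.sh \"test\" not found;","code":400}'''


def missing_queue(body: str):
    """Return the queue name from a Volcano 'not found' denial, or None."""
    message = json.loads(body).get("message", "")
    match = re.search(r'queues\.scheduling\.volcano\.sh "([^"]+)" not found', message)
    return match.group(1) if match else None


def queue_manifest(name: str, weight: int = 1) -> dict:
    """Build a Queue object with the same fields as the YAML in this comment."""
    return {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": {"reclaimable": False, "weight": weight},
    }


if __name__ == "__main__":
    name = missing_queue(BODY)
    if name is not None:
        # kubectl apply -f accepts JSON as well as YAML.
        print(json.dumps(queue_manifest(name), indent=2))
```

The printed JSON can be saved to a file and passed to kubectl apply -f. The namespace from the YAML above is not generated here and would still need to exist.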

Gall-oDrone commented 5 months ago

Hi @vara-bonthu ,

I added and applied the YAML manifest successfully. The "Failure","message":"admission webhook \"validatejob.volcano.sh\" denied the request: unable to find job queue: queues.scheduling.volcano.sh \"test\" not found;","code":400} error no longer shows up, but I'm still not able to run the bert-compile pods. The attached screenshots show my results:

[Six terminal screenshots, 2024-01-25, 12:15 to 12:16]

vara-bonthu commented 5 months ago

It seems you've made good progress. The BERT large distributed training blueprint uses Managed Node Groups, so you'll need to set the minimum and desired node counts to 2. These values can be updated in the variables.tf file. Here are the specific lines where you can make these changes:

Nodes Minimum Value
Nodes Desired Value

After making these adjustments, please run terraform apply. This will provision two trn1.32xlarge nodes. Ensure that your account has access to these instances.

Upon completion, you should observe the pending pods transitioning to the running state.
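
As a rough way to confirm that transition, the sketch below groups pods by phase from the JSON that kubectl get pods -o json produces. It is only an illustration: the sample data and pod names are invented, and `pods_by_phase` and `all_running` are hypothetical helper names, not part of the blueprint.

```python
import json

# Hypothetical sample shaped like `kubectl get pods -o json` output;
# the pod names are made up for illustration.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "example-worker-0"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "example-worker-1"}, "status": {"phase": "Pending"}},
    ]
})


def pods_by_phase(pod_list_json: str) -> dict:
    """Group pod names by their status.phase value."""
    phases = {}
    for pod in json.loads(pod_list_json).get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        phases.setdefault(phase, []).append(pod["metadata"]["name"])
    return phases


def all_running(pod_list_json: str) -> bool:
    """True only if at least one pod exists and every pod is Running."""
    phases = pods_by_phase(pod_list_json)
    return bool(phases) and set(phases) == {"Running"}


if __name__ == "__main__":
    print(pods_by_phase(SAMPLE))
    print("all running:", all_running(SAMPLE))
```

To use it against a real cluster, replace SAMPLE with the saved output of kubectl get pods -o json for whichever namespace the job runs in. Pods that stay Pending after terraform apply usually mean the second node has not joined yet.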

I've noticed some gaps in the documentation that need updating. Thanks for validating; I would appreciate a PR for these missing steps. Thank you!

vara-bonthu commented 4 months ago

@Gall-oDrone This blueprint has recently been updated with the fixes. Check out the latest PR here: https://github.com/awslabs/data-on-eks/pull/435

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days