awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

Benchmark doesn't run #968

Open akartsky opened 5 years ago

akartsky commented 5 years ago

When I submitted my toml file and checked the status, it said it was preparing for download, and nothing happened after that.

bai-bff/bin/anubis --submit /Users/kalamadi/Desktop/temp/try1.toml

bai-bff/bin/anubis --status 5b59f3aa-2fbb-4037-bd90-6bb47906986b

Status: [5b59f3aa-2fbb-4037-bd90-6bb47906986b] ✊ |860f9d7f|Submission has been successfully received... πŸ• |431d11b4|Initiating downloads... πŸ• |9150439b|Preparing s3://dataset-eks-imagenet/imagenet for download...


No error message was displayed. I checked the pod logs. Pods were not able to access the s3 bucket even though they had permission.

Copied the data over to my account and ran the test again.

Status: [9ab34bb5-5887-4639-98a2-f309a6ad8b3f] ✊ |b5a8e7b2|Submission has been successfully received... πŸ• |cd6e7f68|Initiating downloads... πŸ• |d1c196bb|Preparing s3://dataset-eks-imagenet-copy/imagenet/ for download... πŸ• |8b938915|s3://dataset-eks-imagenet-copy/imagenet/ downloaded... πŸ• |be01ca12|All downloads processed ⚑ |6ecc9cfd|Processing benchmark submission request... ⚑ |0c724cc3|Benchmark successfully created... πŸ‘€ |398eba59|Watching Kubernetes benchmark πŸ‘€ |5dc7fa75|Benchmark cancelled - watch ended πŸ‘€ |9ef5c43d|No longer watching benchmark


This is probably because it's not able to access the docker image. Will check the pod logs and update the issue.

try1.toml.txt

akartsky commented 4 years ago

Copied the docker image to my account so that there are no issues related to permissions.

Submitted the toml file, and the status (--status) is exactly the same.

(The fetcher-dispatcher logs say "Failed to get content_length for s3.Object", but a fetcher pod was created and the images were being copied.) I let it run for a while and checked the results.

a483e75eb9c2:bai_anubis kalamadi$ bai-bff/bin/anubis --results b7545772-897f-4f59-9837-918b8b623ed2

Brought to you by the cool peeps of the MXNet-Berlin Team .......... standard_init_linux.go:190: exec user process caused "exec format error"

standard_init_linux.go:190: exec user process caused "exec format error"

standard_init_linux.go:190: exec user process caused "exec format error"


Checked the logs for the pods prefixed with b-; they had nothing.


Tried changing the strategy to "single_node" in the toml file.

Status: [385da9e6-27f4-44c7-a3f7-88b8af7a17b7] ✊ |095ac5f5|Submission has been successfully received... πŸ• |24e25c9a|Initiating downloads... πŸ• |21be14c6|Preparing s3://dataset-eks-imagenet-copy2/imagenet/ for download... πŸ• |e8f32618|s3://dataset-eks-imagenet-copy2/imagenet/ downloaded... πŸ• |8225e52c|All downloads processed ⚑ |2c691261|Processing benchmark submission request... ⚑ |ec3b531a|Benchmark successfully created... πŸ‘€ |d1ea8e45|Watching Kubernetes benchmark πŸ‘€ |4a4e52a3|Benchmark pending pod initialization πŸ‘€ |d48124a5|Benchmark pending pod initialization πŸ‘€ |4c435ae7|Benchmark running πŸ‘€ |feb966db|Metrics available for job 385da9e6-27f4-44c7-a3f7-88b8af7a17b7 at http://a3158c58ce49111e9a3070e9375755cf-1426865569.us-east-1.elb.amazonaws.com/d/IpQu-SNWk/?orgId=1&from=1572558866892&to=1572562466892&var-datasource=Prometheus&var-cluster=&var-namespace=default&var-client_id=0d5b4c4320780630b17518d58ffef0e0a236fbf3&var-action_id=385da9e6-27f4-44c7-a3f7-88b8af7a17b7 πŸ‘€ |f1af5403|Benchmark running πŸ‘€ |e925bc20|Benchmark running πŸ‘€ |28b79664|Benchmark running πŸ‘€ |96da0290|Benchmark running πŸ‘€ |a64be70d|Benchmark running πŸ‘€ |83e4a80c|Benchmark running πŸ‘€ |2f314a94|Benchmark running πŸ‘€ |361835cd|Benchmark running .πŸ‘€ |734a81da|Benchmark running πŸ‘€ |907c1196|Benchmark running ..πŸ‘€ |7bd1690b|Benchmark failed for an unknown reason πŸ‘€ |ceef40f7|No longer watching benchmark


bai-bff/bin/anubis --results 385da9e6-27f4-44c7-a3f7-88b8af7a17b7

Brought to you by the cool peeps of the MXNet-Berlin Team ..........


Let me know if you need logs from any specific pod.

(I have attached the yaml file that I'm trying to convert to toml, along with the latest toml file.) convert.yaml.txt try1.toml.txt

perdasilva commented 4 years ago

Hi,

I'm not sure I get the full picture, but with Horovod jobs you need to specify the benchmark_code as a script. For instance:

benchmark_code="""
#!/bin/bash
/do/some/stuff.sh
"""

So, if I understood correctly and you want to convert that k8s yaml into a toml, your benchmark_code should look like this:

benchmark_code="""
#!/bin/bash

echo "ROLE" $ROLE

cp /etc/mpi/hostfile /root/hosts
cp /etc/mpi/hostfile /hosts
cp /etc/mpi/hostfile deep-learning-models/models/resnet/tensorflow/hosts

chmod +x deep-learning-models/models/resnet/tensorflow/train.sh
cd deep-learning-models/models/resnet/tensorflow/
./train.sh 32
"""
perdasilva commented 4 years ago

Another thing to keep in mind is that the watcher does not yet properly support Horovod jobs. I will submit a quick PR to make this more transparent. For now, you should monitor your job via kubectl. (The initial stages of submit, i.e. the fetcher, should be fine though.)

perdasilva commented 4 years ago

Here it is: https://github.com/awslabs/benchmark-ai/pull/974

akartsky commented 4 years ago

Thanks, got it. I was checking the watcher to see if the data was getting copied properly. I made the changes that you suggested.

TRY 1

2bf7195e-b662-4cce-a37b-a8212bae6905 submitted with "./train.sh 32"

Error:

There are not enough slots available in the system to satisfy the 32 slots that were requested by the application: python

Either request fewer slots for your application, or make more slots available for use.

TRY 2

Reduced the slots to 2. 6ffe210e-d1c7-4bee-97b0-04e442d49873 submitted with "./train.sh 2"

Error: Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

TRY 3

Changed the following: instance_type = "p3.16xlarge", num_instances = 4

f414a7b7-63b0-43d4-bfbc-c989250c775f submitted with "./train.sh 8"

Error: "No such file or directory" and some other errors

log.txt try1.toml.txt

perdasilva commented 4 years ago

I think there are a couple of issues here. Firstly, you need to tweak the processes_per_instance option in the [hardware.distributed] section. I think the default is "1", which is why it is failing: the number of MPI slots available is roughly num_instances times processes_per_instance, so with one process per instance there are far fewer slots than the job requests. Setting it to "gpus" (or "8") should avoid the slots issue. This isn't your fault; it seems we have forgotten to add this setting to the documentation. I'm sorry about this.

So, the [hardware.distributed] section would look like:

[hardware.distributed]
num_instances = 4
processes_per_instance = "gpus" # or "8"

The second thing is that the error is "/root/data/tf-imagenet; No such file or directory", but the path where you are mounting your data is /data:

[[data.sources]]
# Data download URI.
src = "s3://dataset-eks-imagenet-copy2/imagenet/"
# Path where the dataset is stored in the container FS
path = "/data"

I assume this is the issue... maybe change the mount point to /root/data, or update the call to the script to somehow point to /data?
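
Putting the two suggestions together, the relevant descriptor sections would look roughly like this (a sketch only; the section layout follows the snippets above, the mount path is the one proposed here, and the rest of try1.toml stays unchanged):

[hardware.distributed]
num_instances = 4
# "gpus" means one MPI process per GPU on the instance (equivalent to "8" on a p3.16xlarge)
processes_per_instance = "gpus"

[[data.sources]]
# Data download URI.
src = "s3://dataset-eks-imagenet-copy2/imagenet/"
# Mount the dataset where the training script expects to find it
path = "/root/data"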

gavinmbell commented 4 years ago

Updated the exec/README documentation with a processes_per_instance entry: pull request #975

akartsky commented 4 years ago

👍🏻

Added processes_per_instance = "8" and changed path to "root/data"; got the "No such file or directory" error again.

Changed path to "/root/data/tf-imagenet/"; got an "index out of range" error.

The files weren't getting copied. Anubis copies the files the first time, and then, when the process fails, it does not download the files from the same bucket again; I have to create a new bucket. I had already copied the bucket twice and was on "dataset-eks-imagenet-copy2", so before creating a new bucket I thought of changing the "src" path to see if that works.

Changed src from src = "s3://dataset-eks-imagenet-copy2/imagenet/" to src = "s3://dataset-eks-imagenet-copy2/imagenet", and also set path = "/root/data/tf-imagenet".

And it worked! (I guess Anubis doesn't copy from the same path if the process fails. Attaching the fetcher logs.)
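
For reference, assembling the change described above, the working [[data.sources]] entry would look roughly like:

[[data.sources]]
# No trailing slash on the src URI this time
src = "s3://dataset-eks-imagenet-copy2/imagenet"
# Matches the /root/data/tf-imagenet location train.sh was failing to find earlier
path = "/root/data/tf-imagenet"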


a483e75eb9c2:bai_anubis kalamadi$ kubectl --kubeconfig=baictl/drivers/aws/cluster/.terraform/bai/kubeconfig logs b-14deac24-ab20-43d9-9633-7188a1f155a7-launcher-6xst2 | tail -10

4000 6.4 6028.4 2.845 3.556 0.51236
4050 6.5 6336.3 2.580 3.291 0.51875
4100 6.6 6328.3 2.920 3.632 0.52514
4150 6.6 6347.9 2.632 3.345 0.53154
4200 6.7 6340.3 2.682 3.396 0.53793
4250 6.8 6354.3 2.944 3.660 0.54432
4300 6.9 6328.5 2.572 3.289 0.55071
4350 7.0 6305.4 2.177 2.895 0.55710
4400 7.0 6331.3 2.602 3.322 0.56350
4450 7.1 6341.0 2.728 3.451 0.56989

(letting it run now)

perdasilva commented 4 years ago

Oh, that's a good catch!

> I guess Anubis doesn't copy from the same path if the process fails. Attaching the fetcher logs

Let me see if I can root-cause this...

akartsky commented 4 years ago

👍🏻 (Oops, forgot to attach the file... here it is) log.txt

akartsky commented 4 years ago

Hey, those numbers are not displayed in --results.

How do I see the output?

results_log.txt