awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

[Executor] script mode not working #1016

Open surajkota opened 4 years ago

surajkota commented 4 years ago

TOML file used: ec2_tf_cpu_single_node_synthetic.toml.txt

The client keeps polling for the status of the benchmark and never receives the error message, so I am marking this as related to #1001 and #996.

Error in executor:

Traceback (most recent call last):
  File "/opt/env/lib/python3.7/site-packages/bai_kafka_utils/kafka_service.py", line 157, in safe_handle_msg
    self.handle_event(msg.value, callback)
  File "/opt/env/lib/python3.7/site-packages/bai_kafka_utils/kafka_service.py", line 175, in handle_event
    callback.handle_event(event, self)
  File "/opt/env/lib/python3.7/site-packages/bai_kafka_utils/executors/execution_callback.py", line 86, in handle_event
    job = engine.run(event)
  File "/opt/env/lib/python3.7/site-packages/executor-0.0.0-py3.7.egg/executor/k8s_execution_engine.py", line 51, in run
    descriptor_contents, self.config, fetched_data_sources, fetched_models, scripts, job_id, event=event
  File "/opt/env/lib/python3.7/site-packages/executor-0.0.0-py3.7.egg/transpiler/bai_knowledge.py", line 658, in create_job_yaml_spec
    descriptor = BenchmarkDescriptor.from_dict(descriptor_contents, executor_config.descriptor_config)
  File "/opt/env/lib/python3.7/site-packages/bai_kafka_utils/executors/descriptor.py", line 243, in from_dict
    strict=True,
  File "/opt/env/lib/python3.7/site-packages/dacite/core.py", line 59, in from_dict
    value = _build_value(type_=field.type, data=transformed_value, config=config)
  File "/opt/env/lib/python3.7/site-packages/dacite/core.py", line 82, in _build_value
    return _build_value_for_union(union=type_, data=data, config=config)
  File "/opt/env/lib/python3.7/site-packages/dacite/core.py", line 93, in _build_value_for_union
    return _build_value(type_=types[0], data=data, config=config)
  File "/opt/env/lib/python3.7/site-packages/dacite/core.py", line 86, in _build_value
    return from_dict(data_class=type_, data=data, config=config)
  File "/opt/env/lib/python3.7/site-packages/dacite/core.py", line 49, in from_dict
    raise UnexpectedDataError(keys=extra_fields)
dacite.exceptions.UnexpectedDataError: can not match "script" to any data class field

Recording more of the log in case it is needed:

2020-01-28 01:51:18,126 INFO: Got event FetcherBenchmarkEvent(action_id='889808e0-60f6-4f50-8fa4-f07aa903475b', parent_action_id=None, message_id='51e66eca-2e52-4f85-9733-517fd6c2e90f', client_id='10ee3cf3d2e7a7944b385579fe8addeec35f90f1', client_version='0.6.2', client_username='surakota', authenticated=False, tstamp=1580176278091, visited=[VisitedService(svc='anubis-client', tstamp=1580176275000, version='0.6.2', node=None), VisitedService(svc='bai-bff', tstamp=1580176278046, version='0.6.2', node='bai-bff-5679dc9c77-kpjmp'), VisitedService(svc='fetcher-dispatcher', tstamp=1580176278091, version='1.0', node='fetcher-dispatcher-67c6bfbb8c-f94tt')], type='BAI_APP_FETCHER', payload=FetcherPayload(toml=BenchmarkDoc(contents={'output': {'metrics': [{'name': 'throughput', 'pattern': 'images/sec: (\\d*.\\d+|\\d+)', 'units': 'img/sec'}]}, 'spec_version': '0.1.0', 'env': {'docker_image': '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.0-cpu-py36-ubuntu18.04'}, 'info': {'task_name': 'tensorflow_1.15_job', 'description': ' Testing TensorFlow 1.15 job ', 'labels': {'task-name': 'tf_cpu_c5_18xlarge_single_node'}}, 'hardware': {'aws_zone_id': 'use1-az2', 'strategy': 'single_node', 'instance_type': 'c5.18xlarge'}, 'ml': {'args': '--batch_size=256 --model=resnet50_v1.5 --train_dir=$HOME/test00 --device=cpu --data_format=NHWC --num_inter_threads=0 --num_intra_threads=36 --mkl=True --kmp_blocktime=0', 'script': {'script': '623453a8c317cbd92b9cb1d2d2d6dbe4f0c9298c.tar'}, 'benchmark_code': 'python $(BAI_SCRIPTS_PATH)/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py'}}, 
doc='IyBCZW5jaG1hcmtBSSBtZXRhCnNwZWNfdmVyc2lvbiA9ICIwLjEuMCIKCiMgVGhlc2UgZmllbGRzIGRvbid0IGhhdmUgYW55IGltcGFjdCBvbiB0aGUgam9iIHRvIHJ1biwgdGhleSBjb250YWluCiMgbWVyZWx5IGluZm9ybWF0aXZlIGRhdGEgc28gdGhlIGJlbmNobWFyayBjYW4gYmUgY2F0ZWdvcml6ZWQgd2hlbiBkaXNwbGF5ZWQKIyBpbiB0aGUgZGFzaGJvYXJkLgpbaW5mb10KdGFza19uYW1lID0gInRlbnNvcmZsb3dfMS4xNV9qb2IiCmRlc2NyaXB0aW9uID0gIiIiIFwKICAgIFRlc3RpbmcgVGVuc29yRmxvdyAxLjE1IGpvYiBcCiAgICAiIiIKW2luZm8ubGFiZWxzXQojIExhYmVscyBhbmQgdmFsdWVzIG11c3QgYmUgNjMgY2hhcmFjdGVycyBvciBsZXNzLCBiZWdpbm5pbmcgYW5kIGVuZGluZyB3aXRoIGFuIGFscGhhbnVtZXJpYyBjaGFyYWN0ZXIKIyAoW2EtejAtOUEtWl0pIHdpdGggZGFzaGVzICgtKSwgdW5kZXJzY29yZXMgKF8pLCBkb3RzICguKSwgYW5kIGFscGhhbnVtZXJpY3MgYmV0d2Vlbgp0YXNrLW5hbWUgPSAidGZfY3B1X2M1XzE4eGxhcmdlX3NpbmdsZV9ub2RlIgoKIyAxLiBIYXJkd2FyZQpbaGFyZHdhcmVdCmluc3RhbmNlX3R5cGUgPSAiYzUuMTh4bGFyZ2UiCnN0cmF0ZWd5ID0gInNpbmdsZV9ub2RlIgphd3Nfem9uZV9pZD0idXNlMS1hejIiCgoKIyAyLiBFbnZpcm9ubWVudApbZW52XQpkb2NrZXJfaW1hZ2UgPSAiNzYzMTA0MzUxODg0LmRrci5lY3IudXMtZWFzdC0xLmFtYXpvbmF3cy5jb20vdGVuc29yZmxvdy10cmFpbmluZzoxLjE1LjAtY3B1LXB5MzYtdWJ1bnR1MTguMDQiCgojIDMuIE1hY2hpbmUgbGVhcm5pbmcgcmVsYXRlZCBzZXR0aW5nczogCiMgZGF0YXNldCwgYmVuY2htYXJrIGNvZGUgYW5kIHBhcmFtZXRlcnMgaXQgdGFrZXMKW21sXQoKYmVuY2htYXJrX2NvZGUgPSAicHl0aG9uICQoQkFJX1NDUklQVFNfUEFUSCkvYmVuY2htYXJrcy9zY3JpcHRzL3RmX2Nubl9iZW5jaG1hcmtzL3RmX2Nubl9iZW5jaG1hcmtzLnB5IgoKYXJncyA9ICItLWJhdGNoX3NpemU9MjU2IC0tbW9kZWw9cmVzbmV0NTBfdjEuNSAtLXRyYWluX2Rpcj0kSE9NRS90ZXN0MDAgLS1kZXZpY2U9Y3B1IC0tZGF0YV9mb3JtYXQ9TkhXQyAtLW51bV9pbnRlcl90aHJlYWRzPTAgLS1udW1faW50cmFfdGhyZWFkcz0zNiAtLW1rbD1UcnVlIC0ta21wX2Jsb2NrdGltZT0wIgoKIyA1LiBPdXRwdXQKW291dHB1dF0KIyBbT3B0XSBDdXN0b20gbWV0cmljcyBkZXNjcmlwdGlvbnMKIyBMaXN0IGFsbCByZXF1aXJlZCBtZXRyaWNzIGRlc2NyaXB0aW9ucyBiZWxvdy4KIyBNYWtlIGFuIGVudHJ5IGluIHNhbWUgZm9ybWF0IGFzIHRoZSBvbmUgYmVsb3cuCltbb3V0cHV0Lm1ldHJpY3NdXQojIE5hbWUgb2YgdGhlIG1ldHJpYyB0aGF0IHdpbGwgYXBwZWFyIGluIHRoZSBkYXNoYm9hcmRzLgpuYW1lID0gInRocm91Z2hwdXQiCiMgTWV0cmljIHVuaXQgKHJlcXVpcmVkKQp1bml0cyA9ICJpbWcvc2VjIgojIFBhdHRlcm4gZm9yIGxvZyBwYXJ
zaW5nIGZvciB0aGlzIG1ldHJpYy4KcGF0dGVybiA9ICJpbWFnZXNcXC9zZWM6IChcXGQqLlxcZCt8XFxkKykiCgojIyMgLS0tIGJlZ2lubmluZyBvZiBhbnViaXMgZ2VuZXJhdGVkIGVudHJ5IC0tLSAjIyMKW21sLnNjcmlwdF0Kc2NyaXB0ID0gIjYyMzQ1M2E4YzMxN2NiZDkyYjljYjFkMmQyZDZkYmU0ZjBjOTI5OGMudGFyIgojIyMgLS0tIGVuZGluZyBvZiBhbnViaXMgZ2VuZXJhdGVkIGVudHJ5IC0tLSAjIyMKCg==', sha1='036a07d89c221aeac95e7a796348618489430807', descriptor_filename='ec2_tf_cpu_single_node_synthetic.toml', verified=True), datasets=[], models=[], scripts=[FileSystemObject(dst='s3://scripts-exchange-f629ee3b648e36be/anubis0/623453a8c317cbd92b9cb1d2d2d6dbe4f0c9298c.tar')]))

Client behavior:

(base) 8c8590431d24:benchmark-ai surakota$ bai-bff/bin/anubis --submit ~/Downloads/ec2_tf_cpu_single_node_synthetic.toml --watch --status --script ~/repo/benchmarks

                       _      _
                      | |    (_)
   __ _  _ __   _   _ | |__   _  ___
  / _  ||  _ \ | | | ||  _ \ | |/ __|
 | (_| || | | || |_| || |_) || |\__ \
  \__,_||_| |_| \__,_||_.__/ |_||___/ ♎

(v0.6.2)
-------------------------
AWS: Benchmark AI Client
-------------------------

Brought to you by the cool peeps of the  MXNet-Berlin Team
..........
 👀 
Using Script: 623453a8c317cbd92b9cb1d2d2d6dbe4f0c9298c.tar
♎ |00000000|sending /Users/surakota/Downloads/ec2_tf_cpu_single_node_synthetic.toml and 623453a8c317cbd92b9cb1d2d2d6dbe4f0c9298c.tar to anubis service @ addc3da76c48b11e992120e4a79f97f9-775518514.us-east-1.elb.amazonaws.com
Status: [889808e0-60f6-4f50-8fa4-f07aa903475b]
.✊ |33cef571|Submission has been successfully received...
🐕 |d846d78d|Nothing to fetch
⚡ |13979835|Processing benchmark submission request...
..............................

perdasilva commented 4 years ago

Right, the dataclasses in the descriptor.py referenced in the stack trace need to be updated to include the [ml.script] section. We should follow that up with any changes needed in the executor's transpiler, so that the script is set correctly in the k8s resource it produces.

We should also add test coverage for this, since it is a regression: it was introduced when we refactored the descriptor into dataclasses.
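
A minimal sketch of the fix and regression check described above, under stated assumptions. It does not use the real classes from descriptor.py or dacite itself: `MLDescriptorBefore`, `MLDescriptor`, `ScriptDescriptor`, `UnexpectedKeysError`, and `strict_from_dict` are hypothetical stand-ins, and the loader only imitates the strict behavior seen in the traceback (rejecting keys that match no dataclass field). The `[ml]` payload is copied from the logged event.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Optional


class UnexpectedKeysError(Exception):
    """Hypothetical stand-in for the strict-mode error in the traceback."""


@dataclass
class MLDescriptorBefore:  # hypothetical: the [ml] section without a script field
    benchmark_code: str
    args: str = ""


@dataclass
class ScriptDescriptor:  # hypothetical: models the [ml.script] table
    script: str


@dataclass
class MLDescriptor:  # hypothetical: the [ml] section with the field added
    benchmark_code: str
    args: str = ""
    script: Optional[ScriptDescriptor] = None


def _dataclass_in(tp):
    """Return tp, or the dataclass wrapped in Optional[...], if there is one."""
    for candidate in getattr(tp, "__args__", None) or (tp,):
        if is_dataclass(candidate):
            return candidate
    return None


def strict_from_dict(cls, data):
    """Build cls from a dict, rejecting keys that match no dataclass field."""
    by_name = {f.name: f for f in fields(cls)}
    extra = sorted(set(data) - set(by_name))
    if extra:
        raise UnexpectedKeysError(f"can not match {extra} to any field of {cls.__name__}")
    kwargs = {}
    for name, value in data.items():
        nested = _dataclass_in(by_name[name].type)
        if nested is not None and isinstance(value, dict):
            value = strict_from_dict(nested, value)  # recurse into nested tables
        kwargs[name] = value
    return cls(**kwargs)


# The [ml] section as it appears in the logged FetcherBenchmarkEvent.
ML_SECTION = {
    "args": "--batch_size=256 --model=resnet50_v1.5 --train_dir=$HOME/test00 "
            "--device=cpu --data_format=NHWC --num_inter_threads=0 "
            "--num_intra_threads=36 --mkl=True --kmp_blocktime=0",
    "script": {"script": "623453a8c317cbd92b9cb1d2d2d6dbe4f0c9298c.tar"},
    "benchmark_code": "python $(BAI_SCRIPTS_PATH)/benchmarks/scripts/"
                      "tf_cnn_benchmarks/tf_cnn_benchmarks.py",
}


def test_ml_section_accepts_script():
    """Regression check: the script table must load under strict parsing."""
    ml = strict_from_dict(MLDescriptor, ML_SECTION)
    assert ml.script == ScriptDescriptor("623453a8c317cbd92b9cb1d2d2d6dbe4f0c9298c.tar")


def test_old_shape_rejects_script():
    """Documents the failure mode: without the field, strict parsing raises."""
    try:
        strict_from_dict(MLDescriptorBefore, ML_SECTION)
    except UnexpectedKeysError:
        return
    raise AssertionError("expected the script key to be rejected")


if __name__ == "__main__":
    test_ml_section_accepts_script()
    test_old_shape_rejects_script()
```

Making the field optional keeps descriptors submitted without --script valid. Whether the real dataclass should model the section this way is a design decision for descriptor.py, and the transpiler change is not covered here.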