anyscale / academy

Ray tutorials from Anyscale
https://anyscale.com
Apache License 2.0

TuneError: ('Trials did not complete', [contrib_LinUCB_SimpleContextualBandit_65d05_00000]) #14

Closed mathematicsofpaul closed 4 years ago

mathematicsofpaul commented 4 years ago

Hey there,

I'm having issues with Cell 9 of the 03-Simple-Multi-Armed-Bandit notebook. Everything else runs fine up to that point. In particular, I have Ray initialized correctly, and the following

ray.init(address='auto', ignore_reinit_error=True)

generates:

{'node_ip_address': '192.168.1.105', 'raylet_ip_address': '192.168.1.105', 'redis_address': '192.168.1.105:6379', 'object_store_address': '/tmp/ray/session_2020-07-18_20-05-29_369775_4270/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2020-07-18_20-05-29_369775_4270/sockets/raylet', 'webui_url': 'localhost:8265', 'session_dir': '/tmp/ray/session_2020-07-18_20-05-29_369775_4270'}

The error occurs as follows:

start_time = time.time()

analysis = ray.tune.run("contrib/LinUCB", config=config, stop=stop,
    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
    verbose=2,                                                   # Change to 0 or 1 to reduce the output.
    ray_auto_init=False,                                         # Don't allow Tune to initialize Ray.
)

Generates:

== Status ==
Memory usage on this node: 7.5/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/2 GPUs, 0.0/8.06 GiB heap, 0.0/2.78 GiB objects
Result logdir: /home/paul/ray_results/contrib/LinUCB
Number of trials: 1 (1 ERROR)

| Trial name | status | loc |
|---|---|---|
| contrib_LinUCB_SimpleContextualBandit_65d05_00000 | ERROR | |

Number of errored trials: 1

| Trial name | # failures | error file |
|---|---|---|
| contrib_LinUCB_SimpleContextualBandit_65d05_00000 | 1 | /home/paul/ray_results/contrib/LinUCB/contrib_LinUCB_SimpleContextualBandit_0_2020-07-19_00-46-00_376v7hp/error.txt |

TuneError                                 Traceback (most recent call last)
 in
      4     progress_reporter=JupyterNotebookReporter(overwrite=False), # This is the default, actually.
      5     verbose=2, # Change to 0 or 1 to reduce the output.
----> 6     ray_auto_init=False, # Don't allow Tune to initialize Ray.
      7 )

~/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/tune.py in run(run_or_experiment, name, stop, config, resources_per_trial, num_samples, local_dir, upload_dir, trial_name_creator, loggers, sync_to_cloud, sync_to_driver, checkpoint_freq, checkpoint_at_end, sync_on_checkpoint, keep_checkpoints_num, checkpoint_score_attr, global_checkpoint_period, export_formats, max_failures, fail_fast, restore, search_alg, scheduler, with_server, server_port, verbose, progress_reporter, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, return_trials, ray_auto_init)
    347     if incomplete_trials:
    348         if raise_on_failed_trial:
--> 349             raise TuneError("Trials did not complete", incomplete_trials)
    350         else:
    351             logger.error("Trials did not complete: %s", incomplete_trials)

TuneError: ('Trials did not complete', [contrib_LinUCB_SimpleContextualBandit_65d05_00000])

deanwampler commented 4 years ago

Thanks for posting this. I'll investigate.

deanwampler commented 4 years ago

I haven't been able to reproduce this. I've made some refinements lately to these notebooks, as well as created new material for Tune and Serve. It's possible I fixed the issue by "accident". Could you try the latest on master or this release, which I just produced: https://github.com/anyscale/academy/releases/tag/v150 ? If it still causes problems, send me as much of the output as you can, especially anything that looks like a warning or error.

Thanks.
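
For anyone gathering the same kind of output: the status report earlier in this thread points at a per-trial error.txt file, which holds the stack trace for the failed trial. Below is a minimal sketch, not part of the tutorial, for collecting every error.txt under a results directory. The helper name `collect_trial_errors` is hypothetical, and the directory is assumed from the "Result logdir" line in the status output.

```python
from pathlib import Path


def collect_trial_errors(results_dir):
    """Return {path: contents} for every error.txt found under results_dir.

    Hypothetical helper: it only walks the directory tree and reads files,
    it does not use any Ray or Tune API.
    """
    root = Path(results_dir).expanduser()
    if not root.is_dir():
        return {}
    errors = {}
    for error_file in sorted(root.rglob("error.txt")):
        # Replace undecodable bytes rather than failing on odd log content.
        errors[str(error_file)] = error_file.read_text(errors="replace")
    return errors


if __name__ == "__main__":
    # Assumed location, taken from the "Result logdir" line in the status output.
    for path, text in collect_trial_errors("~/ray_results/contrib/LinUCB").items():
        print(f"==> {path}")
        print(text)
```

Pasting the printed text into the issue is usually easier to read than a screenshot.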

mathematicsofpaul commented 4 years ago

Thank you for getting back to me, Dean! Here is a link to all the outputs that I could think of:

https://drive.google.com/drive/folders/1nxVtQrhI0aYv6XoU-FzxAzBJ7VEJ0LDS?usp=sharing

It consists of screenshots, the notebook with the outputs I am seeing on my end, the markdown version, and more. The notebook with the most information is Notebook 3; the issue arises when I execute Cell 9. Outputs from Notebook 6 are also included, though in less detail, since the errors are quite similar.

mathematicsofpaul commented 4 years ago

@deanwampler I created a new conda env with the new .yml files you uploaded, and the code works fine now! Strange. I still suspect this bug will come back to haunt me, so I'm still very curious as to what the issue was!

deanwampler commented 4 years ago

I looked at the error.txt file under the "Notebook 3" folder. The stack trace there looks like a library mismatch to me. I'm glad you got it working. I'm going to close this issue now, but feel free to reopen if you see it again. Thanks!
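
Since the diagnosis here was a library mismatch that went away with a fresh conda env, one way to check for it is to print the installed versions and compare them with the environment .yml file. The following is a minimal sketch under stated assumptions: the helper name `installed_versions` is hypothetical, and the package list is only a guess at what matters for these notebooks, since the thread does not say which library was mismatched. `importlib.metadata` is in the standard library from Python 3.8; on Python 3.7, as used in this thread, the `importlib_metadata` backport provides the same calls.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package list for illustration; adjust it to match the .yml file.
    for name, version in installed_versions(["ray", "tensorflow", "torch", "gym"]).items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

Running this in both the broken and the working environment and comparing the two outputs should show which package changed.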