Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

"Failed to flush task queue within 300 seconds" #1189

Closed: chengyu-liu-cs closed this issue 4 years ago

chengyu-liu-cs commented 4 years ago

I am getting "Failed to flush task queue within 300 seconds" when I execute a Run. It seems the files generated inside "outputs" are very large (about 400+ MB in total) and the error is related to the flush(timeout_seconds=300) function, but it is not clear when and how to use it. Clearer documentation or examples would be helpful.

I have put run.flush(timeout_seconds=600) on different lines but still get the same error. An example is below:

```python
run = Run.get_submitted_run()
learning_pipeline(X_train, X_test, y_train, y_test)  # doing training and dumping models to the outputs folder
run.flush(timeout_seconds=600)
run.wait_for_completion(show_output=True, wait_post_processing=True)
```

Logs:

```
[2020-10-15T11:12:05.280657] The experiment failed. Finalizing run...
[2020-10-15T11:12:05.281220] Start FinalizingInRunHistory
Logging experiment finalizing status in history service.
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 2.9097180366516113 seconds
Logging warning in history service: ERROR:: RunHistoryFinalization failed. Exception Details:
Traceback (most recent call last):
  File "azureml-setup/context_manager_injector.py", line 166, in execute_with_context
    runpy.run_path(sys.argv[0], globals(), run_name="__main__")
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "train.py", line 136, in <module>
    run.wait_for_completion(show_output=True, wait_post_processing=True)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/core/run.py", line 689, in wait_for_completion
    raise_on_error=raise_on_error)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/core/run.py", line 880, in _stream_run_output
    available_logs = self._get_logs(self._current_details)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/core/run.py", line 816, in _get_logs
    logs = [x for x in status["logFiles"] if re.match(self._output_logs_pattern, x)]
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/core/run.py", line 816, in <listcomp>
    logs = [x for x in status["logFiles"] if re.match(self._output_logs_pattern, x)]
AttributeError: '_SubmittedRun' object has no attribute '_output_logs_pattern'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/azureml-run/azureml-setup/context_managers.py", line 284, in __exit__
    self.history_context.__exit__(*args)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 56, in __exit__
    return self._exit_stack.__exit__(*args)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/contextlib.py", line 380, in __exit__
    raise exc_details[1]
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/contextlib.py", line 365, in __exit__
    if cb(*exc_details):
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/contextlib.py", line 284, in _exit_wrapper
    return cm_exit(cm, *exc_details)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 149, in __exit__
    self.py_wd.track(self.run_tracker, self.trackfolders, self.deny_list)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 95, in track
    fs.track(run_tracker, track_folders, blacklist)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/_history/utils/filesystem.py", line 81, in track
    raise AzureMLException("\n".join(exception_messages))
azureml._common.exceptions.AzureMLException: AzureMLException:
    Message: Failed to flush task queue within 300 seconds
    InnerException None
    ErrorResponse
{
    "error": {
        "message": "Failed to flush task queue within 300 seconds"
    }
}
```



chengyu-liu-cs commented 4 years ago

Here are some updates. When I ran the same code from an AML compute target, I got a similar but not identical error message.

```
[2020-10-15T12:11:57.322103] The experiment failed. Finalizing run...
[2020-10-15T12:11:57.322179] FinalizingInRunHistory is not called
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.21553349494934082 seconds
Traceback (most recent call last):
  File "train.py", line 136, in <module>
    run.wait_for_completion(show_output=True, wait_post_processing=True)
  File "/azureml-envs/azureml_0fde3903c1ef012624b70630cfc21a12/lib/python3.6/site-packages/azureml/core/run.py", line 688, in wait_for_completion
    raise_on_error=raise_on_error)
  File "/azureml-envs/azureml_0fde3903c1ef012624b70630cfc21a12/lib/python3.6/site-packages/azureml/core/run.py", line 879, in _stream_run_output
    available_logs = self._get_logs(self._current_details)
  File "/azureml-envs/azureml_0fde3903c1ef012624b70630cfc21a12/lib/python3.6/site-packages/azureml/core/run.py", line 815, in _get_logs
    logs = [x for x in status["logFiles"] if re.match(self._output_logs_pattern, x)]
  File "/azureml-envs/azureml_0fde3903c1ef012624b70630cfc21a12/lib/python3.6/site-packages/azureml/core/run.py", line 815, in <listcomp>
    logs = [x for x in status["logFiles"] if re.match(self._output_logs_pattern, x)]
AttributeError: '_SubmittedRun' object has no attribute '_output_logs_pattern'
```

chengyu-liu-cs commented 4 years ago

Is anyone looking into this?

eedeleon commented 4 years ago

Hi, we have a solution for the timeout and are actively looking into the missing attribute.

To avoid the timeout, we suggest making explicit run.upload_folder calls. This bypasses the safety precautions that apply to the automatic outputs/ upload.

```python
run.upload_folder("artifact_path_of_your_choosing", "folder_to_upload")  # the artifact path can be prefixed with outputs/
```
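For example, here is a minimal sketch of a training script that writes artifacts to a local folder other than outputs/ and uploads it explicitly (the folder and file names below are just placeholders, not anything from your run):

```python
import os
from azureml.core import Run

run = Run.get_context()  # inside a submitted script this returns the submitted run

# Write artifacts to a local folder that is NOT auto-uploaded ("model_artifacts" is a placeholder name).
local_dir = "model_artifacts"
os.makedirs(local_dir, exist_ok=True)
with open(os.path.join(local_dir, "model.txt"), "w") as f:
    f.write("stand-in for your serialized model")

# Upload the folder explicitly; prefixing the artifact path with outputs/ keeps
# the files in the same place they would have landed with the auto upload.
run.upload_folder("outputs", local_dir)
```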

eedeleon commented 4 years ago

Actually, you uncovered a bug in the code, but I do not think the first issue will happen if you remove the wait_for_completion call from the remote script.

wait_for_completion is meant to be called from the script or notebook that kicks off the job, as a way of synchronizing code with the completion of that run, with some added logs to help during that process.

Thanks for uncovering this; we will work on handling that failure mode more gracefully to avoid future confusion. For now, if you try both calling upload_folder directly and removing the wait_for_completion call, you should be all set.
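If it helps, a minimal sketch of the submitting side, which is where wait_for_completion belongs (the experiment name is a placeholder, and this assumes a workspace config.json is available next to the script):

```python
# submit.py -- runs wherever you kick off the job, not on the compute target
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()                          # assumes ./config.json describes your workspace
exp = Experiment(workspace=ws, name="my-experiment")  # placeholder experiment name

src = ScriptRunConfig(source_directory=".", script="train.py")
run = exp.submit(src)

# Synchronize with the remote run here; train.py itself should not call this.
run.wait_for_completion(show_output=True)
```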

chengyu-liu-cs commented 4 years ago

Thanks @eedeleon for the comments. It would be nice if this information were also included in the SDK documentation.

I just ran the experiment twice based on your suggestion, i.e., removing wait_for_completion from the submitted script. Only one of the two runs failed, but the error message is the same. Here are more details:

I submitted a ScriptRunConfig and waited for completion:

```python
src = ScriptRunConfig(source_directory=script_folder, script=script, run_config=run_config)
run = experiment.submit(src, tags=tags)
run.wait_for_completion(show_output=True)
```

Inside the submitted script, I had the same code, i.e.:

```python
run = Run.get_submitted_run()
learning_pipeline(X_train, X_test, y_train, y_test)  # doing training and dumping models to the outputs folder
run.flush(timeout_seconds=600)
```

```
Logging experiment finalizing status in history service.
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.8896048069000244 seconds
Logging warning in history service: ERROR:: RunHistoryFinalization failed. Exception Details:
Traceback (most recent call last):
  File "/azureml-run/azureml-setup/context_managers.py", line 294, in __exit__
    self.history_context.__exit__(*args)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 56, in __exit__
    return self._exit_stack.__exit__(*args)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 380, in __exit__
    raise exc_details[1]
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 365, in __exit__
    if cb(*exc_details):
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 284, in _exit_wrapper
    return cm_exit(cm, *exc_details)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 150, in __exit__
    self.upload_tracked_files()
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 153, in upload_tracked_files
    self.py_wd.track(self.artifacts_client, self.container_id, self.trackfolders, self.deny_list)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 95, in track
    fs.track(artifacts_client, container_id, track_folders, blacklist)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/filesystem.py", line 81, in track
    raise AzureMLException("\n".join(exception_messages))
azureml._common.exceptions.AzureMLException: AzureMLException:
    Message: Failed to flush task queue within 300 seconds
    InnerException None
    ErrorResponse
{
    "error": {
        "message": "Failed to flush task queue within 300 seconds"
    }
}

Traceback (most recent call last):
  File "azureml-setup/context_manager_injector.py", line 311, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "azureml-setup/context_manager_injector.py", line 216, in execute_with_context
    print("[{}] FinalizingInRunHistory is not called".format(datetime.datetime.utcnow().isoformat()))
  File "/azureml-run/azureml-setup/_vendor_contextlib2.py", line 393, in __exit__
    _reraise_with_existing_context(exc_details)
  File "/azureml-run/azureml-setup/_vendor_contextlib2.py", line 261, in _reraise_with_existing_context
    raise exc_details[1]
  File "/azureml-run/azureml-setup/_vendor_contextlib2.py", line 382, in __exit__
    if cb(*exc_details):
  File "/azureml-run/azureml-setup/_vendor_contextlib2.py", line 314, in _exit_wrapper
    return cm_exit(cm, *exc_details)
  File "azureml-setup/context_manager_injector.py", line 67, in __exit__
    self.context_manager.__exit__(*exc_details)
  File "/azureml-run/azureml-setup/context_managers.py", line 294, in __exit__
    self.history_context.__exit__(*args)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 56, in __exit__
    return self._exit_stack.__exit__(*args)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 380, in __exit__
    raise exc_details[1]
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 365, in __exit__
    if cb(*exc_details):
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 284, in _exit_wrapper
    return cm_exit(cm, *exc_details)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 150, in __exit__
    self.upload_tracked_files()
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 153, in upload_tracked_files
    self.py_wd.track(self.artifacts_client, self.container_id, self.trackfolders, self.deny_list)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 95, in track
    fs.track(artifacts_client, container_id, track_folders, blacklist)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/filesystem.py", line 81, in track
    raise AzureMLException("\n".join(exception_messages))
azureml._common.exceptions.AzureMLException: AzureMLException:
    Message: Failed to flush task queue within 300 seconds
    InnerException None
    ErrorResponse
{
    "error": {
        "message": "Failed to flush task queue within 300 seconds"
    }
}

[2020-10-22T06:55:22.919015] Finished context manager injector with Exception.
```

eedeleon commented 4 years ago

We will make sure to update the docs as well.

Current:

```python
run = Run.get_submitted_run()
learning_pipeline(X_train, X_test, y_train, y_test)  # doing training and dumping models to the outputs folder
run.flush(timeout_seconds=600)
```

Target:

```python
run = Run.get_submitted_run()
learning_pipeline(X_train, X_test, y_train, y_test)  # doing training and dumping models to tmp_folder, not outputs
run.upload_folder("outputs/", "tmp_folder")  # this lets you keep the same artifact paths as before the change

# We call flush for you in the remote scenario
```

The above change circumvents the auto-upload logic and allows your code to upload without running into the safety precautions we have for the auto-upload feature.
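As a side note, the explicitly uploaded files behave like any other run artifacts afterwards, so downstream code keeps working. A small sketch of consuming them once the run has completed (the model name and local directory are placeholders, and `run` is assumed to be the completed submitted run, e.g. the object returned by experiment.submit):

```python
# Download the uploaded artifacts locally ("downloaded_outputs" is a placeholder directory).
run.download_files(prefix="outputs/", output_directory="downloaded_outputs")

# The uploaded folder can also be registered as a model directly from the run.
model = run.register_model(model_name="my-model", model_path="outputs")
```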

chengyu-liu-cs commented 4 years ago

@eedeleon I've resolved this issue by running training in a remote AML compute.

I thought removing wait_for_completion(show_output=True) from the training script would resolve the issue.

I will also try run.upload_folder("outputs/", "tmp_folder") later. Thank you @eedeleon

cartacioS commented 4 years ago

Hi @chengyu-liu-cs, can you let us know whether the suggestions resolved your problem and whether we may close this issue?