chengyu-liu-cs closed this issue 4 years ago
Here are some updates. When I ran the same code on AML compute, I got a similar, but not identical, error message.
`
[2020-10-15T12:11:57.322103] The experiment failed. Finalizing run...
[2020-10-15T12:11:57.322179] FinalizingInRunHistory is not called
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.21553349494934082 seconds
Traceback (most recent call last):
File "train.py", line 136, in
`
Is anyone looking into this?
Hi, we have a solution for the timeout and are actively looking into the missing attribute.
To avoid the timeout, we suggest explicit run.upload_folder calls. This bypasses the safety precautions we apply to the automatic outputs/ upload.
run.upload_folder("artifact_path_of_your_choosing", "folder_to_upload") # The artifact path can be prefixed with outputs/
Actually, you uncovered a bug in the code, but I do not think the first issue will happen if you remove the wait_for_completion call from the remote script.
wait_for_completion is meant to be called from the script or notebook that kicks off the job, as a way of synchronizing code with the completion of that run, with some added logs to help during that process.
Thanks for uncovering this; we will work on handling that failure mode more gracefully to avoid future confusion. For now, if you call upload_folder directly and remove the wait_for_completion call, you should be all set.
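For reference, here is a minimal sketch of that split. The workspace config, experiment name, and folder names below are placeholders (not the ones from this issue), and the compute/environment settings are omitted:
`
# submit.py -- the script or notebook that kicks off the job.
# This is the only place wait_for_completion should be called.
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()                                      # assumes a local config.json
experiment = Experiment(workspace=ws, name="example-experiment")  # placeholder name
src = ScriptRunConfig(source_directory="./scripts", script="train.py")
run = experiment.submit(src)
run.wait_for_completion(show_output=True)  # synchronize with the remote run here, not inside train.py
`
The remote train.py should not call wait_for_completion at all; the context managers injected around the remote run handle finalization and flushing when the script exits.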
Thanks @eedeleon for the comments. It would be nice if this information were also included in the SDK documentation.
I just ran the experiment twice based on your suggestion, i.e., removing wait_for_completion from the kicked-off script. One of the runs still failed, and the error message is the same. Here are more details:
I submitted a ScriptRunConfig and waited for completion:
`
src = ScriptRunConfig(source_directory=script_folder, script=script, run_config=run_config)
run = experiment.submit(src, tags=tags)
run.wait_for_completion(show_output=True)
`
Inside the submitted script, I had the same code, i.e.:
`
run = Run.get_submitted_run()
learning_pipeline(X_train, X_test, y_train, y_test)  # doing training and dumping models to the outputs folder
run.flush(timeout_seconds=600)
`
`
Logging experiment finalizing status in history service.
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.8896048069000244 seconds
Logging warning in history service: ERROR:: RunHistoryFinalization failed. . Exception Details:
Traceback (most recent call last):
  File "/azureml-run/azureml-setup/context_managers.py", line 294, in __exit__
    self.history_context.__exit__(*args)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 56, in __exit__
    return self._exit_stack.__exit__(*args)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 380, in __exit__
    raise exc_details[1]
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 365, in __exit__
    if cb(*exc_details):
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/contextlib.py", line 284, in _exit_wrapper
    return cm_exit(cm, *exc_details)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 150, in __exit__
    self.upload_tracked_files()
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 153, in upload_tracked_files
    self.py_wd.track(self.artifacts_client, self.container_id, self.trackfolders, self.deny_list)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 95, in track
    fs.track(artifacts_client, container_id, track_folders, blacklist)
  File "/azureml-envs/azureml_bbeaa79060989ad823d294ebe7257036/lib/python3.6/site-packages/azureml/_history/utils/filesystem.py", line 81, in track
    raise AzureMLException("\n".join(exception_messages))
azureml._common.exceptions.AzureMLException: AzureMLException:
	Message: Failed to flush task queue within 300 seconds
	InnerException None
	ErrorResponse
{
    "error": {
        "message": "Failed to flush task queue within 300 seconds"
    }
}

Traceback (most recent call last):
  File "azureml-setup/context_manager_injector.py", line 311, in
[2020-10-22T06:55:22.919015] Finished context manager injector with Exception.
`
We will make sure to update the docs as well.
Current:
run = Run.get_submitted_run()
learning_pipeline(X_train, X_test, y_train, y_test) # doing training and dumping models to the outputs folder
run.flush(timeout_seconds=600)
Target:
run = Run.get_submitted_run()
learning_pipeline(X_train, X_test, y_train, y_test) # doing training and dumping models to the (tmp_folder) not outputs
run.upload_folder("outputs/", "tmp_folder") # This allows you to have the same artifact_paths from before the changes
# We call flush for you in the remote scenario
The above change circumvents the auto-upload logic and lets your code upload without running into the safety precautions we have in place for the auto-upload feature.
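To make that concrete, here is a minimal, self-contained sketch of what the remote training script could look like after the change. The folder name and the stand-in model file are illustrative placeholders, not part of the SDK:
`
# train.py -- remote training script (hedged sketch; folder and file names are placeholders)
import os
from azureml.core import Run

run = Run.get_context()  # returns the same submitted run as Run.get_submitted_run()

tmp_folder = "tmp_model_artifacts"   # any folder other than ./outputs or ./logs
os.makedirs(tmp_folder, exist_ok=True)

# Train and dump your models into tmp_folder instead of ./outputs;
# the file below is only a placeholder for real model artifacts.
with open(os.path.join(tmp_folder, "model.txt"), "w") as f:
    f.write("stand-in for a real model file")

# Upload explicitly under outputs/ so the artifact paths match the previous layout.
run.upload_folder("outputs/", tmp_folder)
# No wait_for_completion and no manual run.flush here -- the run is flushed
# for you when the remote script finishes.
`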
@eedeleon I've resolved this issue by running the training on a remote AML compute. I thought removing wait_for_completion(show_output=True) from the training script would resolve the issue. I will also try run.upload_folder("outputs/", "tmp_folder") later.
Thank you @eedeleon
Hi @chengyu-liu-cs, can you let us know whether this resolved your problem and whether we may close this issue?
I am getting "Failed to flush task queue within 300 seconds" when I execute a Run. It seems the files generated inside "outputs" are very big (about 400+ MB in total) and the error is related to the flush(timeout_seconds=300) function, but it is not clear when and how to use it. Clearer documentation or examples would be nice. I have put run.flush(timeout_seconds=600) on different lines but I am still getting the same error. An example is below:
`
run = Run.get_submitted_run()
learning_pipeline(X_train, X_test, y_train, y_test)  # doing training and dumping models to the outputs folder
run.flush(timeout_seconds=600)
run.wait_for_completion(show_output=True, wait_post_processing=True)
`
logs
[2020-10-15T11:12:05.280657] The experiment failed. Finalizing run...
[2020-10-15T11:12:05.281220] Start FinalizingInRunHistory
Logging experiment finalizing status in history service.
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 2.9097180366516113 seconds
Logging warning in history service: ERROR:: RunHistoryFinalization failed. . Exception Details:
Traceback (most recent call last):
  File "azureml-setup/context_manager_injector.py", line 166, in execute_with_context
    runpy.run_path(sys.argv[0], globals(), run_name="__main__")
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "train.py", line 136, in
    run.wait_for_completion(show_output=True, wait_post_processing=True)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/core/run.py", line 689, in wait_for_completion
    raise_on_error=raise_on_error)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/core/run.py", line 880, in _stream_run_output
    available_logs = self._get_logs(self._current_details)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/core/run.py", line 816, in _get_logs
    logs = [x for x in status["logFiles"] if re.match(self._output_logs_pattern, x)]
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/core/run.py", line 816, in
    logs = [x for x in status["logFiles"] if re.match(self._output_logs_pattern, x)]
AttributeError: '_SubmittedRun' object has no attribute '_output_logs_pattern'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/azureml-run/azureml-setup/context_managers.py", line 284, in __exit__
    self.history_context.__exit__(*args)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 56, in __exit__
    return self._exit_stack.__exit__(*args)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/contextlib.py", line 380, in __exit__
    raise exc_details[1]
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/contextlib.py", line 365, in __exit__
    if cb(*exc_details):
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/contextlib.py", line 284, in _exit_wrapper
    return cm_exit(cm, *exc_details)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 149, in __exit__
    self.py_wd.track(self.run_tracker, self.trackfolders, self.deny_list)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/_history/utils/context_managers.py", line 95, in track
    fs.track(run_tracker, track_folders, blacklist)
  File "/azureml-envs/azureml_f57f11c5666752cd12024ebdd4cacb02/lib/python3.6/site-packages/azureml/_history/utils/filesystem.py", line 81, in track
    raise AzureMLException("\n".join(exception_messages))
azureml._common.exceptions.AzureMLException: AzureMLException:
	Message: Failed to flush task queue within 300 seconds
	InnerException None
	ErrorResponse
{
    "error": {
        "message": "Failed to flush task queue within 300 seconds"
    }
}