allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

local docker trains server don't work and prompts - allow delete (api) error #58

Open uriariel opened 4 years ago

uriariel commented 4 years ago

A Trains server running under docker-compose is unable to record experiments and reports the errors below.

These are the errors from the trains integration in my code:

Starting epoch 1, train length: 23, validation length: 3
2020-08-17 03:54:37,397 - trains.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '90b57eea628b4a41b6487872aad826c2', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625675373, 'type': 'log', 'task': '7a140322ed874f3fa070edc443de0277', 'level': 'info', 'worker': 'uriariel-UX430UNR', 'msg': 'TRAINS Task: created new task id=7a140322ed874f3fa070edc443de0277\nTRAINS results page: http://localhost:8080/projects/beaae6c1c5004c778cbd73da3a3bfcd0/experiments/7a140322ed874f3fa070edc443de0277/output/log\n======> WARNING! UNCOMMITTED CHANGES IN REPOSITORY git@github.com:uriariel/deep-learning-final-project.git <======\nStarting epoch 1, train length: 23, validation length: 3', '@timestamp': '2020-08-17T00:54:37.390Z', 'metric': '', 'variant': ''}}}]), extra_info=index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>)
batch time: 3.1553523540496826, batch max: 111, batch min: 104
2020-08-17 03:54:39,396 - trains.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'ec7ae2b8309d4c42ae71f581d3a655b8', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625678761, 'type': 'log', 'task': '7a140322ed874f3fa070edc443de0277', 'level': 'info', 'worker': 'uriariel-UX430UNR', 'msg': 'batch time: 3.1553523540496826, batch max: 111, batch min: 104', '@timestamp': '2020-08-17T00:54:39.389Z', 'metric': '', 'variant': ''}}}]), extra_info=index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>)
2020-08-17 03:54:39,414 - trains.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '172fc5eeb4795e9ead44fb62af3f6a68', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625678754, 'type': 'training_stats_scalar', 'task': '7a140322ed874f3fa070edc443de0277', 'iter': 0, 'metric': 'Loss', 'variant': 'train_loss', 'value': 4.6759076, '@timestamp': '2020-08-17T00:54:39.409Z', 'worker': 'uriariel-UX430UNR'}}}]), extra_info=index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>
2020-08-17 03:54:39,415 - trains.Metrics - ERROR - Failed reporting metrics: <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '172fc5eeb4795e9ead44fb62af3f6a68', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625678754, 'type': 'training_stats_scalar', 'task': '7a140322ed874f3fa070edc443de0277', 'iter': 0, 'metric': 'Loss', 'variant': 'train_loss', 'value': 4.6759076, '@timestamp': '2020-08-17T00:54:39.409Z', 'worker': 'uriariel-UX430UNR'}}}]), extra_info=index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>
2020-08-17 03:54:41,413 - trains.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '3b2102851b3d49a8b8e38aa0b4498007', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625679415, 'type': 'log', 'task': '7a140322ed874f3fa070edc443de0277', 'level': 'info', 'worker': 'uriariel-UX430UNR', 'msg': "2020-08-17 03:54:39,414 - trains.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '172fc5eeb4795e9ead44fb62af3f6a68', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625678754, 'type': 'training_stats_scalar', 'task': '7a140322ed874f3fa070edc443de0277', 'iter': 0, 'metric': 'Loss', 'variant': 'train_loss', 'value': 4.6759076, '@timestamp': '2020-08-17T00:54:39.409Z', 'worker': 'uriariel-UX430UNR'}}}]), extra_info=index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>\n2020-08-17 03:54:39,415 - trains.Metrics - ERROR - Failed reporting metrics: <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '172fc5eeb4795e9ead44fb62af3f6a68', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index 
[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625678754, 'type': 'training_stats_scalar', 'task': '7a140322ed874f3fa070edc443de0277', 'iter': 0, 'metric': 'Loss', 'variant': 'train_loss', 'value': 4.6759076, '@timestamp': '2020-08-17T00:54:39.409Z', 'worker': 'uriariel-UX430UNR'}}}]), extra_info=index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>", '@timestamp': '2020-08-17T00:54:41.398Z', 'metric': '', 'variant': ''}}}]), extra_info=index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>)
batch time: 3.2711617946624756, batch max: 111, batch min: 111
2020-08-17 03:54:43,409 - trains.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '5206efc5a50efdaed320ed5f7e7983f1', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625682066, 'type': 'training_stats_scalar', 'task': '7a140322ed874f3fa070edc443de0277', 'iter': 1, 'metric': 'Loss', 'variant': 'train_loss', 'value': 4.67215, '@timestamp': '2020-08-17T00:54:43.396Z', 'worker': 'uriariel-UX430UNR'}}}]), extra_info=index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>
2020-08-17 03:54:43,410 - trains.Metrics - ERROR - Failed reporting metrics: <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '5206efc5a50efdaed320ed5f7e7983f1', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625682066, 'type': 'training_stats_scalar', 'task': '7a140322ed874f3fa070edc443de0277', 'iter': 1, 'metric': 'Loss', 'variant': 'train_loss', 'value': 4.67215, '@timestamp': '2020-08-17T00:54:43.396Z', 'worker': 'uriariel-UX430UNR'}}}]), extra_info=index [events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>
2020-08-17 03:54:43,428 - trains.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '7ee1da6bd2f84a989e166334b7840286', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625682072, 'type': 'log', 'task': '7a140322ed874f3fa070edc443de0277', 'level': 'info', 'worker': 'uriariel-UX430UNR', 'msg': 'batch time: 3.2711617946624756, batch max: 111, batch min: 111', '@timestamp': '2020-08-17T00:54:43.404Z', 'metric': '', 'variant': ''}}}]), extra_info=index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>)

And these are the errors I get from docker-compose logs:

trains-apiserver  | [2020-08-17 00:58:32,149] [9] [ERROR] [trains.service_repo] Returned 500 for workers.status_report in 23ms, msg=General data error (Failed processing worker status report): err=7 document(s) failed to index.
trains-apiserver  | [2020-08-17 00:58:32,524] [9] [ERROR] [trains.service_repo] Returned 500 for queues.get_next_task in 7ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08', '_type': '_doc', '_id': 'Ufvs-XMBC4p8rKHYzB_K', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625912521, 'queue': 'd89397bd112149838d9fd0c346b93e34', 'average_waiting_time': 0, 'queue_length': 0}}}]), extra_info=index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
trains-apiserver  | [2020-08-17 00:58:37,555] [9] [ERROR] [trains.service_repo] Returned 500 for queues.get_next_task in 14ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08', '_type': '_doc', '_id': 'Uvvs-XMBC4p8rKHY4B9v', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625917547, 'queue': 'd89397bd112149838d9fd0c346b93e34', 'average_waiting_time': 0, 'queue_length': 0}}}]), extra_info=index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
trains-apiserver  | [2020-08-17 00:58:42,582] [9] [ERROR] [trains.service_repo] Returned 500 for queues.get_next_task in 6ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08', '_type': '_doc', '_id': 'U_vs-XMBC4p8rKHY9B8U', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625922579, 'queue': 'd89397bd112149838d9fd0c346b93e34', 'average_waiting_time': 0, 'queue_length': 0}}}]), extra_info=index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
trains-apiserver  | [2020-08-17 00:58:47,605] [9] [ERROR] [trains.service_repo] Returned 500 for queues.get_next_task in 7ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08', '_type': '_doc', '_id': 'VPvt-XMBC4p8rKHYBx-y', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625927600, 'queue': 'd89397bd112149838d9fd0c346b93e34', 'average_waiting_time': 0, 'queue_length': 0}}}]), extra_info=index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
trains-apiserver  | [2020-08-17 00:58:52,638] [9] [ERROR] [trains.service_repo] Returned 500 for queues.get_next_task in 13ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08', '_type': '_doc', '_id': 'Vfvt-XMBC4p8rKHYGx9Z', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1597625932630, 'queue': 'd89397bd112149838d9fd0c346b93e34', 'average_waiting_time': 0, 'queue_length': 0}}}]), extra_info=index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-08] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];

jkhenning commented 4 years ago

Hi @uriariel ,

The response seems to indicate that the Elasticsearch component is working in read-only mode. This can have several causes, among them resource issues (e.g. insufficient disk space) or a migration issue.
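
For anyone trying to confirm the same condition from their own output, here is a minimal sketch that pulls the affected index names out of an error message like the ones quoted earlier in this thread. The function name `blocked_indices` is hypothetical and not part of trains; the pattern is simply copied from the `blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]` text in the logs.

```python
import re
from typing import List

# Matches the block message as it appears in the logs quoted in this thread, e.g.
# "index [events-log-...] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
BLOCK_RE = re.compile(
    r"index \[([^\]]+)\] blocked by: "
    r"\[FORBIDDEN/12/index read-only / allow delete \(api\)\]"
)


def blocked_indices(message: str) -> List[str]:
    """Return the distinct index names reported as read-only, in order of first appearance.

    Hypothetical helper for illustration only; it is not part of trains.
    """
    seen: List[str] = []
    for name in BLOCK_RE.findall(message):
        if name not in seen:
            seen.append(name)
    return seen
```

If the returned list is non-empty, the writes are being rejected by the Elasticsearch read-only block rather than by the API server itself.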

Two questions:

uriariel commented 4 years ago

I installed the server and used it with no issues until I shut it down after a few hours (with docker-compose down). When I tried to re-run it, I ran into this issue.

I'm sorry, I solved the problem by removing all of the data directories and recreating them, so I don't think that the current logs will be of any help.

jkhenning commented 4 years ago

Just a question: is it possible that the previous server you installed (and then took down) was v0.15.1 (or earlier) and the newly installed server is v0.16.0?

uriariel commented 4 years ago

I don't think so, I didn't touch the docker-compose.yml file.

jkhenning commented 4 years ago

OK, thanks for your time 👍 You're welcome to close this issue if the problem is currently solved (you can always reopen if it happens again...)

uriariel commented 4 years ago

After running into this issue again, I understood that the problem occurs when the host running Docker is almost out of storage. Elasticsearch has a feature that switches it into read-only mode if the server has storage issues.

I just cleaned up a bit and it got back to normal.
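
For reference, the feature in question is the Elasticsearch disk watermark mechanism: by default the low, high and flood-stage watermarks are 85%, 90% and 95% of disk used, and crossing the flood stage puts the `read_only_allow_delete` block on indices. A rough sketch for checking a host against those defaults follows. The function names and the default path are assumptions for illustration, and the used fraction is only an approximation of what Elasticsearch itself computes.

```python
import shutil

# Elasticsearch's documented default disk watermarks, as fractions of disk used.
# A deployment may override these, so treat them as assumptions.
LOW, HIGH, FLOOD_STAGE = 0.85, 0.90, 0.95


def watermark_level(used_fraction: float) -> str:
    """Map a used-disk fraction (0.0 to 1.0) to the highest watermark it reaches."""
    if used_fraction >= FLOOD_STAGE:
        return "flood_stage"
    if used_fraction >= HIGH:
        return "high"
    if used_fraction >= LOW:
        return "low"
    return "ok"


def check_path(path: str = "/") -> str:
    """Check the filesystem holding `path` (hypothetical default: the root filesystem)."""
    usage = shutil.disk_usage(path)
    # Approximation: fraction of the disk not available to the current user.
    used_fraction = 1.0 - usage.free / usage.total
    return watermark_level(used_fraction)


if __name__ == "__main__":
    # Point this at the directory that holds the Elasticsearch data on your host.
    print(check_path("/"))
```

A result of `flood_stage` would be consistent with the read-only errors above; anything below `low` suggests looking elsewhere.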

jkhenning commented 4 years ago

I see. See my last comment here for more info on how to configure ES (via our docker-compose) for different watermarks. However, the best option is more disk space 😄
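
As an alternative to the docker-compose configuration mentioned above, the same watermark settings can also be changed through the Elasticsearch cluster settings API, and the read-only block can be removed through the index settings API. The sketch below only builds the two HTTP requests. The base URL is an assumption (the compose setup may not publish port 9200 to the host), and the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable here; adjust to your setup.
ES_URL = "http://localhost:9200"


def _put_json(url: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request with a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def watermark_request(low: str, high: str, flood_stage: str,
                      base_url: str = ES_URL) -> urllib.request.Request:
    """Request that sets the three disk watermarks as persistent cluster settings."""
    body = {
        "persistent": {
            "cluster.routing.allocation.disk.watermark.low": low,
            "cluster.routing.allocation.disk.watermark.high": high,
            "cluster.routing.allocation.disk.watermark.flood_stage": flood_stage,
        }
    }
    return _put_json(f"{base_url}/_cluster/settings", body)


def clear_block_request(base_url: str = ES_URL) -> urllib.request.Request:
    """Request that resets the read-only / allow-delete block on all indices."""
    body = {"index.blocks.read_only_allow_delete": None}  # serialized as JSON null
    return _put_json(f"{base_url}/_all/_settings", body)
```

Either request can be sent with `urllib.request.urlopen(req)`. The values passed to `watermark_request` must be all percentages or all absolute sizes, and if disk usage is still above the flood stage the block will simply come back, so freeing space remains the real fix.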

uriariel commented 4 years ago

Thanks