BrianG13 opened this issue 3 years ago
Hi @BrianG13,
This issue (as well as https://github.com/allegroai/clearml/issues/315) seems to be related to insufficient server disk space, causing ES to go into read-only mode or to turn active shards inactive or unassigned.
The disk watermarks controlling the ES free-disk constraints are defined by default as a percentage of the disk space (so it might look to you like you still have plenty of space, but ES thinks otherwise).
If you don't have enough free disk space, clean up files to create more, or resize your partition.
If you do have enough space, you can configure different ES settings in the docker-compose.yml file (see here - there are 3 settings, all can be identical).
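For reference, the three settings mentioned are presumably the ES disk-allocation watermarks. A minimal sketch of how they could be added to the elasticsearch service's environment section (the values are illustrative, and flood_stage only exists from ES 6.x on, so it would not apply to the ES 5.6 image shown later in this thread):

  elasticsearch:
    environment:
      # illustrative values; percentages (e.g. "97%") also work, and all three can be identical
      cluster.routing.allocation.disk.watermark.low: 10gb
      cluster.routing.allocation.disk.watermark.high: 10gb
      cluster.routing.allocation.disk.watermark.flood_stage: 10gb  # ES 6.x+ only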
Hi @jkhenning, thanks for your reply!
I freed a lot of space on my disk as recommended, but I still can't access any log/plot of any experiment, even old experiments that used to work.
Is there anything I can do to fix this?
How much disk space did you free up? What is the total size of your disk?
I freed something like 3 TB. My disk's total size is 7.3 TB.
I have another question: I want to try restarting my docker containers, but I am wondering whether all the experiment history & logs will disappear as a result of the restart.
Do you know what the worst-case scenario is for following the above steps?
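(As a quick aside, the free space ES actually sees, and how much each service's data takes up, can be checked on the host roughly like this; the path assumes the data directory mounted in the docker-compose.yml shown further down.)

df -h /home/orpat/trains/data      # free space on the partition holding the data mounts
du -sh /home/orpat/trains/data/*   # per-service usage (elastic, mongo, fileserver, redis)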
> I freed something like 3 TB. My disk's total size is 7.3 TB.
Can you share the docker-compose.yml you're using?
> I have another question: I want to try restarting my docker containers, but I am wondering whether all the experiment history & logs will disappear as a result of the restart.
When deploying the server, you were supposed to create and mount a local data directory (for Trains Server, this is usually /opt/trains/data). Assuming the server uses that mounted directory for data storage, you shouldn't lose anything. Making sure this is the case can be done in several ways (here are a few "primitive" ones, in order of ascending confidence 🙂):
1. cd /opt/trains/data and then listing files in the various subfolders to see that there's something there and that it was recently modified (i.e. ls -la elastic and ls -la mongo)
2. Creating a temp file in the mounted host folder, then opening a bash shell inside the respective docker container (e.g. docker exec -it trains-elastic /bin/bash) and looking in the internally mounted folder. If the temp file can be seen there, the container is mounted correctly and the data will not vanish on restart (see the sketch after this list).
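A minimal sketch of that temp-file check, assuming the elastic data directory from the docker-compose.yml below (/home/orpat/trains/data/elastic on the host, /usr/share/elasticsearch/data inside the container):

# on the host: drop a marker file into the mounted data directory
touch /home/orpat/trains/data/elastic/mount_check
# inside the container: the same file should be visible under the internal mount point
docker exec trains-elastic ls -la /usr/share/elasticsearch/data/mount_check
# clean up the marker afterwards
rm /home/orpat/trains/data/elastic/mount_check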
My docker-compose file:

version: "3.6"
services:
  apiserver:
    command:
    - apiserver
    container_name: trains-apiserver
    image: allegroai/trains:0.14.2
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/logs:/var/log/trains
    - /home/orpat/trains/config:/opt/trains/config
    depends_on:
    - redis
    - mongo
    - elasticsearch
    - fileserver
    environment:
      TRAINS_ELASTIC_SERVICE_HOST: elasticsearch
      TRAINS_ELASTIC_SERVICE_PORT: 9200
      TRAINS_MONGODB_SERVICE_HOST: mongo
      TRAINS_MONGODB_SERVICE_PORT: 27017
      TRAINS_REDIS_SERVICE_HOST: redis
      TRAINS_REDIS_SERVICE_PORT: 6379
      TRAINS__apiserver__mongo__pre_populate__enabled: "true"
      TRAINS__apiserver__mongo__pre_populate__zip_file: "/home/orpat/trains/db-pre-populate/export.zip"
    ports:
    - "8008:8008"
    networks:
    - backend
  elasticsearch:
    networks:
    - backend
    container_name: trains-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: trains
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      discovery.zen.minimum_master_nodes: "1"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: trains
      reindex.remote.whitelist: '*.*'
      script.inline: "true"
      script.painless.regex.enabled: "true"
      script.update: "true"
      thread_pool.bulk.queue_size: "2000"
      thread_pool.search.queue_size: "10000"
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.16
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/data/elastic:/usr/share/elasticsearch/data
    ports:
    - "9200:9200"
  fileserver:
    networks:
    - backend
    command:
    - fileserver
    container_name: trains-fileserver
    image: allegroai/trains:latest
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/logs:/var/log/trains
    - /home/orpat/trains/data/fileserver:/mnt/fileserver
    ports:
    - "8081:8081"
  mongo:
    networks:
    - backend
    container_name: trains-mongo
    image: mongo:3.6.5
    restart: unless-stopped
    command: --setParameter internalQueryExecMaxBlockingSortBytes=196100200
    volumes:
    - /home/orpat/trains/data/mongo/db:/data/db
    - /home/orpat/trains/data/mongo/configdb:/data/configdb
    ports:
    - "27017:27017"
  redis:
    networks:
    - backend
    container_name: trains-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/data/redis:/data
    ports:
    - "6379:6379"
  webserver:
    command:
    - webserver
    container_name: trains-webserver
    image: allegroai/trains:latest
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/logs:/var/log/trains
    depends_on:
    - apiserver
    ports:
    - "8080:80"
networks:
  backend:
    driver: bridge
I can't find any /opt/trains/data folder on my machine (it wasn't me who installed the Trains framework).
Can you tell from the docker-compose.yml file where the local data directory is mounted?
Yeah @BrianG13, according to the docker-compose.yml, everything should be under /home/orpat/trains/data
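One more way to confirm this, assuming the container names from the compose file above: docker inspect can print the bind mounts each container actually uses.

# prints "host path -> container path" for every mount of the listed containers
docker inspect -f '{{ range .Mounts }}{{ .Source }} -> {{ .Destination }}{{ println }}{{ end }}' \
    trains-elastic trains-mongo trains-fileserver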
Update:
I restarted the docker containers. Now I can see experiment logs in the browser, but when I access the 'Scalars' tab I still get the same error :(
2021-03-06 14:32:03,622 - trains.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error: err=('15 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'f00193b3880869e7f546a55d65b98f67', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863401, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 0', 'value': 578.055079569548, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'db98b3c4f32c97e27d09dd7f74d86ce5', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863402, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 0', 'value': 51.464706368495534, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '1f985d35d0bc3f6ab5d3eafd4a47f30d', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_0', 'value': 7.016663551330566, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '8bd4df956e983dcee6c59e340b82dd6e', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_0', 'value': 3.025986543026301, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '0bcf50b533cb318d83d7e8c9f8deb663', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests 
and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_0', 'value': 4.204140663146973, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '9283001e48e8b06ed89ccf3161bbdfdd', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 1', 'value': 560.4806488044111, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'ef6a60450fba482b9f8a24c590d22cd7', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 1', 'value': 51.1660772014198, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'eb33a4f6ea0d94dfb1288047f328b20b', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_1', 'value': 7.061158180236816, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '0e694d220867c7d379f4b3984d9822cf', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_1', 'value': 3.009880605171476, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '605feadca469847f68383c89cf5523d5', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active 
Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_1', 'value': 4.164150238037109, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '907cee8f52228bf47a42ae56e944b796', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 2', 'value': 615.9076488887576, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '5e464f8c3b9340b39c1b5edcbe2b6574', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 2', 'value': 51.178997305144016, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '113d60664a45cb223d83bac731440fdc', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_2', 'value': 7.006958961486816, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '65d4bae76f2cb906eca5b25efad8976e', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_2', 'value': 3.1044640400248826, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '6a2f9498e76295154db186d6a0ec7266', 'status': 503, 'error': {'type': 
'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_2', 'value': 4.520617485046387, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}]), extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh])>
2021-03-06 14:32:03,623 - trains.Metrics - ERROR - Failed reporting metrics: <500/100: events.add_batch/v1.0 (General data error: err=('15 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'f00193b3880869e7f546a55d65b98f67', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863401, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 0', 'value': 578.055079569548, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'db98b3c4f32c97e27d09dd7f74d86ce5', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863402, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 0', 'value': 51.464706368495534, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '1f985d35d0bc3f6ab5d3eafd4a47f30d', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_0', 'value': 7.016663551330566, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '8bd4df956e983dcee6c59e340b82dd6e', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_0', 'value': 3.025986543026301, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '0bcf50b533cb318d83d7e8c9f8deb663', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] 
requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_0', 'value': 4.204140663146973, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '9283001e48e8b06ed89ccf3161bbdfdd', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 1', 'value': 560.4806488044111, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'ef6a60450fba482b9f8a24c590d22cd7', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 1', 'value': 51.1660772014198, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'eb33a4f6ea0d94dfb1288047f328b20b', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_1', 'value': 7.061158180236816, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '0e694d220867c7d379f4b3984d9822cf', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_1', 'value': 3.009880605171476, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '605feadca469847f68383c89cf5523d5', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not 
active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_1', 'value': 4.164150238037109, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '907cee8f52228bf47a42ae56e944b796', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 2', 'value': 615.9076488887576, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '5e464f8c3b9340b39c1b5edcbe2b6574', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 2', 'value': 51.178997305144016, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '113d60664a45cb223d83bac731440fdc', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_2', 'value': 7.006958961486816, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '65d4bae76f2cb906eca5b25efad8976e', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_2', 'value': 3.1044640400248826, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '6a2f9498e76295154db186d6a0ec7266', 'status': 503, 'error': 
{'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_2', 'value': 4.520617485046387, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}]), extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh])>
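The unavailable_shards_exception / "primary shard is not active" errors above mean the shard backing that events index was never reassigned after the disk filled up. Assuming ES is reachable on localhost:9200 as mapped in the compose file, a rough way to inspect the allocation state and retry previously failed allocations (all of these APIs exist in ES 5.x):

# overall cluster status (red = some primary shards are unassigned)
curl -s 'http://localhost:9200/_cluster/health?pretty'
# list only shards that are not in STARTED state
curl -s 'http://localhost:9200/_cat/shards?v' | grep -v STARTED
# ask ES why a shard is unassigned
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'
# retry allocations that previously failed (e.g. while the disk watermark was exceeded)
curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'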
Is there an option to use ClearML with my old Trains data?
Sure 🙂 Which version of Trains Server are you running?
(v0.14)
You will have to follow the migration procedure which is required when upgrading from server versions <= 0.15, see here: https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_es7_migration.html
Hi,
I am using Trains (v0.14) (yes, I know there is a newer version, ClearML, and that I should upgrade; I am in the middle of a project that ends in 2 weeks and will make the upgrade after that).
Meanwhile, I have suddenly run into a problem: I can't access any experiment on my dashboard. The error I am receiving in a pop-up is:
Any idea why this is happening and how I can fix it?
Something else I suddenly noticed is that my terminal is now printing something like this:
I would appreciate your help, thanks!