allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Error 100 : General data error (TransportError(503, 'search_phase_execution_exception')) #316

Open · BrianG13 opened this issue 3 years ago

BrianG13 commented 3 years ago

Hi,

I am using Trains (v0.14). (Yes, I know there is a new version, ClearML, and that I should upgrade; I'm in the middle of a project that ends in 2 weeks, and I will upgrade afterwards.)

Meanwhile, I suddenly encountered a problem where I can't access any experiment on my dashboard. The error I am getting in a pop-up is: (see attached screenshots)

Any idea why this is happening and how I can fix it?

Something else I suddenly noticed is that my terminal is now printing output like this:

2021-03-02 17:17:24,454 - trains.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'a274e8b6edf243f0bf01b4f05fa38dca', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][event][a274e8b6edf243f0bf01b4f05fa38dca], source[{"timestamp": 1614698139485, "type": "log", "task": "c4bf15bf31e74071b1d9c91bb219d91b", "level": "info", "worker": "momo", "msg": "Processing Subject: S1 , Action: Directions , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Directions 1 , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Discussion , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Discussion 1 , Cameras: (0, 1, 2, 3)", "@timestamp": "2021-03-02T15:16:24.445Z", "metric": "", "variant": ""}]}] and a refresh]'}, 'data': {'timestamp': 1614698139485, 'type': 'log', 'task': 'c4bf15bf31e74071b1d9c91bb219d91b', 'level': 'info', 'worker': 'momo', 'msg': 'Processing Subject: S1 , Action: Directions , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Directions 1 , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Discussion , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Discussion 1 , Cameras: (0, 1, 2, 3)', '@timestamp': '2021-03-02T15:16:24.445Z', 'metric': '', 'variant': ''}}}]), extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [index {[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][event][a274e8b6edf243f0bf01b4f05fa38dca], source[{"timestamp": 1614698139485, "type": "log", "task": "c4bf15bf31e74071b1d9c91bb219d91b", "level": "info", "worker": "momo", "msg": "Processing Subject: S1 , Action: Directions , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Directions 1 , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Discussion , Cameras: (0, 1, 2, 3)\nProcessing Subject: S1 , Action: Discussion 1 , Cameras: (0, 1, 2, 3)", "@timestamp": "2021-03-02T15:16:24.445Z", "metric": "", "variant": ""}]}] and a refresh])>) 2021-03-02 17:17:25,142 - trains.log - INFO - Flush timeout 10.0s exceeded, dropping last 26 lines

I would appreciate your help, thanks!

jkhenning commented 3 years ago

Hi @BrianG13,

This issue (as well as https://github.com/allegroai/clearml/issues/315) seems to be related to insufficient server disk space, which causes ES to go into read-only mode or to turn active shards inactive or unassigned.

The disk watermarks controlling the ES free-disk constraints are defined by default as a percentage of the disk space (so it might look to you like you still have plenty of space, but ES thinks otherwise).

If you don't have enough free disk space, clean up files to create more, or resize your partition. If you do have enough space, you can configure different ES settings in the docker-compose.yml file (see here - there are 3 settings, all of which can be identical), as sketched below.
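For illustration, a minimal sketch of what such an override could look like in the elasticsearch service's environment section. The values here are assumptions, not recommendations; the three settings are the low/high/flood_stage disk watermarks (note that flood_stage only exists on ES 6.x and later, so on the ES 5.6 image used in this deployment only low and high apply):

  elasticsearch:
    environment:
      # Absolute free-space thresholds instead of the default percentages.
      # Example values only; all three can be set to the same value.
      cluster.routing.allocation.disk.watermark.low: "10gb"
      cluster.routing.allocation.disk.watermark.high: "10gb"
      cluster.routing.allocation.disk.watermark.flood_stage: "10gb"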

BrianG13 commented 3 years ago

Hi @jkhenning, thanks for your reply!

I freed up a lot of space on my disk as recommended, but I still can't access the logs/plots of any experiment, even old experiments that previously worked.

Is there anything I can do to fix this?

jkhenning commented 3 years ago

How much disk space did you free up? What is the total size of your disk?

BrianG13 commented 3 years ago

I freed up something like 3 TB. My disk's total size is 7.3 TB.

I have another question: I want to try restarting my Docker containers, but I am wondering if all the experiment history & logs will disappear as a result of the restart. (see attached screenshot of the restart instructions)

Do you know what the worst-case scenario is for following the above instructions?

jkhenning commented 3 years ago

> I freed up something like 3 TB. My disk's total size is 7.3 TB.

Can you share the docker-compose.yml you're using?

> I have another question: I want to try restarting my Docker containers, but I am wondering if all the experiment history & logs will disappear as a result of the restart.

When deploying the server, you were supposed to create and mount a local data directory (for Trains Server, this is usually /opt/trains/data). Assuming the server uses that mounted directory for data storage, you shouldn't lose anything. You can verify this is the case in several ways (here are a few "primitive" ones, in ascending order of confidence 🙂):
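For example, one quick check (a sketch; the container names are assumptions based on the default Trains deployment) is to list each container's bind mounts and confirm they point at real host directories:

docker inspect trains-elastic trains-mongo trains-fileserver --format '{{ .Name }}: {{ json .Mounts }}'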

BrianG13 commented 3 years ago

My docker-compose file:

version: "3.6"
services:

  apiserver:
    command:
    - apiserver
    container_name: trains-apiserver
    image: allegroai/trains:0.14.2
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/logs:/var/log/trains
    - /home/orpat/trains/config:/opt/trains/config
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      TRAINS_ELASTIC_SERVICE_HOST: elasticsearch
      TRAINS_ELASTIC_SERVICE_PORT: 9200
      TRAINS_MONGODB_SERVICE_HOST: mongo
      TRAINS_MONGODB_SERVICE_PORT: 27017
      TRAINS_REDIS_SERVICE_HOST: redis
      TRAINS_REDIS_SERVICE_PORT: 6379
      TRAINS__apiserver__mongo__pre_populate__enabled: "true"
      TRAINS__apiserver__mongo__pre_populate__zip_file: "/home/orpat/trains/db-pre-populate/export.zip"
    ports:
    - "8008:8008"
    networks:
      - backend

  elasticsearch:
    networks:
      - backend
    container_name: trains-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: trains
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      discovery.zen.minimum_master_nodes: "1"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: trains
      reindex.remote.whitelist: '*.*'
      script.inline: "true"
      script.painless.regex.enabled: "true"
      script.update: "true"
      thread_pool.bulk.queue_size: "2000"
      thread_pool.search.queue_size: "10000"
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.16
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/data/elastic:/usr/share/elasticsearch/data
    ports:
    - "9200:9200"
  fileserver:
    networks:
      - backend
    command:
    - fileserver
    container_name: trains-fileserver
    image: allegroai/trains:latest
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/logs:/var/log/trains
    - /home/orpat/trains/data/fileserver:/mnt/fileserver
    ports:
    - "8081:8081"

  mongo:
    networks:
      - backend
    container_name: trains-mongo
    image: mongo:3.6.5
    restart: unless-stopped
    command: --setParameter internalQueryExecMaxBlockingSortBytes=196100200
    volumes:
    - /home/orpat/trains/data/mongo/db:/data/db
    - /home/orpat/trains/data/mongo/configdb:/data/configdb
    ports:
    - "27017:27017"

  redis:
    networks:
      - backend
    container_name: trains-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/data/redis:/data
    ports:
    - "6379:6379"

  webserver:
    command:
    - webserver
    container_name: trains-webserver
    image: allegroai/trains:latest
    restart: unless-stopped
    volumes:
    - /home/orpat/trains/logs:/var/log/trains
    depends_on:
      - apiserver
    ports:
    - "8080:80"

networks:
  backend:
    driver: bridge

I can't find any /opt/trains/data folder on my machine (it wasn't me who installed the Trains framework). Can you tell where the mounted local data directory is from the docker-compose.yml file?

jkhenning commented 3 years ago

Yeah @BrianG13, according to the docker-compose.yml, everything should be under /home/orpat/trains/data
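To illustrate how to read that off the file: in each entry under volumes, the path left of the colon is on the host machine and the path right of it is inside the container. For example, for the elasticsearch service:

    volumes:
    - /home/orpat/trains/data/elastic:/usr/share/elasticsearch/data
    # host path (on your machine)    :  path inside the container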

BrianG13 commented 3 years ago

Update:

I restarted the Docker containers. Now I can see the experiment logs from the browser, but when I access the 'Scalars' tab, I still get the same error :(

2021-03-06 14:32:03,622 - trains.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error: err=('15 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'f00193b3880869e7f546a55d65b98f67', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863401, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 0', 'value': 578.055079569548, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'db98b3c4f32c97e27d09dd7f74d86ce5', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863402, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 0', 'value': 51.464706368495534, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '1f985d35d0bc3f6ab5d3eafd4a47f30d', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_0', 'value': 7.016663551330566, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '8bd4df956e983dcee6c59e340b82dd6e', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_0', 'value': 3.025986543026301, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '0bcf50b533cb318d83d7e8c9f8deb663', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests 
and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_0', 'value': 4.204140663146973, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '9283001e48e8b06ed89ccf3161bbdfdd', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 1', 'value': 560.4806488044111, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'ef6a60450fba482b9f8a24c590d22cd7', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 1', 'value': 51.1660772014198, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'eb33a4f6ea0d94dfb1288047f328b20b', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_1', 'value': 7.061158180236816, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '0e694d220867c7d379f4b3984d9822cf', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_1', 'value': 3.009880605171476, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '605feadca469847f68383c89cf5523d5', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active 
Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_1', 'value': 4.164150238037109, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '907cee8f52228bf47a42ae56e944b796', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 2', 'value': 615.9076488887576, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '5e464f8c3b9340b39c1b5edcbe2b6574', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 2', 'value': 51.178997305144016, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '113d60664a45cb223d83bac731440fdc', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_2', 'value': 7.006958961486816, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '65d4bae76f2cb906eca5b25efad8976e', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_2', 'value': 3.1044640400248826, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '6a2f9498e76295154db186d6a0ec7266', 'status': 503, 'error': {'type': 
'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_2', 'value': 4.520617485046387, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}]), extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh])>
2021-03-06 14:32:03,623 - trains.Metrics - ERROR - Failed reporting metrics: <500/100: events.add_batch/v1.0 (General data error: err=('15 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'f00193b3880869e7f546a55d65b98f67', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863401, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 0', 'value': 578.055079569548, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'db98b3c4f32c97e27d09dd7f74d86ce5', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863402, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 0', 'value': 51.464706368495534, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '1f985d35d0bc3f6ab5d3eafd4a47f30d', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_0', 'value': 7.016663551330566, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '8bd4df956e983dcee6c59e340b82dd6e', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_0', 'value': 3.025986543026301, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '0bcf50b533cb318d83d7e8c9f8deb663', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] 
requests and a refresh]'}, 'data': {'timestamp': 1615033863458, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_0', 'value': 4.204140663146973, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '9283001e48e8b06ed89ccf3161bbdfdd', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 1', 'value': 560.4806488044111, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'ef6a60450fba482b9f8a24c590d22cd7', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 1', 'value': 51.1660772014198, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'eb33a4f6ea0d94dfb1288047f328b20b', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_1', 'value': 7.061158180236816, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '0e694d220867c7d379f4b3984d9822cf', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_1', 'value': 3.009880605171476, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '605feadca469847f68383c89cf5523d5', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not 
active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863514, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_1', 'value': 4.164150238037109, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '907cee8f52228bf47a42ae56e944b796', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJPE', 'variant': 'View 2', 'value': 615.9076488887576, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '5e464f8c3b9340b39c1b5edcbe2b6574', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Train MPJAE', 'variant': 'View 2', 'value': 51.178997305144016, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '113d60664a45cb223d83bac731440fdc', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_bones_2', 'value': 7.006958961486816, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '65d4bae76f2cb906eca5b25efad8976e', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_angles_2', 'value': 3.1044640400248826, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}, {'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '6a2f9498e76295154db186d6a0ec7266', 'status': 503, 'error': 
{'type': 'unavailable_shards_exception', 'reason': '[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh]'}, 'data': {'timestamp': 1615033863568, 'type': 'training_stats_scalar', 'task': '1c1c13c151cd4dcbb16f0f619baec406', 'iter': 1, 'metric': 'Losses', 'variant': 'loss_quat_unit_regulator_2', 'value': 4.520617485046387, '@timestamp': '2021-03-06T12:31:03.604Z', 'worker': 'momo'}}}]), extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [15] requests and a refresh])>
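One way to inspect the shard state directly (a sketch, assuming ES is reachable on port 9200 as mapped in the docker-compose.yml above):

curl 'http://localhost:9200/_cluster/health?pretty'
curl 'http://localhost:9200/_cat/shards?v'
# available since ES 5.0; explains why a shard is unassigned:
curl 'http://localhost:9200/_cluster/allocation/explain?pretty'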
BrianG13 commented 3 years ago

Is there an option to use ClearML with my old Trains data?

jkhenning commented 3 years ago

Sure 🙂 Which version of Trains Server are you running?

BrianG13 commented 3 years ago

v0.14

jkhenning commented 3 years ago

You will have to follow the migration procedure that is required when upgrading from server versions <= 0.15; see here: https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_es7_migration.html
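Before starting the migration, it is prudent to back up the mounted data and config directories (a sketch; the paths are taken from the docker-compose.yml above, and the archive name is just an example):

# stop all services first
docker-compose down
# archive the mounted data and config directories
sudo tar czf ~/trains-backup.tar.gz -C /home/orpat/trains data config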