allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Migrating data from trains to clearml #70

Open majdzr opened 3 years ago

majdzr commented 3 years ago

Hello,

I tried to upgrade from trains to clearml without much luck. To work around it, I made a backup of the data from trains and installed a fresh clearml-server, which works fine as expected. However, when I try to migrate the trains data (the .tar backup, made following the instructions) into clearml, I run into problems:

  1. If I do it while the clearml containers are up and try to log into the app, it keeps loading forever.
  2. If I do it and then start the clearml containers, it fails with many errors, mainly from Elasticsearch: `clearml-elastic | "stacktrace": ["org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];",`

Any idea how I can save my data?

Btw, I received the same error when trying to upgrade from trains to clearml, even without any data migration.

Thanks in advance.

jkhenning commented 3 years ago

Hi @majdzr ,

> If I do it while the clearml containers are up and try to log into the app, it keeps loading forever.

As a rule, data should never be restored while the related application is up: apart from the obvious read/write conflicts, in this case the server's startup sequence also contains automatic data validation and migration code.
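A minimal sketch of that backup/restore cycle, run against a scratch directory so it is safe to try anywhere. On a real deployment the data directory would be `/opt/clearml` (or `/opt/trains` on the old setup), and the server containers must be stopped (`docker-compose down`) before both the backup and the restore:

```shell
# Illustrative backup/restore round trip on a scratch directory;
# paths and directory names here are stand-ins, not the server's real layout.
workdir=$(mktemp -d)
mkdir -p "$workdir/data/elastic" "$workdir/data/mongo"
echo "experiment-records" > "$workdir/data/elastic/sample"

# 1. Back up the data directory into a tarball (server stopped)
tar czf "$workdir/backup.tgz" -C "$workdir" data

# 2. Simulate the fresh install: the data directory starts out empty
rm -rf "$workdir/data"

# 3. Restore the backup in place; only then start the containers again,
#    so the startup validation/migration code sees the restored data
tar xzf "$workdir/backup.tgz" -C "$workdir"
cat "$workdir/data/elastic/sample"
```

The key point the sketch encodes is ordering: nothing touches the data directory between `down` and `up`.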

> If I do it and then start the clearml containers, it fails with many errors, mainly from Elasticsearch: `clearml-elastic | "stacktrace": ["org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: AccessDeniedException[/usr/share/elasticsearch/data/nodes];"`

This looks like a folder access issue. After restoring your data (and before starting the server again), try:

```shell
sudo chown -R 1000:1000 /opt/clearml
```
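To confirm the fix took effect, you can check the numeric owner of the data tree: entries under `/opt/clearml` should then typically report `1000:1000` (the uid:gid the elasticsearch container process runs as). A sketch on a scratch file, since changing ownership itself needs root:

```shell
# stat prints a file's numeric uid:gid; pointed at /opt/clearml on the
# server, you would expect 1000:1000 after the chown above.
f=$(mktemp)
stat -c '%u:%g' "$f"
```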