allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Remove datasets from ClearML's file server when deleting them through the UI #801

Open montmejat opened 1 year ago

montmejat commented 1 year ago

Proposal Summary

I'm saving my ClearML datasets directly to the file server (I'm not using AWS, Azure, or any other cloud storage; we're self-hosted). This is very handy for us because we can easily download them from ClearML, and it's faster over our local network than over the cloud. However, deleting a dataset through the UI doesn't remove the files located in /opt/clearml/data/fileserver/project/.dataset/....
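
For reference, this is roughly how our datasets end up on the fileserver (a minimal sketch; the project and path names are placeholders):

from clearml import Dataset

# Create a new dataset version; on a self-hosted server the upload goes to
# the fileserver by default, so no cloud credentials are involved.
ds = Dataset.create(dataset_project="Project-Name", dataset_name="My Dataset")
ds.add_files(path="local_data/")
ds.upload()    # zips the files and pushes them to http://<server>:8081
ds.finalize()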

Motivation

We have several hundred GB of data, so this will fill up our storage pretty quickly. Manually deleting the datasets is a bit troublesome, and I would like to avoid that.
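
For illustration, the cleanup currently takes two steps (a rough sketch; the dataset id and folder name are placeholders): deleting the dataset entry via the SDK or UI, then removing the leftover folder by hand:

from clearml import Dataset
import shutil

# Step 1: remove the dataset entry from the ClearML server.
Dataset.delete(dataset_id="<dataset-id>", force=True)

# Step 2: remove the files that step 1 leaves behind on the fileserver.
shutil.rmtree("/opt/clearml/data/fileserver/Project-Name/.datasets/<dataset-folder>")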

Related Discussion

Initial slack thread: https://clearml.slack.com/archives/CTK20V944/p1665736071874939.

jkhenning commented 1 year ago

@aurelien-m can you send the docker-compose.yml file you're using?

montmejat commented 1 year ago

Yes sure:

version: "3.6"
services:

  apiserver:
    command:
    - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - /opt/clearml/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      CLEARML_SERVER_DEPLOYMENT_TYPE: ${CLEARML_SERVER_DEPLOYMENT_TYPE:-linux}
      CLEARML__apiserver__pre_populate__enabled: "true"
      CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
      CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
      CLEARML__services__async_urls_delete__enabled: "true"
    ports:
    - "8008:8008"
    networks:
      - backend
      - frontend

  elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
      ELASTIC_PASSWORD: ${ELASTIC_PASSWORD}
      bootstrap.memory_lock: "true"
      cluster.name: clearml
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      cluster.routing.allocation.disk.watermark.high: 500mb
      cluster.routing.allocation.disk.watermark.flood_stage: 500mb
      discovery.zen.minimum_master_nodes: "1"
      discovery.type: "single-node"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: clearml
      reindex.remote.whitelist: '*.*'
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:7.16.2
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/elastic_7:/usr/share/elasticsearch/data
      - /usr/share/elasticsearch/logs

  fileserver:
    networks:
      - backend
      - frontend
    command:
    - fileserver
    container_name: clearml-fileserver
    image: allegroai/clearml:latest
    environment:
      CLEARML__fileserver__delete__allow_batch: "true"
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
    ports:
    - "8081:8081"

  mongo:
    networks:
      - backend
    container_name: clearml-mongo
    image: mongo:4.4.9
    restart: unless-stopped
    command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200
    volumes:
    - /opt/clearml/data/mongo_4/db:/data/db
    - /opt/clearml/data/mongo_4/configdb:/data/configdb

  redis:
    networks:
      - backend
    container_name: clearml-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
    - /opt/clearml/data/redis:/data

  webserver:
    command:
    - webserver
    container_name: clearml-webserver
    # environment:
    #  CLEARML_SERVER_SUB_PATH : clearml-web # Allow Clearml to be served with a URL path prefix.
    image: allegroai/clearml:latest
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
    - "8080:80"
    networks:
      - backend
      - frontend

  async_delete:
    depends_on:
      - apiserver
      - redis
      - mongo
      - elasticsearch
      - fileserver
    container_name: async_delete
    image: allegroai/clearml:latest
    networks:
      - backend
    restart: unless-stopped
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      PYTHONPATH: /opt/clearml/apiserver
    entrypoint:
      - python3
      - -m
      - jobs.async_urls_delete
      - --fileserver-host
      - http://fileserver:8081
    volumes:
      - /opt/clearml/logs:/var/log/clearml

  agent-services:
    networks:
      - backend
    container_name: clearml-agent-services
    image: allegroai/clearml-agent-services:latest
    deploy:
      restart_policy:
        condition: on-failure
    privileged: true
    environment:
      CLEARML_HOST_IP: ${CLEARML_HOST_IP}
      CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
      CLEARML_API_HOST: http://apiserver:8008
      CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
      CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}
      CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}
      CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
      CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
      CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:-">=0.17.0"}
      CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      CLEARML_WORKER_ID: "clearml-services"
      CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/clearml/agent:/root/.clearml
    depends_on:
      - apiserver
    entrypoint: >
      bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused 'http://apiserver:8008/debug.ping' && /usr/agent/entrypoint.sh"

networks:
  backend:
    driver: bridge
  frontend:
    driver: bridge

I just tested it again, and the files still don't get deleted.

evg-allegro commented 1 year ago

Hi @aurelien-m, can you please send us the logs from the async_delete service? This service is responsible for removing the task artifacts from the fileserver. You can get these logs into a file by running: sudo docker logs async_delete > delete.log 2>&1

montmejat commented 1 year ago

Here are the logs:

[2022-10-07 13:12:46,432] [1] [INFO] [clearml.redis_manager] Using override redis host redis
[2022-10-07 13:12:46,432] [1] [INFO] [clearml.redis_manager] Using override redis port 6379
[2022-10-07 13:12:46,442] [1] [INFO] [clearml.database] Initializing database connections
[2022-10-07 13:12:46,442] [1] [INFO] [clearml.database] Using override mongodb host mongo
[2022-10-07 13:12:46,442] [1] [INFO] [clearml.database] Using override mongodb port 27017
[2022-10-07 13:12:46,444] [1] [INFO] [clearml.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
[2022-10-07 13:12:46,445] [1] [INFO] [clearml.database] Registering connection to backend-db (mongodb://mongo:27017/backend)
[2022-10-11 15:37:44,777] [1] [INFO] [clearml.redis_manager] Using override redis host redis
[2022-10-11 15:37:44,974] [1] [INFO] [clearml.redis_manager] Using override redis port 6379
[2022-10-11 15:37:45,104] [1] [INFO] [clearml.database] Initializing database connections
[2022-10-11 15:37:45,105] [1] [INFO] [clearml.database] Using override mongodb host mongo
[2022-10-11 15:37:45,105] [1] [INFO] [clearml.database] Using override mongodb port 27017
[2022-10-11 15:37:45,111] [1] [INFO] [clearml.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
[2022-10-11 15:37:45,118] [1] [INFO] [clearml.database] Registering connection to backend-db (mongodb://mongo:27017/backend)
Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1515, in _retryable_read
    read_pref, session, address=address)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1346, in _select_server
    server = topology.select_server(server_selector)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/topology.py", line 246, in select_server
    address))
Loading config from /opt/clearml/apiserver/config/default
Loading config from file /opt/clearml/apiserver/config/default/logging.conf
Loading config from file /opt/clearml/apiserver/config/default/hosts.conf
Loading config from file /opt/clearml/apiserver/config/default/secure.conf
Loading config from file /opt/clearml/apiserver/config/default/apiserver.conf
Loading config from file /opt/clearml/apiserver/config/default/services/events.conf
Loading config from file /opt/clearml/apiserver/config/default/services/organization.conf
Loading config from file /opt/clearml/apiserver/config/default/services/auth.conf
Loading config from file /opt/clearml/apiserver/config/default/services/tasks.conf
Loading config from file /opt/clearml/apiserver/config/default/services/_mongo.conf
Loading config from file /opt/clearml/apiserver/config/default/services/queues.conf
Loading config from file /opt/clearml/apiserver/config/default/services/models.conf
Loading config from file /opt/clearml/apiserver/config/default/services/async_urls_delete.conf
Loading config from file /opt/clearml/apiserver/config/default/services/projects.conf
Loading config from /opt/clearml/config
  File "/usr/local/lib64/python3.6/site-packages/pymongo/topology.py", line 203, in select_servers
    selector, server_timeout, address)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/topology.py", line 220, in _select_servers_loop
    (self._error_message(selector), timeout, self.description))
pymongo.errors.ServerSelectionTimeoutError: mongo:27017: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 63458dc90cf4f6e02f9aa981, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('mongo:27017: [Errno 111] Connection refused',)>]>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/clearml/apiserver/jobs/async_urls_delete.py", line 174, in <module>
    main()
  File "/opt/clearml/apiserver/jobs/async_urls_delete.py", line 170, in main
    run_delete_loop(args.fileserver_host)
  File "/opt/clearml/apiserver/jobs/async_urls_delete.py", line 142, in run_delete_loop
    ).order_by("retry_count").limit(1).first()
  File "/usr/local/lib/python3.6/site-packages/mongoengine/queryset/base.py", line 290, in first
    result = queryset[0]
  File "/usr/local/lib/python3.6/site-packages/mongoengine/queryset/base.py", line 200, in __getitem__
    queryset._cursor[key],
  File "/usr/local/lib64/python3.6/site-packages/pymongo/cursor.py", line 692, in __getitem__
    for doc in clone:
  File "/usr/local/lib64/python3.6/site-packages/pymongo/cursor.py", line 1238, in next
    if len(self.__data) or self._refresh():
  File "/usr/local/lib64/python3.6/site-packages/pymongo/cursor.py", line 1155, in _refresh
    self.__send_message(q)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/cursor.py", line 1045, in __send_message
    operation, self._unpack_response, address=self.__address)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1426, in _run_operation
    address=address, retryable=isinstance(operation, message._Query))
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1531, in _retryable_read
    raise last_error
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1525, in _retryable_read
    return func(session, server, sock_info, secondary_ok)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1422, in _cmd
    unpack_res)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/server.py", line 114, in run_operation
    reply = sock_info.receive_message(request_id)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/pool.py", line 753, in receive_message
    self._raise_connection_failure(error)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/pool.py", line 929, in _raise_connection_failure
    _raise_connection_failure(self.address, error)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/pool.py", line 247, in _raise_connection_failure
    raise AutoReconnect(msg)
pymongo.errors.AutoReconnect: mongo:27017: [Errno 104] Connection reset by peer
[2022-10-13 20:49:40,033] [1] [INFO] [clearml.redis_manager] Using override redis host redis
[2022-10-13 20:49:40,033] [1] [INFO] [clearml.redis_manager] Using override redis port 6379
[2022-10-13 20:49:40,066] [1] [INFO] [clearml.database] Initializing database connections
[2022-10-13 20:49:40,066] [1] [INFO] [clearml.database] Using override mongodb host mongo
[2022-10-13 20:49:40,067] [1] [INFO] [clearml.database] Using override mongodb port 27017
[2022-10-13 20:49:40,069] [1] [INFO] [clearml.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
[2022-10-13 20:49:40,073] [1] [INFO] [clearml.database] Registering connection to backend-db (mongodb://mongo:27017/backend)

The file is actually pretty small; that's all of it above. I feel like some logs are missing, for example from today, when I tried deleting a dataset.

evg-allegro commented 1 year ago

@aurelien-m if you run sudo docker ps, what is the status of the async_delete service? Is it running? For how long?

montmejat commented 1 year ago

Apparently it's running:

$ sudo docker ps
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED       STATUS      PORTS                                                            NAMES
c15098132d96   allegroai/clearml:latest                               "python3 -m jobs.asy…"   10 days ago   Up 4 days   8008/tcp, 8080-8081/tcp                                          async_delete
ceafe5efc5d1   allegroai/clearml:latest                               "/opt/clearml/wrappe…"   10 days ago   Up 6 days   8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp, :::8080->80/tcp   clearml-webserver
74cdd27e5347   allegroai/clearml:latest                               "/opt/clearml/wrappe…"   10 days ago   Up 6 days   0.0.0.0:8008->8008/tcp, :::8008->8008/tcp, 8080-8081/tcp         clearml-apiserver
3fe403e3be81   redis:5.0                                              "docker-entrypoint.s…"   10 days ago   Up 6 days   6379/tcp                                                         clearml-redis
02f50681cc6d   docker.elastic.co/elasticsearch/elasticsearch:7.16.2   "/bin/tini -- /usr/l…"   10 days ago   Up 6 days   9200/tcp, 9300/tcp                                               clearml-elastic
7886919c20bf   mongo:4.4.9                                            "docker-entrypoint.s…"   10 days ago   Up 4 days   27017/tcp                                                        clearml-mongo
c9315497b44b   allegroai/clearml:latest                               "/opt/clearml/wrappe…"   10 days ago   Up 6 days   8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp, :::8081->8081/tcp    clearml-fileserver

evg-allegro commented 1 year ago

@aurelien-m It seems that the service is running OK. Another possible reason is that the dataset URLs are not recognized as valid links for deletion from the fileserver. Can you please share what the link to a dataset file looks like in the UI?

montmejat commented 1 year ago

Sure, I've changed the project's name, but it's very close to what you see here:

http://172.16.3.178:8081/Project-Name/.datasets/ProjectName%20-%20The%20Dataset%20Name%20-%20Training/ProjectName%20-%20The%20Dataset%20Name%20-%20Training.5eecfe5ebc8349ff866c8fda9223c2b7/artifacts/data/dataset.5eecfe5ebc8349ff866c8fda9223c2b7.l6u3x6m3.zip

I have spaces in the name; could that be the issue?
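
A quick sketch of what I suspect is going on (just a guess on my side): the URL carries the spaces percent-encoded, while the directory on disk uses literal spaces, so matching the encoded path against the filesystem would fail:

from urllib.parse import unquote, urlparse

url = ("http://172.16.3.178:8081/Project-Name/.datasets/"
       "ProjectName%20-%20The%20Dataset%20Name%20-%20Training/"
       "ProjectName%20-%20The%20Dataset%20Name%20-%20Training"
       ".5eecfe5ebc8349ff866c8fda9223c2b7/artifacts/data/"
       "dataset.5eecfe5ebc8349ff866c8fda9223c2b7.l6u3x6m3.zip")

encoded = urlparse(url).path  # keeps the %20 escapes
decoded = unquote(encoded)    # '/Project-Name/.datasets/ProjectName - The Dataset Name - Training/...'

# A delete job that joins the encoded path onto the storage root would look
# for a directory literally named 'ProjectName%20-%20...', which doesn't exist.
print(encoded == decoded)     # False whenever the name contains spaces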

evg-allegro commented 1 year ago

Thanks, I think we've identified the issue. Let me check what can be done.

montmejat commented 1 year ago

Hey @evg-allegro! I'm just checking whether there's anything that can be done about this :smile: I still have to delete the datasets through the UI and then remove the files manually.

erezalg commented 1 year ago

@aurelien-m,

This will be part of the next server release, planned for later this week or early next week :smile:

montmejat commented 1 year ago

Awesome!

pollfly commented 1 year ago

Hey @aurelien-m ! Just letting you know that this issue has been resolved in clearml-server v.1.8.0. Let us know if there are any issues :)

ColdTeapot273K commented 1 year ago

@pollfly we cannot even delete datasets via the UI; we had to delete them directly in the filesystem, then in the UI.

clearml-server v.1.9.2

Same location: /opt/clearml/data/fileserver/project/.dataset/....

(screenshot attached)

evg-allegro commented 1 year ago

Hi @ColdTeapot273K, can you please share the following info:

  1. The error response from the server when deleting the dataset. You can get it from the Network tab in the browser's developer tools (F12 in Chrome)
  2. An extract from the apiserver log containing the error: sudo docker logs -n 10000 clearml-apiserver > api.log 2>&1
  3. The async_delete service log: sudo docker logs async_delete > delete.log 2>&1