montmejat opened this issue 1 year ago
@aurelien-m can you send the `docker-compose.yml` file you're using?
Yes, sure:
```yaml
version: "3.6"
services:
  apiserver:
    command:
      - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
      - /opt/clearml/logs:/var/log/clearml
      - /opt/clearml/config:/opt/clearml/config
      - /opt/clearml/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      CLEARML_SERVER_DEPLOYMENT_TYPE: ${CLEARML_SERVER_DEPLOYMENT_TYPE:-linux}
      CLEARML__apiserver__pre_populate__enabled: "true"
      CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
      CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
      CLEARML__services__async_urls_delete__enabled: "true"
    ports:
      - "8008:8008"
    networks:
      - backend
      - frontend
  elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
      ELASTIC_PASSWORD: ${ELASTIC_PASSWORD}
      bootstrap.memory_lock: "true"
      cluster.name: clearml
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      cluster.routing.allocation.disk.watermark.high: 500mb
      cluster.routing.allocation.disk.watermark.flood_stage: 500mb
      discovery.zen.minimum_master_nodes: "1"
      discovery.type: "single-node"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: clearml
      reindex.remote.whitelist: '*.*'
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:7.16.2
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/elastic_7:/usr/share/elasticsearch/data
      - /usr/share/elasticsearch/logs
  fileserver:
    networks:
      - backend
      - frontend
    command:
      - fileserver
    container_name: clearml-fileserver
    image: allegroai/clearml:latest
    environment:
      CLEARML__fileserver__delete__allow_batch: "true"
    restart: unless-stopped
    volumes:
      - /opt/clearml/logs:/var/log/clearml
      - /opt/clearml/data/fileserver:/mnt/fileserver
      - /opt/clearml/config:/opt/clearml/config
    ports:
      - "8081:8081"
  mongo:
    networks:
      - backend
    container_name: clearml-mongo
    image: mongo:4.4.9
    restart: unless-stopped
    command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200
    volumes:
      - /opt/clearml/data/mongo_4/db:/data/db
      - /opt/clearml/data/mongo_4/configdb:/data/configdb
  redis:
    networks:
      - backend
    container_name: clearml-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/redis:/data
  webserver:
    command:
      - webserver
    container_name: clearml-webserver
    # environment:
    #   CLEARML_SERVER_SUB_PATH: clearml-web  # Allow ClearML to be served with a URL path prefix.
    image: allegroai/clearml:latest
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
      - "8080:80"
    networks:
      - backend
      - frontend
  async_delete:
    depends_on:
      - apiserver
      - redis
      - mongo
      - elasticsearch
      - fileserver
    container_name: async_delete
    image: allegroai/clearml:latest
    networks:
      - backend
    restart: unless-stopped
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      PYTHONPATH: /opt/clearml/apiserver
    entrypoint:
      - python3
      - -m
      - jobs.async_urls_delete
      - --fileserver-host
      - http://fileserver:8081
    volumes:
      - /opt/clearml/logs:/var/log/clearml
  agent-services:
    networks:
      - backend
    container_name: clearml-agent-services
    image: allegroai/clearml-agent-services:latest
    deploy:
      restart_policy:
        condition: on-failure
    privileged: true
    environment:
      CLEARML_HOST_IP: ${CLEARML_HOST_IP}
      CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
      CLEARML_API_HOST: http://apiserver:8008
      CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
      CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}
      CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}
      CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
      CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
      CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:-">=0.17.0"}
      CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      CLEARML_WORKER_ID: "clearml-services"
      CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/clearml/agent:/root/.clearml
    depends_on:
      - apiserver
    entrypoint: >
      bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused 'http://apiserver:8008/debug.ping' && /usr/agent/entrypoint.sh"
networks:
  backend:
    driver: bridge
  frontend:
    driver: bridge
```
I just tested it again, and the files still don't get deleted.
Hi @aurelien-m, can you please send us the logs from the async_delete service? This service is actually responsible for removing the task artifacts from the fileserver. You can dump these logs to a file with: `sudo docker logs async_delete > delete.log 2>&1`
Here are the logs:
```
[2022-10-07 13:12:46,432] [1] [INFO] [clearml.redis_manager] Using override redis host redis
[2022-10-07 13:12:46,432] [1] [INFO] [clearml.redis_manager] Using override redis port 6379
[2022-10-07 13:12:46,442] [1] [INFO] [clearml.database] Initializing database connections
[2022-10-07 13:12:46,442] [1] [INFO] [clearml.database] Using override mongodb host mongo
[2022-10-07 13:12:46,442] [1] [INFO] [clearml.database] Using override mongodb port 27017
[2022-10-07 13:12:46,444] [1] [INFO] [clearml.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
[2022-10-07 13:12:46,445] [1] [INFO] [clearml.database] Registering connection to backend-db (mongodb://mongo:27017/backend)
[2022-10-11 15:37:44,777] [1] [INFO] [clearml.redis_manager] Using override redis host redis
[2022-10-11 15:37:44,974] [1] [INFO] [clearml.redis_manager] Using override redis port 6379
[2022-10-11 15:37:45,104] [1] [INFO] [clearml.database] Initializing database connections
[2022-10-11 15:37:45,105] [1] [INFO] [clearml.database] Using override mongodb host mongo
[2022-10-11 15:37:45,105] [1] [INFO] [clearml.database] Using override mongodb port 27017
[2022-10-11 15:37:45,111] [1] [INFO] [clearml.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
[2022-10-11 15:37:45,118] [1] [INFO] [clearml.database] Registering connection to backend-db (mongodb://mongo:27017/backend)
Traceback (most recent call last):
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1515, in _retryable_read
    read_pref, session, address=address)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1346, in _select_server
    server = topology.select_server(server_selector)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/topology.py", line 246, in select_server
    address))
Loading config from /opt/clearml/apiserver/config/default
Loading config from file /opt/clearml/apiserver/config/default/logging.conf
Loading config from file /opt/clearml/apiserver/config/default/hosts.conf
Loading config from file /opt/clearml/apiserver/config/default/secure.conf
Loading config from file /opt/clearml/apiserver/config/default/apiserver.conf
Loading config from file /opt/clearml/apiserver/config/default/services/events.conf
Loading config from file /opt/clearml/apiserver/config/default/services/organization.conf
Loading config from file /opt/clearml/apiserver/config/default/services/auth.conf
Loading config from file /opt/clearml/apiserver/config/default/services/tasks.conf
Loading config from file /opt/clearml/apiserver/config/default/services/_mongo.conf
Loading config from file /opt/clearml/apiserver/config/default/services/queues.conf
Loading config from file /opt/clearml/apiserver/config/default/services/models.conf
Loading config from file /opt/clearml/apiserver/config/default/services/async_urls_delete.conf
Loading config from file /opt/clearml/apiserver/config/default/services/projects.conf
Loading config from /opt/clearml/config
  File "/usr/local/lib64/python3.6/site-packages/pymongo/topology.py", line 203, in select_servers
    selector, server_timeout, address)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/topology.py", line 220, in _select_servers_loop
    (self._error_message(selector), timeout, self.description))
pymongo.errors.ServerSelectionTimeoutError: mongo:27017: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 63458dc90cf4f6e02f9aa981, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('mongo:27017: [Errno 111] Connection refused',)>]>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/clearml/apiserver/jobs/async_urls_delete.py", line 174, in <module>
    main()
  File "/opt/clearml/apiserver/jobs/async_urls_delete.py", line 170, in main
    run_delete_loop(args.fileserver_host)
  File "/opt/clearml/apiserver/jobs/async_urls_delete.py", line 142, in run_delete_loop
    ).order_by("retry_count").limit(1).first()
  File "/usr/local/lib/python3.6/site-packages/mongoengine/queryset/base.py", line 290, in first
    result = queryset[0]
  File "/usr/local/lib/python3.6/site-packages/mongoengine/queryset/base.py", line 200, in __getitem__
    queryset._cursor[key],
  File "/usr/local/lib64/python3.6/site-packages/pymongo/cursor.py", line 692, in __getitem__
    for doc in clone:
  File "/usr/local/lib64/python3.6/site-packages/pymongo/cursor.py", line 1238, in next
    if len(self.__data) or self._refresh():
  File "/usr/local/lib64/python3.6/site-packages/pymongo/cursor.py", line 1155, in _refresh
    self.__send_message(q)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/cursor.py", line 1045, in __send_message
    operation, self._unpack_response, address=self.__address)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1426, in _run_operation
    address=address, retryable=isinstance(operation, message._Query))
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1531, in _retryable_read
    raise last_error
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1525, in _retryable_read
    return func(session, server, sock_info, secondary_ok)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/mongo_client.py", line 1422, in _cmd
    unpack_res)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/server.py", line 114, in run_operation
    reply = sock_info.receive_message(request_id)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/pool.py", line 753, in receive_message
    self._raise_connection_failure(error)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/pool.py", line 929, in _raise_connection_failure
    _raise_connection_failure(self.address, error)
  File "/usr/local/lib64/python3.6/site-packages/pymongo/pool.py", line 247, in _raise_connection_failure
    raise AutoReconnect(msg)
pymongo.errors.AutoReconnect: mongo:27017: [Errno 104] Connection reset by peer
[2022-10-13 20:49:40,033] [1] [INFO] [clearml.redis_manager] Using override redis host redis
[2022-10-13 20:49:40,033] [1] [INFO] [clearml.redis_manager] Using override redis port 6379
[2022-10-13 20:49:40,066] [1] [INFO] [clearml.database] Initializing database connections
[2022-10-13 20:49:40,066] [1] [INFO] [clearml.database] Using override mongodb host mongo
[2022-10-13 20:49:40,067] [1] [INFO] [clearml.database] Using override mongodb port 27017
[2022-10-13 20:49:40,069] [1] [INFO] [clearml.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
[2022-10-13 20:49:40,073] [1] [INFO] [clearml.database] Registering connection to backend-db (mongodb://mongo:27017/backend)
```
The file is actually pretty small; that's all of it above. I feel like it's missing some logs, like from today, when I tried deleting a dataset, for example.
@aurelien-m if you run `sudo docker ps`, what is the status of the async_delete service? Is it running? For how long?
Apparently it's running:
```
$ sudo docker ps
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED       STATUS      PORTS                                                              NAMES
c15098132d96   allegroai/clearml:latest                               "python3 -m jobs.asy…"   10 days ago   Up 4 days   8008/tcp, 8080-8081/tcp                                            async_delete
ceafe5efc5d1   allegroai/clearml:latest                               "/opt/clearml/wrappe…"   10 days ago   Up 6 days   8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp, :::8080->80/tcp     clearml-webserver
74cdd27e5347   allegroai/clearml:latest                               "/opt/clearml/wrappe…"   10 days ago   Up 6 days   0.0.0.0:8008->8008/tcp, :::8008->8008/tcp, 8080-8081/tcp           clearml-apiserver
3fe403e3be81   redis:5.0                                              "docker-entrypoint.s…"   10 days ago   Up 6 days   6379/tcp                                                           clearml-redis
02f50681cc6d   docker.elastic.co/elasticsearch/elasticsearch:7.16.2   "/bin/tini -- /usr/l…"   10 days ago   Up 6 days   9200/tcp, 9300/tcp                                                 clearml-elastic
7886919c20bf   mongo:4.4.9                                            "docker-entrypoint.s…"   10 days ago   Up 4 days   27017/tcp                                                          clearml-mongo
c9315497b44b   allegroai/clearml:latest                               "/opt/clearml/wrappe…"   10 days ago   Up 6 days   8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp, :::8081->8081/tcp      clearml-fileserver
```
@aurelien-m It seems that the service is running OK. Another possibility is that the dataset URLs are not recognized as valid links for deletion from the fileserver. Can you please share what the link to the dataset file looks like in the UI?
Sure, I've changed the project's name, but it's very close to what you see here:
```
http://172.16.3.178:8081/Project-Name/.datasets/ProjectName%20-%20The%20Dataset%20Name%20-%20Training/ProjectName%20-%20The%20Dataset%20Name%20-%20Training.5eecfe5ebc8349ff866c8fda9223c2b7/artifacts/data/dataset.5eecfe5ebc8349ff866c8fda9223c2b7.l6u3x6m3.zip
```
I have spaces in the name; could that be the issue?
Thanks, I think that we've identified the issue. Let me check what can be done.
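For context, this is an illustrative sketch rather than ClearML's actual deletion code: names containing spaces are percent-encoded when the fileserver URL is built, so any logic that compares stored artifact paths against URLs has to account for the encoding on one side or the other. Python's standard library shows the round trip (the dataset name below mirrors the anonymized link above):

```python
from urllib.parse import quote, unquote

# Hypothetical dataset name with spaces, as in the URL above
name = "ProjectName - The Dataset Name - Training"

# When embedded in a URL path segment, spaces become %20
encoded = quote(name)
print(encoded)  # ProjectName%20-%20The%20Dataset%20Name%20-%20Training

# Comparing the raw URL segment with the decoded name fails unless
# one side is encoded/decoded first
assert encoded != name
assert unquote(encoded) == name
```

This is only meant to illustrate why a URL-matching step could silently skip paths with spaces, which is consistent with the fix landing in a server release rather than in user configuration.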
Hey @evg-allegro! I'm just checking if there's anything that can be done about it? :smile: I still have to delete them manually and through the UI
@aurelien-m,
This will be a part of the next server release, planned later this week or early next week :smile:
Awesome!
Hey @aurelien-m ! Just letting you know that this issue has been resolved in clearml-server v.1.8.0. Let us know if there are any issues :)
@pollfly we cannot even delete datasets via the UI; we had to delete them directly in the filesystem first, then in the UI.
clearml-server v.1.9.2, same location:
`/opt/clearml/data/fileserver/project/.dataset/....`
Hi @ColdTeapot273K , can you please share the following info:
Proposal Summary
I'm saving my ClearML datasets directly to the file server (I'm not using AWS, Azure, or any other cloud storage, and I'm self-hosted). This is very handy for us because we can easily download them from ClearML, and it's faster over our local network than going through the cloud. However, deleting a dataset through the UI doesn't remove the files located in
`/opt/clearml/data/fileserver/project/.dataset/...`.
Motivation
We have several hundred GB of data, and this will cause our storage to fill up pretty quickly. Manually deleting the datasets is a bit troublesome, and I would like to avoid that.
Related Discussion
Initial Slack thread: https://clearml.slack.com/archives/CTK20V944/p1665736071874939.