allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

How do I connect a non-AWS S3 bucket? #219

Closed: aimakhotka closed this issue 7 months ago

aimakhotka commented 8 months ago

I have a self-hosted ClearML Server, and I've been trying for a few days to configure it to use my cloud provider's S3 bucket, but it keeps defaulting to AWS unless I explicitly pass output_uri in the Python script.

In the Web UI I filled in the Web App Cloud Access settings: bucket, key, secret, AWS region, and endpoint. I created a test_cloud project and set its default artifact storage location to s3://bucket-name. I have also edited the client-side configuration; the latest version is below.

The S3 credentials I have:

```
S3 endpoint       - https://n-ws-hk0m2-pd11.s3pd11.sbercloud.ru
S3 region         - n-ws-hk0m2-pd11
S3 bucket name    - b-ws-hk0m2-pd11-r87
S3 access key ID  - ***
S3 security key   - ***
access address (not sure what this is for) - b-ws-hk0m2-pd11-r87.b1.s3.sbercloud.ru
```

My clearml.conf

```
# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: http://localhost:8008
    web_server: http://localhost:8080
    files_server: https://b-ws-hk0m2-pd11-r87.b1.s3.sbercloud.ru:443/

    # Credentials are generated using the webapp, http://62.113.97.251:8080/settings
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "***", "secret_key": "***"}
}
sdk {
    # ClearML - default SDK configuration

    storage {
        cache {
            # Defaults to /clearml_cache
            default_base_dir: "~/.clearml/cache"
            # default_cache_manager_size: 100
        }

        direct_access: [
            # Objects matching are considered to be available for direct access, i.e. they will not be downloaded
            # or cached, and any download request will return a direct reference.
            # Objects are specified in glob format, available for url and content_type.
            { url: "file://*" }  # file-urls are always directly referenced
        ]
    }

    metrics {
        # History size for debug files per metric/variant. For each metric/variant combination with an attached file
        # (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
        # X files are stored in the upload destination for each metric/variant combination.
        file_history_size: 100

        # Max history size for matplotlib imshow files per plot title.
        # File names for the uploaded images will be recycled in such a way that no more than
        # X images are stored in the upload destination for each matplotlib plot title.
        matplotlib_untitled_history_size: 100

        # Limit the number of digits after the dot in plot reporting (reducing plot report size)
        # plot_max_num_digits: 5

        # Settings for generated debug images
        images {
            format: JPEG
            quality: 87
            subsampling: 0
        }

        # Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
        tensorboard_single_series_per_graph: false
    }

    network {
        # Number of retries before failing to upload file
        file_upload_retries: 3

        metrics {
            # Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
            # a specific iteration
            file_upload_threads: 4

            # Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
            # being sent for upload
            file_upload_starvation_warning_sec: 120
        }

        iteration {
            # Max number of retries when getting frames if the server returned an error (http code 500)
            max_retries_on_server_error: 5

            # Backoff factory for consecutive retry attempts.
            # SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
            retry_backoff_factor_sec: 10
        }
    }

    aws {
        s3 {
            # S3 credentials, used for read/write access by various SDK elements
            # The following settings will be used for any bucket not specified below in the "credentials" section
            # ---------------------------------------------------------------------------------------------------
            key: "***"
            secret: "***"
            region: "n-ws-hk0m2-pd11"
            # Or enable credentials chain to let Boto3 pick the right credentials.
            # This includes picking credentials from environment variables,
            # credential file and IAM role using metadata service.
            # Refer to the latest Boto3 docs
            use_credentials_chain: true
            # Additional ExtraArgs passed to boto3 when uploading files. Can also be set per-bucket under "credentials".
            extra_args: { }
            # ---------------------------------------------------------------------------------------------------
            credentials: [
                {
                    # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
                    host: "n-ws-hk0m2-pd11.s3pd11.sbercloud.ru:443"
                    # Specify explicit keys
                    bucket: "b-ws-hk0m2-pd11-r87"
                    multipart: false
                    secure: true
                }
            ]
        }
        boto3 {
            pool_connections: 512
            max_multipart_concurrency: 16
            multipart_threshold: 8388608 # 8MB
            multipart_chunksize: 8388608 # 8MB
        }
    }

    google.storage {
        # # Default project and credentials file
        # # Will be used when no bucket configuration is found
        # project: "clearml"
        # credentials_json: "/path/to/credentials.json"
        # pool_connections: 512
        # pool_maxsize: 1024

        # # Specific credentials per bucket and sub directory
        # credentials = [
        #     {
        #         bucket: "my-bucket"
        #         subdir: "path/in/bucket"  # Not required
        #         project: "clearml"
        #         credentials_json: "/path/to/credentials.json"
        #     },
        # ]
    }

    azure.storage {
        # max_connections: 2

        # containers: [
        #     {
        #         account_name: "clearml"
        #         account_key: "secret"
        #         # container_name:
        #     }
        # ]
    }

    log {
        # debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
        null_log_propagate: false
        task_log_buffer_capacity: 66

        # disable urllib info and lower levels
        disable_urllib3_info: true
    }

    development {
        # Development-mode options

        # dev task reuse window
        task_reuse_time_window_in_hours: 72.0

        # Run VCS repository detection asynchronously
        vcs_repo_detect_async: true

        # Store uncommitted git/hg source code diff in experiment manifest when training in development mode
        # This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
        store_uncommitted_code_diff: true

        # Support stopping an experiment in case it was externally stopped, status was changed or task was reset
        support_stopping: true

        # Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
        default_output_uri: "https://n-ws-hk0m2-pd11.s3pd11.sbercloud.ru/b-ws-hk0m2-pd11-r87/test_clearml_s3_artifacts/"

        # Default auto generated requirements optimize for smaller requirements
        # If True, analyze the entire repository regardless of the entry point.
        # If False, first analyze the entry point script, if it does not contain other to local files,
        # do not analyze the entire repository.
        force_analyze_entire_repo: false

        # If set to true, *clearml* update message will not be printed to the console
        # this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
        suppress_update_message: false

        # If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
        detect_with_pip_freeze: false

        # Log specific environment variables. OS environments are listed in the "Environment" section
        # of the Hyper-Parameters.
        # multiple selected variables are supported including the suffix '*'.
        # For example: "AWS_*" will log any OS environment variable starting with 'AWS_'.
        # This value can be overwritten with os environment variable CLEARML_LOG_ENVIRONMENT="[AWS_*, CUDA_VERSION]"
        # Example: log_os_environments: ["AWS_*", "CUDA_VERSION"]
        log_os_environments: []

        # Development mode worker
        worker {
            # Status report period in seconds
            report_period_sec: 2

            # The number of events to report
            report_event_flush_threshold: 100

            # ping to the server - check connectivity
            ping_period_sec: 30

            # Log all stdout & stderr
            log_stdout: true

            # Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend
            # Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds
            console_cr_flush_period: 10

            # compatibility feature, report memory usage for the entire machine
            # default (false), report only on the running process and its sub-processes
            report_global_mem_used: false

            # if provided, start resource reporting after this amount of seconds
            #report_start_sec: 30
        }
    }

    # Apply top-level environment section from configuration into os.environ
    apply_environment: false
    # Top-level environment section is in the form of:
    #   environment {
    #     key: value
    #     ...
    #   }
    # and is applied to the OS environment as `key=value` for each key/value pair

    # Apply top-level files section from configuration into local file system
    apply_files: false
    # Top-level files section allows auto-generating files at designated paths with a predefined contents
    # and target format. Options include:
    #   contents: the target file's content, typically a string (or any base type int/float/list/dict etc.)
    #   format: a custom format for the contents. Currently supported value is `base64` to automatically decode a
    #           base64-encoded contents string, otherwise ignored
    #   path: the target file's path, may include ~ and inplace env vars
    #   target_format: format used to encode contents before writing into the target file. Supported values are json,
    #                  yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
    #   overwrite: overwrite the target file in case it exists. Default is true.
    #
    # Example:
    #   files {
    #     myfile1 {
    #       contents: "The quick brown fox jumped over the lazy dog"
    #       path: "/tmp/fox.txt"
    #     }
    #     myjsonfile {
    #       contents: {
    #         some {
    #           nested {
    #             value: [1, 2, 3, 4]
    #           }
    #         }
    #       }
    #       path: "/tmp/test.json"
    #       target_format: json
    #     }
    #   }
}
```

I am trying to log artifacts with this script:

```python
import os
from time import sleep

import pandas as pd
import numpy as np
from PIL import Image

from clearml import Task


def main():
    # Connecting ClearML with the current process,
    # from here on everything is logged automatically
    task = Task.init(project_name='test_cloud', task_name='upload_artifacts')

    df = pd.DataFrame(
        {
            'num_legs': [2, 4, 8, 0],
            'num_wings': [2, 0, 0, 0],
            'num_specimen_seen': [10, 2, 1, 8]
        },
        index=['falcon', 'dog', 'spider', 'fish']
    )

    # Register Pandas object as artifact to watch
    # (it will be monitored in the background and automatically synced and uploaded)
    task.register_artifact('train', df, metadata={'counting': 'legs', 'max legs': 69})

    # change the artifact object
    df.sample(frac=0.5, replace=True, random_state=1)
    # or access it from anywhere using the Task's get_registered_artifacts()
    Task.current_task().get_registered_artifacts()['train'].sample(frac=0.5, replace=True, random_state=1)

    # add and upload pandas.DataFrame (onetime snapshot of the object)
    task.upload_artifact('Pandas', artifact_object=df)
    # add and upload local file artifact
    task.upload_artifact('local file', artifact_object=os.path.join('data_samples', 'dancing.jpg'))
    # add and upload dictionary stored as JSON
    task.upload_artifact('dictionary', df.to_dict())
    # add and upload Numpy Object (stored as .npz file)
    task.upload_artifact('Numpy Eye', np.eye(100, 100))
    # add and upload Image (stored as .png file)
    im = Image.open(os.path.join('data_samples', 'dancing.jpg'))
    task.upload_artifact('pillow_image', im)
    # add and upload a folder, artifact_object should be the folder path
    task.upload_artifact('local folder', artifact_object=os.path.join('data_samples'))
    # add and upload a wildcard
    task.upload_artifact('wildcard jpegs', artifact_object=os.path.join('data_samples', '*.jpg'))

    # do something here
    sleep(1.)
    print(df)

    # we are done
    print('Done')


if __name__ == '__main__':
    main()
```

Here is the error I get:

```
ClearML Task: created new task id=13a09dd58e9d44c38be9481215ff692b
2023-11-03 10:31:14,587 - clearml.storage - ERROR - Failed uploading: Could not connect to the endpoint URL: "https://b-ws-hk0m2-pd11-r87.s3.n-ws-hk0m2-pd11.amazonaws.com/.clearml.91adbccb-92b3-4946-a9e4-f4d9b8ba4ba4.test"
Traceback (most recent call last):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/httpsession.py", line 464, in send
    urllib_response = conn.urlopen(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: : Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/storage/helper.py", line 2741, in check_write_permissions
    self.delete(path=dest_path)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/storage/helper.py", line 2726, in delete
    return self._driver.delete_object(self.get_object(path))
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/storage/helper.py", line 599, in delete_object
    object.delete()
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/boto3/resources/factory.py", line 580, in do_action
    response = action(self, *args, **kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/client.py", line 963, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/client.py", line 986, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 202, in _send_request
    while self._needs_retry(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 354, in _needs_retry
    responses = self._event_emitter.emit(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 207, in __call__
    if self._checker(**checker_kwargs):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 284, in __call__
    should_retry = self._should_retry(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 320, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 363, in __call__
    checker_response = checker(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 281, in _do_get_response
    http_response = self._send(request)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 377, in _send
    return self.http_session.send(request)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/httpsession.py", line 493, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://b-ws-hk0m2-pd11-r87.s3.n-ws-hk0m2-pd11.amazonaws.com/.clearml.91adbccb-92b3-4946-a9e4-f4d9b8ba4ba4.test"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aimakhotka/Documents/github/sber/clearml_doc/src/files/artifacts.py", line 56, in <module>
    main()
  File "/home/aimakhotka/Documents/github/sber/clearml_doc/src/files/artifacts.py", line 13, in main
    task = Task.init(project_name='test_cloud', task_name='upload_artifacts')
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/task.py", line 593, in init
    task.output_uri = task.get_project_object().default_output_destination
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/task.py", line 1124, in output_uri
    helper.check_write_permissions(value)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/storage/helper.py", line 2743, in check_write_permissions
    raise ValueError("Insufficient permissions (delete failed) for {}".format(base_url))
ValueError: Insufficient permissions (delete failed) for s3://b-ws-hk0m2-pd11-r87
```

At the same time, if I explicitly specify `output_uri='s3://b-ws-hk0m2-pd11-r87/b-ws-hk0m2-pd11-r87/'` as described in the FAQ, the same error occurs - it still tries to find the specified bucket on AWS:

Output

```
ClearML Task: created new task id=67eb2a0d513149258e6cd0c82861350b
2023-11-03 11:24:47,871 - clearml.storage - ERROR - Failed uploading: Could not connect to the endpoint URL: "https://b-ws-hk0m2-pd11-r87.s3.n-ws-hk0m2-pd11.amazonaws.com/b-ws-hk0m2-pd11-r87/test_clearml_s3_artifacts//.clearml.7047d8cd-43af-4b9b-a6fb-53cc6ab3a264.test"
```

But if I specify `output_uri='https://b-ws-hk0m2-pd11-r87.b1.s3.sbercloud.ru:443/b-ws-hk0m2-pd11-r87/test_clearml_s3_artifacts/'` in the Python script, it finally finds the right bucket.

Can you please tell me what I'm missing?

jkhenning commented 8 months ago

Hi @aimakhotka, as you've mentioned, you'll indeed need to use the full service endpoint (i.e. https://b-ws-hk0m2-pd11-r87.b1.s3.sbercloud.ru:443/<bucket-name>/...) to specify your non-AWS service, as ClearML has no way to understand you're choosing a non-AWS service otherwise. To make your life easier, you can use the sdk.development.default_output_uri setting in your clearml.conf file instead of specifying this every time you call Task.init()
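
For example, a minimal sketch of what that setting could look like in clearml.conf, using the endpoint and bucket the issue author reported as working above (adjust the path to your own layout):

```
sdk {
    development {
        # used whenever Task.init() is called without an explicit output_uri
        default_output_uri: "https://b-ws-hk0m2-pd11-r87.b1.s3.sbercloud.ru:443/b-ws-hk0m2-pd11-r87/test_clearml_s3_artifacts/"
    }
}
```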

aimakhotka commented 8 months ago

Hi @aimakhotka, as you've mentioned, you'll indeed need to use the full service endpoint (i.e. https://b-ws-hk0m2-pd11-r87.b1.s3.sbercloud.ru:443/<bucket-name>/...) to specify your non-AWS service, as ClearML has no way to understand you're choosing a non-AWS service otherwise. To make your life easier, you can use the sdk.development.default_output_uri setting in your clearml.conf file instead of specifying this every time you call Task.init()

Hi @jkhenning, thank you so much for your reply! The catch is that the method of setting the sdk.development.default_output_uri parameter doesn't work. I set the same address in clearml.conf, but it still doesn't work unless I also pass output_uri when calling Task.init(), although in theory everything should work. That's why I decided to ask for help.

jkhenning commented 8 months ago

Hi @aimakhotka,

I specify the same address in clearml.conf

Where do you specify it? In the sdk.aws.s3 section?

but it still doesn't work without specifying sdk.development.default_output_uri when calling Task.init(), although in theory everything should work

You should either provide it with sdk.development.default_output_uri or with Task.init(output_uri="https://b-ws-hk0m2-pd11-r87.b1.s3.sbercloud.ru:443/<bucket-name>/...") - are you saying that using one of these methods doesn't work?
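
As an illustration of the second option, a short sketch using the endpoint the issue author reported as working (the bucket path is theirs, not a required value):

```python
from clearml import Task

# an explicit output_uri overrides the clearml.conf / project default destination
task = Task.init(
    project_name='test_cloud',
    task_name='upload_artifacts',
    output_uri='https://b-ws-hk0m2-pd11-r87.b1.s3.sbercloud.ru:443/b-ws-hk0m2-pd11-r87/test_clearml_s3_artifacts/',
)
```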

aimakhotka commented 7 months ago

Hi @jkhenning,

Where do you specify it? in the sdk.aws.s3 section?

No, in the sdk.development.default_output_uri.

are you saying that using one of these methods doesn't work?

Yeah, that's exactly what I'm saying. I set sdk.development.default_output_uri, but it's as if ClearML doesn't see this parameter in the config. The method with Task.init() works, so the problem is not the S3 path.

jkhenning commented 7 months ago

So you:

  1. Set sdk.development.default_output_uri in your clearml.conf file with the value being https://...:443/bucket/...
  2. Run your Python script locally (on the same machine), which uses Task.init() but specifies no output_uri

And the SDK does not use the default output_uri? Can you attach screenshots of how the task looks in the ClearML UI? Specifically the Execution and Info sections?

aimakhotka commented 7 months ago

Can you attach screenshots of how the task looks in the ClearML UI? Specifically the Execution and Info sections?

Yes, sure.

Here's what happens in the terminal

```shell
$ python3 artifacts.py
ClearML Task: created new task id=7c73ee92bb4a4902b63bd2d7c9e88540
2023-11-10 14:29:52,039 - clearml.storage - ERROR - Failed uploading: Could not connect to the endpoint URL: "https://b-ws-hk0m2-pd11-r87.s3.n-ws-hk0m2-pd11.amazonaws.com/.clearml.2bc71333-b21a-48ee-b875-feb5f2372a15.test"
Traceback (most recent call last):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/httpsession.py", line 464, in send
    urllib_response = conn.urlopen(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connectionpool.py", line 404, in _make_request
    self._validate_conn(conn)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1058, in _validate_conn
    conn.connect()
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: : Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/storage/helper.py", line 2741, in check_write_permissions
    self.delete(path=dest_path)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/storage/helper.py", line 2726, in delete
    return self._driver.delete_object(self.get_object(path))
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/storage/helper.py", line 599, in delete_object
    object.delete()
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/boto3/resources/factory.py", line 580, in do_action
    response = action(self, *args, **kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/client.py", line 963, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/client.py", line 986, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 202, in _send_request
    while self._needs_retry(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 354, in _needs_retry
    responses = self._event_emitter.emit(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 207, in __call__
    if self._checker(**checker_kwargs):
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 284, in __call__
    should_retry = self._should_retry(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 320, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 363, in __call__
    checker_response = checker(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 281, in _do_get_response
    http_response = self._send(request)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/endpoint.py", line 377, in _send
    return self.http_session.send(request)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/botocore/httpsession.py", line 493, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://b-ws-hk0m2-pd11-r87.s3.n-ws-hk0m2-pd11.amazonaws.com/.clearml.2bc71333-b21a-48ee-b875-feb5f2372a15.test"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aimakhotka/Documents/github/sber/clearml_doc/src/files/artifacts.py", line 56, in <module>
    main()
  File "/home/aimakhotka/Documents/github/sber/clearml_doc/src/files/artifacts.py", line 13, in main
    task = Task.init(project_name='test_cloud', task_name='jkhenning_test')
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/task.py", line 593, in init
    task.output_uri = task.get_project_object().default_output_destination
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/task.py", line 1124, in output_uri
    helper.check_write_permissions(value)
  File "/home/aimakhotka/.pyenv/versions/3.9.0/lib/python3.9/site-packages/clearml/storage/helper.py", line 2743, in check_write_permissions
    raise ValueError("Insufficient permissions (delete failed) for {}".format(base_url))
ValueError: Insufficient permissions (delete failed) for s3://b-ws-hk0m2-pd11-r87
```

Screenshots of ClearML UI

![image](https://github.com/allegroai/clearml-server/assets/89968909/dc0ea59b-a322-4985-8c86-8b74b0e72d36)
![image](https://github.com/allegroai/clearml-server/assets/89968909/dd36ca6b-31cc-4757-a97c-8b651e38625e)
![image](https://github.com/allegroai/clearml-server/assets/89968909/1ba9c51a-643f-4dd2-88b5-b34cf6d0b41c)

The Configuration, Artifacts, Console, Scalars, Plots, and Debug Samples sections are empty.

My clearml.conf

```
# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: http://localhost:8008
    web_server: http://localhost:8080
    files_server: https://http://localhost:8081/

    # Credentials are generated using the webapp, http://62.113.97.251:8080/settings
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "***", "secret_key": "***"}
}
sdk {
    # ClearML - default SDK configuration

    storage {
        cache {
            # Defaults to /clearml_cache
            default_base_dir: "~/.clearml/cache"
            # default_cache_manager_size: 100
        }

        direct_access: [
            # Objects matching are considered to be available for direct access, i.e. they will not be downloaded
            # or cached, and any download request will return a direct reference.
            # Objects are specified in glob format, available for url and content_type.
            { url: "file://*" }  # file-urls are always directly referenced
        ]
    }

    metrics {
        # History size for debug files per metric/variant. For each metric/variant combination with an attached file
        # (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
        # X files are stored in the upload destination for each metric/variant combination.
        file_history_size: 100

        # Max history size for matplotlib imshow files per plot title.
        # File names for the uploaded images will be recycled in such a way that no more than
        # X images are stored in the upload destination for each matplotlib plot title.
        matplotlib_untitled_history_size: 100

        # Limit the number of digits after the dot in plot reporting (reducing plot report size)
        # plot_max_num_digits: 5

        # Settings for generated debug images
        images {
            format: JPEG
            quality: 87
            subsampling: 0
        }

        # Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
        tensorboard_single_series_per_graph: false
    }

    network {
        # Number of retries before failing to upload file
        file_upload_retries: 3

        metrics {
            # Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
            # a specific iteration
            file_upload_threads: 4

            # Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
            # being sent for upload
            file_upload_starvation_warning_sec: 120
        }

        iteration {
            # Max number of retries when getting frames if the server returned an error (http code 500)
            max_retries_on_server_error: 5

            # Backoff factory for consecutive retry attempts.
            # SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
            retry_backoff_factor_sec: 10
        }
    }

    aws {
        s3 {
            # S3 credentials, used for read/write access by various SDK elements
            # The following settings will be used for any bucket not specified below in the "credentials" section
            # ---------------------------------------------------------------------------------------------------
            key: "***"
            secret: "***"
            region: "n-ws-hk0m2-pd11"
            # Or enable credentials chain to let Boto3 pick the right credentials.
            # This includes picking credentials from environment variables,
            # credential file and IAM role using metadata service.
            # Refer to the latest Boto3 docs
            use_credentials_chain: true
            # Additional ExtraArgs passed to boto3 when uploading files. Can also be set per-bucket under "credentials".
            extra_args: { }
            # ---------------------------------------------------------------------------------------------------
            credentials: [
                {
                    # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
                    host: "n-ws-hk0m2-pd11.s3pd11.sbercloud.ru:443"
                    # Specify explicit keys
                    bucket: "b-ws-hk0m2-pd11-r87"
                    multipart: false
                    secure: true
                }
            ]
        }
        boto3 {
            pool_connections: 512
            max_multipart_concurrency: 16
            multipart_threshold: 8388608 # 8MB
            multipart_chunksize: 8388608 # 8MB
        }
    }

    google.storage {
        # # Default project and credentials file
        # # Will be used when no bucket configuration is found
        # project: "clearml"
        # credentials_json: "/path/to/credentials.json"
        # pool_connections: 512
        # pool_maxsize: 1024

        # # Specific credentials per bucket and sub directory
        # credentials = [
        #     {
        #         bucket: "my-bucket"
        #         subdir: "path/in/bucket"  # Not required
        #         project: "clearml"
        #         credentials_json: "/path/to/credentials.json"
        #     },
        # ]
    }

    azure.storage {
        # max_connections: 2

        # containers: [
        #     {
        #         account_name: "clearml"
        #         account_key: "secret"
        #         # container_name:
        #     }
        # ]
    }

    log {
        # debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
        null_log_propagate: false
        task_log_buffer_capacity: 66

        # disable urllib info and lower levels
        disable_urllib3_info: true
    }

    development {
        # Development-mode options

        # dev task reuse window
        task_reuse_time_window_in_hours: 72.0

        # Run VCS repository detection asynchronously
        vcs_repo_detect_async: true

        # Store uncommitted git/hg source code diff in experiment manifest when training in development mode
        # This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
        store_uncommitted_code_diff: true

        # Support stopping an experiment in case it was externally stopped, status was changed or task was reset
        support_stopping: true

        # Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
        default_output_uri: "https://n-ws-hk0m2-pd11.s3pd11.sbercloud.ru:443/b-ws-hk0m2-pd11-r87/test_clearml_s3_artifacts/"

        # Default auto generated requirements optimize for smaller requirements
        # If True, analyze the entire repository regardless of the entry point.
        # If False, first analyze the entry point script, if it does not contain other to local files,
        # do not analyze the entire repository.
        force_analyze_entire_repo: false

        # If set to true, *clearml* update message will not be printed to the console
        # this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
        suppress_update_message: false

        # If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
        detect_with_pip_freeze: false

        # Log specific environment variables. OS environments are listed in the "Environment" section
        # of the Hyper-Parameters.
        # multiple selected variables are supported including the suffix '*'.
        # For example: "AWS_*" will log any OS environment variable starting with 'AWS_'.
        # This value can be overwritten with os environment variable CLEARML_LOG_ENVIRONMENT="[AWS_*, CUDA_VERSION]"
        # Example: log_os_environments: ["AWS_*", "CUDA_VERSION"]
        log_os_environments: []

        # Development mode worker
        worker {
            # Status report period in seconds
            report_period_sec: 2

            # The number of events to report
            report_event_flush_threshold: 100

            # ping to the server - check connectivity
            ping_period_sec: 30

            # Log all stdout & stderr
            log_stdout: true

            # Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend
            # Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds
            console_cr_flush_period: 10

            # compatibility feature, report memory usage for the entire machine
            # default (false), report only on the running process and its sub-processes
            report_global_mem_used: false

            # if provided, start resource reporting after this amount of seconds
            #report_start_sec: 30
        }
    }

    # Apply top-level environment section from configuration into os.environ
    apply_environment: false
    # Top-level environment section is in the form of:
    #   environment {
    #     key: value
    #     ...
    #   }
    # and is applied to the OS environment as `key=value` for each key/value pair

    # Apply top-level files section from configuration into local file system
    apply_files: false
    # Top-level files section allows auto-generating files at designated paths with a predefined contents
    # and target format. Options include:
    #   contents: the target file's content, typically a string (or any base type int/float/list/dict etc.)
    #   format: a custom format for the contents. Currently supported value is `base64` to automatically decode a
    #           base64-encoded contents string, otherwise ignored
    #   path: the target file's path, may include ~ and inplace env vars
    #   target_format: format used to encode contents before writing into the target file. Supported values are json,
    #                  yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
    #   overwrite: overwrite the target file in case it exists. Default is true.
    #
    # Example:
    #   files {
    #     myfile1 {
    #       contents: "The quick brown fox jumped over the lazy dog"
    #       path: "/tmp/fox.txt"
    #     }
    #     myjsonfile {
    #       contents: {
    #         some {
    #           nested {
    #             value: [1, 2, 3, 4]
    #           }
    #         }
    #       }
    #       path: "/tmp/test.json"
    #       target_format: json
    #     }
    #   }
}
```

jkhenning commented 7 months ago

You're specifying both the global sdk.aws.s3 settings:

            key: "***"
            secret: "***"
            region: "n-ws-hk0m2-pd11"
            use_credentials_chain: true

As well as the bucket-specific:

            credentials: [
                {
                    # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
                    host: "n-ws-hk0m2-pd11.s3pd11.sbercloud.ru:443"
                    # Specify explicit keys
                    bucket: "b-ws-hk0m2-pd11-r87"
                    multipart: false
                    secure: true
                }
            ]

You should only specify the bucket-specific one and not use the credentials chain. Can you please try it out?
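
In other words, roughly this shape (a sketch built from the values already posted in this thread, with the keys moved into the per-bucket entry and the credentials chain left disabled):

```
aws {
    s3 {
        # leave the global key/secret/region unset and do not enable use_credentials_chain
        credentials: [
            {
                # per-bucket endpoint and explicit keys for the non-AWS service
                host: "n-ws-hk0m2-pd11.s3pd11.sbercloud.ru:443"
                bucket: "b-ws-hk0m2-pd11-r87"
                key: "***"
                secret: "***"
                multipart: false
                secure: true
            }
        ]
    }
}
```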

aimakhotka commented 7 months ago

You should only specify the bucket-specific one and not use the credentials chain, can you please try it out?

That helped! But default_output_uri had to be specified in the format s3://...:443/bucket-name/.... Thank you so much! Why is this happening? What does the use_credentials_chain parameter do?
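
For later readers, the combination described here would presumably look like this in clearml.conf; the host in the s3:// URI is an assumption (taken from the aws.s3.credentials host entry above), since the comment only shows s3://...:443/bucket-name/...:

```
sdk {
    development {
        # assumed format: s3://<endpoint-host>:<port>/<bucket>/<path> - the host below is inferred, not quoted from the author
        default_output_uri: "s3://n-ws-hk0m2-pd11.s3pd11.sbercloud.ru:443/b-ws-hk0m2-pd11-r87/test_clearml_s3_artifacts/"
    }
}
```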

jkhenning commented 7 months ago

This parameter basically tells boto3 to look for credentials in the system's configuration or in an AWS role (in case it's running on an AWS machine) and not use the explicitly provided credentials
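
To illustrate the difference, a minimal boto3 sketch (not ClearML internals; the endpoint and keys are the placeholder values from this thread):

```python
import boto3

# use_credentials_chain: true  ->  roughly: let boto3 resolve credentials on its own
# (environment variables, ~/.aws/credentials, IAM/instance metadata), using the
# default AWS endpoint unless one is explicitly configured
s3_via_chain = boto3.client("s3")

# use_credentials_chain: false with per-bucket host/key/secret  ->  roughly:
s3_explicit = boto3.client(
    "s3",
    endpoint_url="https://n-ws-hk0m2-pd11.s3pd11.sbercloud.ru:443",
    aws_access_key_id="***",
    aws_secret_access_key="***",
)
```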

aimakhotka commented 7 months ago

This parameter basically tells boto3 to look for credentials in the system's configuration or in an AWS role (in case it's running on an AWS machine) and not use the explicitly provided credentials

Ooh, I see. In all the examples I saw, this parameter was "true" and I didn't really understand the description in the documentation, so I didn't even think about it. Thanks, you really helped me out!