Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0

LitData doesn't support s3 bucket connection outside server #183

Open sanyalsunny111 opened 5 days ago

sanyalsunny111 commented 5 days ago

🚀 Feature

LitData should support s3 bucket connection for streaming data outside of the same server.

Motivation

Currently, LitData supports S3 bucket connections from within the public prod server, but not from outside it, for instance from a GCP server.

Additional context

Sebastian and Adrian motivated me to raise this issue.

github-actions[bot] commented 5 days ago

Hi! Thanks for your contribution! Great first issue!

tchaton commented 5 days ago

Hey @sanyalsunny111,

I am not sure I fully understand the issue.

rasbt commented 5 days ago

@sanyalsunny111, could you provide concrete code snippets and file paths (and Studio names) so that @tchaton has a concrete example to follow?

sanyalsunny111 commented 5 days ago

Acknowledged, I will do it shortly.

sanyalsunny111 commented 3 days ago

@tchaton So, a dataset is uploaded to a publicly accessible S3 bucket and also to the data prep of a teamspace. I have tried accessing this data using the Studio's public prod profile. However, when I try to use the same data through S3 (yes, I have configured it through the AWS CLI) or the teamspace, I can't access it. Below is a screenshot where it is asking for an access key.

[Screenshot: prompt asking for an access key]

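One way to narrow down a report like this is to check whether the object is readable with no credentials at all, independent of the AWS CLI profile. The following is a minimal sketch under that assumption, using only the Python standard library. The bucket name, key, and region in the usage note are hypothetical placeholders, and the URL follows the S3 virtual-hosted-style pattern.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import quote
from urllib.request import Request, urlopen


def public_object_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Build the virtual-hosted-style HTTPS URL for an S3 object."""
    # quote() percent-encodes unsafe characters but keeps "/" in the key.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def is_publicly_readable(url: str, timeout: float = 10.0) -> bool:
    """Return True if an unauthenticated HEAD request to the URL succeeds."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status == 200
    except (HTTPError, URLError):
        # A 403 or 404 response, or a network failure, ends up here.
        return False
```

For example, `is_publicly_readable(public_object_url("my-bucket", "data/index.json"))` (hypothetical names) returning `False` would suggest the object is not actually public, or that the region is wrong, rather than pointing to a client-side credential problem.
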
tchaton commented 3 days ago

Hey @sanyalsunny111. Can you share a reproducible script?

sanyalsunny111 commented 3 days ago

Sure @tchaton I am using litgpt w/ no changes. Here is a loom video I recorded https://www.loom.com/share/5b55bc4c23e3403ea3257cdf34ceab2e?sid=761c670b-d52d-465e-bafe-d86be5d239cb

tchaton commented 3 days ago

Hey @sanyalsunny111, is there any Studio I can duplicate?

sanyalsunny111 commented 2 days ago

Here: /thunder/Experiments-Sunny2024

sanyalsunny111 commented 2 days ago

@tchaton Luca made some modifications and it is working fine for me now, so I thought I would update you. He changed the lines below in /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/client.py:

if has_shared_credentials_file or not _IS_IN_STUDIO or True:  # "or True" forces this branch, regardless of the other checks
    self._client = boto3.client(
        "s3",
        config=botocore.config.Config(
            retries={"max_attempts": 1000, "mode": "adaptive"},
            signature_version=botocore.UNSIGNED,  # send unsigned (anonymous) requests
        ),
    )

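The workaround above always takes the unsigned branch because of `or True`. A minimal sketch of a more explicit variant, where anonymous access is opt-in, might look like the following. This is not litdata's actual implementation: the environment variable name `S3_ANONYMOUS` and both helper functions are hypothetical and chosen only for illustration. Only `boto3.client`, `botocore.config.Config`, and `botocore.UNSIGNED` are real library names.

```python
import os

# Hypothetical environment variable name, chosen for this sketch only.
ANON_ENV_VAR = "S3_ANONYMOUS"


def wants_anonymous_access(env=None) -> bool:
    """Return True if the caller opted into unsigned (anonymous) S3 requests."""
    env = os.environ if env is None else env
    return env.get(ANON_ENV_VAR, "").strip().lower() in {"1", "true", "yes"}


def make_s3_client(anonymous: bool):
    """Create an S3 client; unsigned requests only work for public buckets."""
    # Imported lazily so the helper above stays usable without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    retries = {"max_attempts": 1000, "mode": "adaptive"}
    if anonymous:
        # Skip request signing entirely, so no access key is looked up.
        config = Config(retries=retries, signature_version=UNSIGNED)
    else:
        # Default behaviour: boto3 resolves credentials and signs requests.
        config = Config(retries=retries)
    return boto3.client("s3", config=config)
```

With this shape, `make_s3_client(wants_anonymous_access())` keeps signed requests as the default, so private buckets continue to work, while public buckets can be read from machines that have no credentials configured.
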
tchaton commented 1 day ago

Hey @sanyalsunny111. Can you make a PR with the fix?