Kaggle / kaggle-api

Official Kaggle API
Apache License 2.0

Download straight to S3 #315

Closed domteo95 closed 1 year ago

domteo95 commented 3 years ago

How can I download straight to S3?

I'm currently doing something like kaggle datasets download -d 'stefanoleone992/fifa-21-complete-player-dataset' | aws s3 cp - s3://s3_bucket/data.zip, but the data.zip file that appears in my bucket is a broken file.

Whereas if I specify the name of the zip file that I'm trying to cp, as seen below, it first downloads to my local machine and then copies to S3; it's that initial download to my local machine that I'm hoping to avoid: kaggle datasets download -d 'stefanoleone992/fifa-21-complete-player-dataset' | aws s3 cp fifa-21-complete-player-dataset.zip s3://s3_bucket/data.zip

Thanks!

filiptronicek commented 3 years ago

I'm guessing there is no way to do that: if the file isn't downloaded to your machine first, then Kaggle's servers would need to handle the upload to S3, which they don't.

AdityaSoni19031997 commented 3 years ago

You can do that by using boto3 or the AWS CLI with the correct credentials configured. One option is to use Colab for this: first let the download of the file complete there, then move it to AWS. If your dataset is small enough, you can just as easily do this in a Kaggle kernel.
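A minimal sketch of that download-then-upload approach, assuming the kaggle and boto3 packages are installed and credentials for both services are configured; the bucket name and object key below are placeholders:

import boto3
from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate against Kaggle using ~/.kaggle/kaggle.json (or the
# KAGGLE_USERNAME / KAGGLE_KEY environment variables).
api = KaggleApi()
api.authenticate()

# First download the dataset archive to local (Colab / kernel) disk;
# it is written as <dataset-slug>.zip in the given path.
dataset = "stefanoleone992/fifa-21-complete-player-dataset"
api.dataset_download_files(dataset, path="/tmp", quiet=False)

# Then copy the archive up to S3 (bucket name and key are placeholders).
s3 = boto3.client("s3")
s3.upload_file("/tmp/fifa-21-complete-player-dataset.zip", "s3_bucket", "data.zip")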

rholowczak commented 2 years ago

If you are willing/brave enough to edit some Python source code in the Kaggle API then you can do the following:

1) Install the Kaggle Command Line Interface (CLI). Do this as your regular user (not as root): $ pip3 install kaggle. Then add your API access token to the ~/.kaggle/kaggle.json file.

2) Edit kaggle_api_extended.py and make two changes (shown below) to accommodate writing the file to standard output: nano ~/.local/lib/python3.7/site-packages/kaggle/api/kaggle_api_extended.py

Use the "Go To" key in nano (^_) to go to line 1582 in the file. Change line 1582 from: if not os.path.exists(outpath): to: if not os.path.exists(outpath) and outpath != "-":

Change line 1594 from: with open(outfile, 'wb') as out: to: with open(outfile, 'wb') if outpath != "-" else os.fdopen(sys.stdout.fileno(), 'wb', closefd=False) as out: (a sketch of the pattern these edits introduce follows these steps).

Save the file and exit the text editor.

3) Pick a Kaggle dataset you will be working with (for example, I chose totoro29/air-pollution-level). You will need the author and dataset name.

4) Use the kaggle datasets download command to fetch the dataset and send the file to standard output. Pipe this output to the aws s3 cp command and direct it to your S3 bucket; you also need to give it a file name. For example: kaggle datasets download --quiet -d totoro29/air-pollution-level -p - | aws s3 cp - s3://project-data-rh/polution.zip

5) Check your S3 bucket to see if the file was downloaded: aws s3 ls s3://project-data-rh
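For orientation, here is a small self-contained sketch of the pattern those two edits introduce: an output path of "-" is treated as "write to standard output", so the downloaded bytes can be piped straight to aws s3 cp -. This is not the actual kaggle source (the line numbers 1582/1594 and the surrounding code depend on the installed version), and the function name write_download is made up for illustration:

import os
import sys

def write_download(data: bytes, outpath: str, outfile: str) -> None:
    """Write data to outfile, or to stdout when outpath is "-"."""
    # Only create the output directory for a real path, never for "-".
    if not os.path.exists(outpath) and outpath != "-":
        os.makedirs(outpath)

    # Open the target file, or reuse stdout's file descriptor without closing it,
    # so the bytes can be piped to a command such as: aws s3 cp - s3://bucket/key
    with open(outfile, 'wb') if outpath != "-" \
            else os.fdopen(sys.stdout.fileno(), 'wb', closefd=False) as out:
        out.write(data)

# Example: write_download(b"hello", "-", "-") sends the bytes to stdout.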

tshadat2002 commented 2 years ago

So given we follow the steps above, how do we unzip the files within the S3 bucket? I'm assuming we can't really use a zip file at all without unzipping it.

rholowczak commented 2 years ago

> So given we follow the steps above, how do we unzip the files within the S3 bucket? I'm assuming we can't really use a zip file at all without unzipping it.

I've been using the zipfile module in Python combined with Boto3 to access the zip file and pick out the parts I need. https://docs.python.org/3/library/zipfile.html

Also, I think the kaggle datasets download command has an option to unzip the contents during the download. I have not tested that yet; it might get complicated if the ZIP archive contains more than one file, since the issue would then be which file to send to standard output for the aws s3 cp command to pick up.
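As a rough illustration of that zipfile-plus-Boto3 approach, reusing the bucket and key names from the earlier example (adjust them to your own bucket), something like this reads a single member out of the archive without extracting everything:

import io
import zipfile

import boto3

s3 = boto3.client("s3")

# Fetch the zip archive that was copied to S3 earlier (names from the example above).
obj = s3.get_object(Bucket="project-data-rh", Key="polution.zip")
archive = zipfile.ZipFile(io.BytesIO(obj["Body"].read()))

# List the members, then pick out just the one you need instead of unzipping the whole archive.
print(archive.namelist())
with archive.open(archive.namelist()[0]) as member:
    data = member.read()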

rholowczak commented 2 years ago

> I've been using the zipfile module in Python combined with Boto3 to access the zip file and pick out the parts I need. https://docs.python.org/3/library/zipfile.html

Adapted from https://bongtechblogger.hashnode.dev/how-to-extract-and-manipulate-all-the-zip-files-stored-in-a-folder-of-an-amazon-s3-bucket-part-1-the-extraction

import zipfile
import boto3
from io import BytesIO

bucket = "project-data-XX"          # Put your bucket name here
zipfile_to_unzip = "archive.zip"    # Put the name of your zip file here

s3_resource = boto3.resource('s3')

# Read the whole zip archive from S3 into memory (large archives need enough RAM).
zip_obj = s3_resource.Object(bucket_name=bucket, key=zipfile_to_unzip)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)

for filename in z.namelist():       # Loop through all of the files contained in the zip archive
    print('Working on ' + filename)
    # Unzip the file and write it back to S3 in the same bucket
    s3_resource.meta.client.upload_fileobj(z.open(filename), Bucket=bucket, Key=filename)

Philmod commented 1 year ago

We recommend building your own pipeline to achieve this goal.