Kaggle / kaggle-api

Official Kaggle API
Apache License 2.0

Kaggle CLI Listing and Downloading files where archive has a space in the folder name #500

rholowczak closed this issue 1 year ago

rholowczak commented 1 year ago

Kaggle CLI has issues when working with datasets that have nested folders with spaces in the folder names. One example is this dataset: viktoriiashkurenko/278k-spotify-songs

We can use the Kaggle CLI to get a list of files in the dataset:

$ kaggle datasets files viktoriiashkurenko/278k-spotify-songs
name                                size  creationDate
--------------------------------  ------  -------------------
artists.csv                          6MB  2023-05-18 17:11:45
music_genres.txt                     3KB  2023-05-18 17:11:45
final_tracks.csv                    61MB  2023-05-18 17:11:45
im_getting_these_vibes_uknow.txt     2KB  2023-05-18 17:11:45
main_dataset.csv                   115MB  2023-05-18 17:11:45
final_playlists.csv               1000KB  2023-05-18 17:11:45

However, this list omits the nested folder: "Cleaned Analyses"

It does not seem possible to list the files in that folder. Possibly this is due to the space in the name of the folder:

kaggle datasets files viktoriiashkurenko/278k-spotify-songs/Cleaned Analyses
usage: kaggle [-h] [-v] {competitions,c,datasets,d,kernels,k,models,m,files,f,config} ...
kaggle: error: unrecognized arguments: Analyses
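
The "unrecognized arguments: Analyses" message is consistent with the shell splitting the unquoted path at the space, so the CLI receives "Analyses" as a separate argument. A minimal sketch of that tokenization follows, using Python's shlex module with POSIX-style rules as a stand-in for the shell. This is an approximation: the Windows command shell differs in its quoting details, although it also splits on unquoted spaces.

```python
import shlex

# The same command line, without and with double quotes around the path.
unquoted = "kaggle datasets files viktoriiashkurenko/278k-spotify-songs/Cleaned Analyses"
quoted = 'kaggle datasets files "viktoriiashkurenko/278k-spotify-songs/Cleaned Analyses"'

# shlex.split applies POSIX-style word splitting (an approximation of the shell).
unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

# Unquoted: the path is cut at the space and "Analyses" becomes its own argument.
print(unquoted_args[-2:])
# Quoted: the whole path, space included, arrives as a single argument.
print(quoted_args[-1:])
```

Quoting therefore gets the full string to the CLI as one argument. What the CLI then does with that argument is a separate question.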

Enclosing the path in single or double quotes does not help. Escaping the space or replacing it with a URL-encoded space (%20) does not seem to work either. This is on the Windows command shell, if that makes a difference:

kaggle datasets files "viktoriiashkurenko/278k-spotify-songs/Cleaned Analyses"
400 - Bad Request - Invalid datasetVersionNumber value
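
One plausible reading of this error is that a third slash-separated segment in the dataset reference is treated as a version number rather than as a folder path. That is only an assumption, since the CLI's parsing is not shown in this thread. The sketch below uses a hypothetical helper, split_dataset_ref, which is not the Kaggle API's own code, to illustrate how parsing of that kind would reject "Cleaned Analyses".

```python
from typing import Optional, Tuple


def split_dataset_ref(ref: str) -> Tuple[str, str, Optional[int]]:
    """Hypothetical parser for 'owner/slug' or 'owner/slug/version'.

    Illustration only: this is NOT taken from the Kaggle API source.
    """
    parts = ref.split("/")
    if len(parts) == 2:
        return parts[0], parts[1], None
    if len(parts) == 3:
        # A third segment is assumed to be a numeric version, not a folder.
        if not parts[2].isdigit():
            raise ValueError(f"Invalid version number: {parts[2]!r}")
        return parts[0], parts[1], int(parts[2])
    raise ValueError(f"Expected owner/slug[/version], got {ref!r}")


try:
    split_dataset_ref("viktoriiashkurenko/278k-spotify-songs/Cleaned Analyses")
except ValueError as err:
    print(err)
```

If the reference is parsed along these lines, no amount of quoting or escaping would make a folder name acceptable in that position.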

We can extract one file from a dataset by specifying the "-f" option:

kaggle datasets download -d viktoriiashkurenko/278k-spotify-songs -f artists.csv

It seems we can put quotes around the full file path to extract individual files:

kaggle datasets download -d viktoriiashkurenko/278k-spotify-songs -f "Cleaned Analyses/Cleaned Analyses/000CbwTZICdj6uprlrc1f1.pickle"
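
When the same download is scripted, passing the arguments as a list sidesteps shell quoting entirely. The sketch below assumes only the -d and -f options shown above. The helper name build_download_command is hypothetical, and actually running the command requires the kaggle CLI to be installed and authenticated.

```python
import shlex


def build_download_command(dataset: str, file_path: str) -> list:
    """Build the argument list for downloading one file from a dataset."""
    # Each list element is passed to the CLI as exactly one argument,
    # so spaces inside file_path need no quoting or escaping.
    return ["kaggle", "datasets", "download", "-d", dataset, "-f", file_path]


cmd = build_download_command(
    "viktoriiashkurenko/278k-spotify-songs",
    "Cleaned Analyses/Cleaned Analyses/000CbwTZICdj6uprlrc1f1.pickle",
)

# Show a POSIX-shell-quoted form of the command for reference.
print(shlex.join(cmd))

# To execute it (not run here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```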

Perhaps I am just missing some obvious tricks or command-line options. Please let me know if you have any suggestions.

Thanks

rholowczak commented 1 year ago

I closed this issue because I believe the title is misleading. The real issue is that one currently cannot get a complete list of all files in a dataset, including those in nested folders.
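
A possible workaround, at the cost of downloading the whole dataset first, is to list the contents of the downloaded archive locally. The sketch below assumes the dataset was downloaded as a single zip file. The helper name list_archive_files and the archive filename in the usage note are assumptions, not something the CLI guarantees.

```python
import zipfile


def list_archive_files(zip_path):
    """Return every file path in the archive, including nested folders."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; keep full paths such as
        # "Cleaned Analyses/Cleaned Analyses/<name>.pickle".
        return [info.filename for info in archive.infolist() if not info.is_dir()]
```

For example, list_archive_files("278k-spotify-songs.zip") would return the nested paths, which could then be passed to the "-f" option one at a time.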