JuliaML / MLDatasets.jl

Utility package for accessing common Machine Learning datasets in Julia
https://juliaml.github.io/MLDatasets.jl/stable
MIT License
228 stars 47 forks

Feature request: Kaggle dataset support #214

Open NeroBlackstone opened 1 year ago

NeroBlackstone commented 1 year ago

Kaggle hosts many datasets, most of them in CSV format.

Would it make sense to add support for downloading Kaggle datasets directly in MLDatasets.jl?

For example, to download the House Prices 2023 Dataset:

Step 1: Get the kaggle.json file, or set the username and key manually.

username = "neroblackstone"
key = "key"

or download kaggle.json to ~/.kaggle/
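The credential lookup in Step 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: `kaggle_credentials` is a hypothetical helper name, and the `KAGGLE_USERNAME` / `KAGGLE_KEY` environment variables are the ones the Python client documents. The sketch also assumes kaggle.json is a flat object with string `username` and `key` fields, so it uses a regex instead of a real JSON parser to stay dependency-free.

```julia
# Hypothetical helper: look up Kaggle credentials without any Python dependency.
# Returns a (username, key) named tuple, or `nothing` if no credentials are found.
function kaggle_credentials(; env = ENV,
                            config_path = joinpath(homedir(), ".kaggle", "kaggle.json"))
    # Environment variables are checked first.
    if haskey(env, "KAGGLE_USERNAME") && haskey(env, "KAGGLE_KEY")
        return (username = env["KAGGLE_USERNAME"], key = env["KAGGLE_KEY"])
    end

    # Fall back to the kaggle.json file, if it exists.
    isfile(config_path) || return nothing
    text = read(config_path, String)

    # Assumption: the file is a flat object with string values, so a regex is
    # enough here; a real implementation would use a JSON parser.
    user = match(r"\"username\"\s*:\s*\"([^\"]*)\"", text)
    key = match(r"\"key\"\s*:\s*\"([^\"]*)\"", text)
    (user === nothing || key === nothing) && return nothing

    return (username = String(user.captures[1]), key = String(key.captures[1]))
end
```

Passing `env` and `config_path` as keyword arguments keeps the lookup easy to test without touching the real home directory.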

Step 2: Download

# Download the dataset to the default path and extract the CSV files.
files_path = kaggle_download("howisusmanali/house-prices-2023-dataset")
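A pure-Julia version of the download step might look something like the sketch below, using only the `Base64` and `Downloads` standard libraries. The base URL, the `/datasets/download/<owner>/<dataset>` route, and the use of HTTP Basic authentication are assumptions based on how the public Kaggle API is commonly described, not something confirmed in this thread. All function names here are hypothetical. Unlike the proposal above, this sketch stops at saving the zip archive and leaves extraction out.

```julia
using Base64
using Downloads

# Assumed base URL of the Kaggle REST API.
const KAGGLE_API_BASE = "https://www.kaggle.com/api/v1"

# Build the (assumed) download URL for a dataset slug such as "owner/dataset".
kaggle_dataset_url(slug) = string(KAGGLE_API_BASE, "/datasets/download/", slug)

# Build an HTTP Basic Authorization header from the username and API key.
basic_auth_header(username, key) =
    "Authorization" => "Basic " * base64encode(string(username, ":", key))

# Hypothetical download function: saves the dataset archive into `dir`
# and returns the path of the downloaded zip file (no extraction).
function kaggle_download(slug, username, key; dir = mktempdir())
    archive = joinpath(dir, replace(slug, "/" => "_") * ".zip")
    Downloads.download(kaggle_dataset_url(slug), archive;
                       headers = [basic_auth_header(username, key)])
    return archive
end
```

Keeping the URL and header construction in separate small functions means they can be checked without network access or real credentials.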

Step 3: Processing

using CSV
using DataFrames

file_path = joinpath(files_path, "csv_we_want.csv")
# Pass the path directly so CSV.jl opens and closes the file itself.
data = CSV.read(file_path, DataFrame)

Implementation:

What are your thoughts? Do you think this feature makes sense? I can implement it myself and open a PR.

CarloLucibello commented 1 year ago

This would be great to have! We have to go with the REST API though; so far we have managed to avoid the PyCall dependency.

NeroBlackstone commented 1 year ago

I'm trying to implement the complete Kaggle API in Julia.

Since Python's Kaggle API is generated from the OpenAPI specification, I also want to use OpenAPI.jl to generate a Julia Kaggle API.

However, OpenAPI.jl does not have full support for file downloads. Once that feature is implemented, I will continue working on kaggle.jl.