Kaggle / kaggle-api

Official Kaggle API
Apache License 2.0

404 when downloading single file of a large dataset #295

Open Jeosas opened 4 years ago

Jeosas commented 4 years ago

Hi,

I'm trying to download only one file of the jorijnsmit/binance-full-history dataset (it's a large dataset containing ~1000 files, and I don't need all of them).

First of all, when listing files with kaggle d files jorijnsmit/binance-full-history, I only get:

name                   size  creationDate         
--------------------  -----  -------------------  
AE-BNB.parquet         11MB  2020-08-22 23:52:30  
AGI-ETH.parquet        15MB  2020-08-22 23:52:30  
ADA-BNB.parquet        22MB  2020-08-22 23:52:30  
AE-BTC.parquet         27MB  2020-08-22 23:52:30  
ADA-TUSD.parquet       11MB  2020-08-22 23:52:30  
ADAUP-USDT.parquet    944KB  2020-08-22 23:52:30  
ADA-BTC.parquet        37MB  2020-08-22 23:52:30  
ADA-BKRW.parquet      563KB  2020-08-22 23:52:30  
ADA-PAX.parquet         5MB  2020-08-22 23:52:30  
AGI-BTC.parquet        21MB  2020-08-22 23:52:30  
ADADOWN-USDT.parquet  868KB  2020-08-22 23:52:30  
ADA-ETH.parquet        34MB  2020-08-22 23:52:30  
AE-ETH.parquet         20MB  2020-08-22 23:52:30  
ADX-ETH.parquet        20MB  2020-08-22 23:52:30  
AGI-BNB.parquet         9MB  2020-08-22 23:52:30  
ADX-BTC.parquet        28MB  2020-08-22 23:52:30  
ADA-BUSD.parquet        6MB  2020-08-22 23:52:30  
ADA-USDT.parquet       38MB  2020-08-22 23:52:30  
ADA-USDC.parquet        9MB  2020-08-22 23:52:30  
ADX-BNB.parquet        13MB  2020-08-22 23:52:30

And trying to download a file:

kaggle d download -p /data/test -f AGI-ETH.parquet jorijnsmit/binance-full-history works perfectly, whereas kaggle d download -p /data/test -f AION-BNB.parquet jorijnsmit/binance-full-history returns 404 - Not Found, even though AION-BNB.parquet exists in the dataset.

NOTE that if I run kaggle d download -p /data/test jorijnsmit/binance-full-history, everything works fine and AION-BNB.parquet is downloaded with the rest of the dataset (but I don't want to download 12 GB each time I wish to update 5-10 files).
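To narrow down which files are affected, a loop over the single-file command quoted above can collect the names that fail. The following is only a minimal Python sketch: the helper names build_download_cmd and try_downloads are hypothetical, and treating a non-zero exit code or the text "404" in the output as a failure is an assumption about how the CLI reports the error, not documented behavior.

```python
import subprocess
from typing import Callable, Iterable, List


def build_download_cmd(dataset: str, file_name: str, dest: str) -> List[str]:
    """Build the single-file download command with the flags used in this thread."""
    # -p sets the target directory, -f selects one file of the dataset.
    return ["kaggle", "datasets", "download", "-p", dest, "-f", file_name, dataset]


def try_downloads(
    dataset: str,
    files: Iterable[str],
    dest: str,
    run: Callable = subprocess.run,
) -> List[str]:
    """Try each file in turn and return the names whose download failed."""
    failed = []
    for name in files:
        result = run(
            build_download_cmd(dataset, name, dest),
            capture_output=True,
            text=True,
        )
        output = (result.stdout or "") + (result.stderr or "")
        # Assumption: a failed download shows up as a non-zero exit code
        # or as "404" somewhere in the CLI output.
        if result.returncode != 0 or "404" in output:
            failed.append(name)
    return failed
```

Calling try_downloads("jorijnsmit/binance-full-history", ["AGI-ETH.parquet", "AION-BNB.parquet"], "/data/test") would then return the subset of names that could not be fetched individually. The run parameter exists only so the loop can be exercised without network access.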

Any ideas?

Info: Python v3.8, kaggle v1.5.6 (tried downgrading to v1.5.3, same issue)

milotoor commented 3 years ago

Hi @Jeosas, were you ever able to work around this? I'm having the exact same problem with a different dataset:

% kaggle datasets files its7171/hpa-mask

name                                                      size  creationDate         
--------------------------------------------------------  ----  -------------------  
hpa_nuclei_mask/00301238-bbb2-11e8-b2ba-ac1f6b6435d0.npz  28KB  2021-01-31 05:11:49  
hpa_nuclei_mask/004bf4c6-bbc6-11e8-b2bc-ac1f6b6435d0.npz  14KB  2021-01-31 05:11:49  
hpa_nuclei_mask/00456fd2-bb9b-11e8-b2b9-ac1f6b6435d0.npz  29KB  2021-01-31 05:11:49  
hpa_nuclei_mask/0042017c-bba4-11e8-b2b9-ac1f6b6435d0.npz  18KB  2021-01-31 05:11:49  
hpa_nuclei_mask/00383b44-bbbb-11e8-b2ba-ac1f6b6435d0.npz  22KB  2021-01-31 05:11:49  
hpa_nuclei_mask/000a6c98-bb9b-11e8-b2b9-ac1f6b6435d0.npz  24KB  2021-01-31 05:11:49  
hpa_nuclei_mask/000a9596-bbc4-11e8-b2bc-ac1f6b6435d0.npz  18KB  2021-01-31 05:11:49  
hpa_nuclei_mask/00285ce4-bba0-11e8-b2b9-ac1f6b6435d0.npz  25KB  2021-01-31 05:11:49  
hpa_nuclei_mask/0032a07e-bba9-11e8-b2ba-ac1f6b6435d0.npz  28KB  2021-01-31 05:11:49  
hpa_nuclei_mask/00481c70-bba3-11e8-b2b9-ac1f6b6435d0.npz  22KB  2021-01-31 05:11:49  
hpa_nuclei_mask/0020af02-bbba-11e8-b2ba-ac1f6b6435d0.npz  34KB  2021-01-31 05:11:49  
hpa_nuclei_mask/003feb6e-bbca-11e8-b2bc-ac1f6b6435d0.npz  33KB  2021-01-31 05:11:49  
hpa_nuclei_mask/0047c984-bba6-11e8-b2ba-ac1f6b6435d0.npz  40KB  2021-01-31 05:11:49  
hpa_nuclei_mask/002ff91e-bbb8-11e8-b2ba-ac1f6b6435d0.npz  36KB  2021-01-31 05:11:49  
hpa_nuclei_mask/004a2b84-bbc4-11e8-b2bc-ac1f6b6435d0.npz  32KB  2021-01-31 05:11:49  
hpa_nuclei_mask/0038d6a6-bb9a-11e8-b2b9-ac1f6b6435d0.npz  42KB  2021-01-31 05:11:49  
hpa_nuclei_mask/002679c2-bbb6-11e8-b2ba-ac1f6b6435d0.npz  23KB  2021-01-31 05:11:49  
hpa_nuclei_mask/004b47de-bbca-11e8-b2bc-ac1f6b6435d0.npz  31KB  2021-01-31 05:11:49  
hpa_nuclei_mask/000c99ba-bba4-11e8-b2b9-ac1f6b6435d0.npz  32KB  2021-01-31 05:11:49  
hpa_nuclei_mask/001838f8-bbca-11e8-b2bc-ac1f6b6435d0.npz  36KB  2021-01-31 05:11:49  
hpa_cell_mask/00301238-bbb2-11e8-b2ba-ac1f6b6435d0.npz    58KB  2021-01-31 05:11:49  
hpa_cell_mask/004bf4c6-bbc6-11e8-b2bc-ac1f6b6435d0.npz    31KB  2021-01-31 05:11:49  
hpa_cell_mask/00456fd2-bb9b-11e8-b2b9-ac1f6b6435d0.npz    55KB  2021-01-31 05:11:49  
hpa_cell_mask/0042017c-bba4-11e8-b2b9-ac1f6b6435d0.npz    30KB  2021-01-31 05:11:49  
hpa_cell_mask/00383b44-bbbb-11e8-b2ba-ac1f6b6435d0.npz    55KB  2021-01-31 05:11:49  
hpa_cell_mask/000a6c98-bb9b-11e8-b2b9-ac1f6b6435d0.npz    49KB  2021-01-31 05:11:49  
hpa_cell_mask/000a9596-bbc4-11e8-b2bc-ac1f6b6435d0.npz    35KB  2021-01-31 05:11:49  
hpa_cell_mask/00285ce4-bba0-11e8-b2b9-ac1f6b6435d0.npz    46KB  2021-01-31 05:11:49  
hpa_cell_mask/0032a07e-bba9-11e8-b2ba-ac1f6b6435d0.npz    61KB  2021-01-31 05:11:49  
hpa_cell_mask/00481c70-bba3-11e8-b2b9-ac1f6b6435d0.npz    45KB  2021-01-31 05:11:49  
hpa_cell_mask/0020af02-bbba-11e8-b2ba-ac1f6b6435d0.npz    60KB  2021-01-31 05:11:49  
hpa_cell_mask/003feb6e-bbca-11e8-b2bc-ac1f6b6435d0.npz    63KB  2021-01-31 05:11:49  
hpa_cell_mask/0047c984-bba6-11e8-b2ba-ac1f6b6435d0.npz    72KB  2021-01-31 05:11:49  
hpa_cell_mask/002ff91e-bbb8-11e8-b2ba-ac1f6b6435d0.npz    58KB  2021-01-31 05:11:49  
hpa_cell_mask/004a2b84-bbc4-11e8-b2bc-ac1f6b6435d0.npz    67KB  2021-01-31 05:11:49  
hpa_cell_mask/0038d6a6-bb9a-11e8-b2b9-ac1f6b6435d0.npz    72KB  2021-01-31 05:11:49  
hpa_cell_mask/002679c2-bbb6-11e8-b2ba-ac1f6b6435d0.npz    39KB  2021-01-31 05:11:49  
hpa_cell_mask/004b47de-bbca-11e8-b2bc-ac1f6b6435d0.npz    57KB  2021-01-31 05:11:49  
hpa_cell_mask/000c99ba-bba4-11e8-b2b9-ac1f6b6435d0.npz    59KB  2021-01-31 05:11:49  
hpa_cell_mask/001838f8-bbca-11e8-b2bc-ac1f6b6435d0.npz    64KB  2021-01-31 05:11:49

In fact there are 43,000 files, almost all of which are inaccessible through the API. I can retrieve any file in the listing above with, e.g., kaggle d download -d its7171/hpa-mask --file hpa_cell_mask/000a6c98-bb9b-11e8-b2b9-ac1f6b6435d0.npz, but every file outside of the listing returns a 404.
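The pattern in both reports is that only files shown by the files listing can be fetched individually. A small Python sketch for checking in advance which wanted files fall outside the visible listing is given below. It assumes the listing output has been captured as text in the layout shown above and that file names contain no spaces; the function names are illustrative only.

```python
from typing import Iterable, List, Tuple


def parse_listed_names(listing: str) -> List[str]:
    """Extract the file names from the text table printed by the files command."""
    names = []
    seen_separator = False
    for line in listing.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # The dashed row separates the header from the data rows.
        if set(stripped) <= {"-", " "}:
            seen_separator = True
            continue
        if seen_separator:
            # Assumption: the name is the first column and has no spaces.
            names.append(stripped.split()[0])
    return names


def split_by_visibility(
    wanted: Iterable[str], listed: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split wanted files into those in the listing and those outside it."""
    listed_set = set(listed)
    visible, hidden = [], []
    for name in wanted:
        (visible if name in listed_set else hidden).append(name)
    return visible, hidden
```

Going by the behavior described in this thread, the names in the second list are the ones expected to return a 404 on a single-file download.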