amazon-science / esci-data

Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search
https://amazonkddcup.github.io
Apache License 2.0

Read Parquet file failed #11

Closed zhiyuanpeng closed 2 years ago

zhiyuanpeng commented 2 years ago

@franbvalero Hi Fran, thanks for making this dataset public. I'd like to run some experiments on esci-data, but I can't read the Parquet files.

I installed the packages from requirement.txt in my Python 3.8.10 environment:

Package               Version
--------------------- ---------
aicrowd-cli           0.1.15
aiohttp               3.8.1
aiosignal             1.2.0
async-timeout         4.0.2
attrs                 22.1.0
certifi               2022.6.15
charset-normalizer    2.1.1
click                 7.1.2
colorama              0.4.5
commonmark            0.9.1
datasets              1.13.3
dill                  0.3.5.1
filelock              3.8.0
frozenlist            1.3.1
fsspec                2022.8.2
gitdb                 4.0.9
GitPython             3.1.18
huggingface-hub       0.0.19
idna                  3.3
joblib                1.1.0
multidict             6.0.2
multiprocess          0.70.13
nltk                  3.7
numpy                 1.23.2
packaging             21.3
pandas                1.1.5
Pillow                9.2.0
pip                   21.1.1
pyarrow               2.0.0
Pygments              2.13.0
pyparsing             3.0.9
python-dateutil       2.8.2
python-slugify        5.0.2
pytz                  2022.2.1
PyYAML                6.0
pyzmq                 22.1.0
regex                 2022.8.17
requests              2.28.1
requests-toolbelt     0.9.1
rich                  10.16.2
sacremoses            0.0.53
scikit-learn          0.24.1
scipy                 1.9.1
semver                2.13.0
sentence-transformers 2.1.0
sentencepiece         0.1.97
setuptools            56.0.0
six                   1.16.0
smmap                 5.0.0
text-unidecode        1.3
threadpoolctl         3.1.0
tokenizers            0.10.3
toml                  0.10.2
torch                 1.12.1
torchvision           0.13.1
tqdm                  4.64.1
transformers          4.11.0
typing-extensions     4.3.0
urllib3               1.26.12
xxhash                3.0.0
yarl                  1.8.1

Reading the Parquet file fails with this error:

pyarrow.lib.ArrowInvalid: Could not open Parquet input source 'esci-data/shopping_queries_dataset/shopping_queries_dataset_examples.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Do you know how to solve this problem? Thanks.

franbvalero commented 2 years ago

Hi @zhiyuanpeng,

Thank you for your interest in the dataset.

This happened to me once too. You can clone the repository again, since the files may not have downloaded correctly, or you can try downloading the corresponding Parquet files manually:

Let me know if this solves your problem.
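
Before re-cloning, it may help to check what is actually on disk. A valid Parquet file begins and ends with the four bytes `PAR1`, which is what the "magic bytes not found in footer" message refers to. The sketch below is only a rough diagnostic using the standard library; `diagnose_parquet` and its return labels are hypothetical names made up for this example, and the Git LFS check assumes the repository stores its Parquet files through Git LFS, which this thread does not confirm.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"                        # first and last 4 bytes of a Parquet file
LFS_POINTER_PREFIX = b"version https://git-lfs"  # first line of a Git LFS pointer file


def diagnose_parquet(path):
    """Return a short label guessing why a file may fail to open as Parquet.

    Hypothetical helper for illustration; it inspects only the first and
    last bytes and does not validate the Parquet footer itself.
    """
    p = Path(path)
    if p.stat().st_size < 12:
        # Too small to hold header magic, footer length, and footer magic.
        return "too_small"
    with p.open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
        f.seek(-4, 2)  # 4 bytes before end of file
        tail = f.read(4)
    if head.startswith(LFS_POINTER_PREFIX):
        # A small text stub was checked out instead of the real data.
        return "git_lfs_pointer"
    if head[:4] != PARQUET_MAGIC:
        return "bad_header"
    if tail != PARQUET_MAGIC:
        # Header is fine but the end is missing or damaged,
        # which is typical of an interrupted download.
        return "truncated_or_corrupt"
    return "looks_ok"
```

Calling `diagnose_parquet("esci-data/shopping_queries_dataset/shopping_queries_dataset_examples.parquet")` on the failing machine would return `"git_lfs_pointer"` or `"truncated_or_corrupt"` if the download was incomplete, in which case fetching the files again is the likely fix.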

zhiyuanpeng commented 2 years ago

@franbvalero Thanks for your quick reply. I was able to read the Parquet files successfully on another server. The error on my side may have been caused by the VPN. I will try again later. Thanks!
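
Since the same file reads fine on one server and fails on another, comparing checksums of the two copies is one way to confirm that the failing copy was damaged in transit. This is a minimal sketch using only the standard library; `sha256_of_file` is a hypothetical helper name, and the thread does not provide reference checksums, so the comparison is only between your own two copies.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks.

    Hypothetical helper for illustration; chunked reading keeps memory
    use low for large dataset files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on both servers against the same Parquet file and compare the two hex strings; different digests mean the copies differ, which points to the download rather than the pyarrow installation.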