delip / PyTorchNLPBook

Code and data accompanying Natural Language Processing with PyTorch published by O'Reilly Media https://amzn.to/3JUgR2L
Apache License 2.0

YELP raw_train.csv file no longer available on Google Drive, please provide alternate source #38

Open richlysakowski opened 1 year ago

richlysakowski commented 1 year ago

raw_train.csv

https://drive.google.com/open?id=1xeUnqkhuzGGzZKThzPeXe2Vf6Uu_g_xM gives a 404 error

Please provide an updated link to the exact dataset used in the book, or to an entirely new set of Yelp CSV-formatted datasets (train, test, and reviews_with_splits_lite).

ajhergenroeder commented 1 year ago

@richlysakowski -- I had the same problem. I think this one on Kaggle is identical -- that's what I'm going to use. https://www.kaggle.com/datasets/ilhamfp31/yelp-review-dataset

photomz commented 1 year ago

@richlysakowski Here's what worked for me in a Jupyter notebook (Google Colab, June 2023). First, make sure ~/.kaggle/kaggle.json exists with 600 permissions:

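from pathlib import Path

# Paste the contents of your kaggle.json API token here.
creds = 'your JSON credentials from Kaggle.com'
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    # Create ~/.kaggle if it is missing, then write the token file.
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(creds)
    # Restrict the file to the current user (mode 600).
    cred_path.chmod(0o600)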

Then, download the dataset directly through the Kaggle API:

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

dataset_slug = 'ilhamfp31/yelp-review-dataset'
api.dataset_download_files(dataset_slug, unzip=True)

You may have to rename a few files and folders:

mkdir -p data/yelp
mv yelp_review_polarity_csv/* data/yelp/
mv data/yelp/test.csv data/yelp/raw_test.csv
mv data/yelp/train.csv data/yelp/raw_train.csv
rm -r yelp_review_polarity_csv/
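
If you would rather stay in Python (for example, in a notebook cell without shell access), the same moves and renames can be sketched with pathlib and shutil. This is only a rough equivalent of the shell commands above: the function name arrange_yelp_files is hypothetical, and the default folder names are taken from those commands rather than checked against the current Kaggle archive.

```python
import shutil
from pathlib import Path


def arrange_yelp_files(src="yelp_review_polarity_csv", dest="data/yelp"):
    """Move the unzipped Kaggle files into dest, renaming train/test to raw_*.

    Hypothetical helper; folder names are assumptions taken from the
    shell commands in this thread.
    """
    src, dest = Path(src), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for item in list(src.iterdir()):
        # Only train.csv and test.csv get the raw_ prefix the notebooks expect.
        if item.name in {"train.csv", "test.csv"}:
            target = dest / f"raw_{item.name}"
        else:
            target = dest / item.name
        shutil.move(str(item), str(target))
    # The source folder is empty at this point, so rmdir succeeds.
    src.rmdir()
    return sorted(p.name for p in dest.iterdir())
```

Calling arrange_yelp_files() from the directory where the archive was unzipped should leave raw_train.csv and raw_test.csv under data/yelp.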

You should then be able to run the rest of the Yelp notebooks as normal.
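
Before running the notebooks, it may be worth a quick look at the first few rows of the renamed file to confirm it parsed as CSV. The sketch below uses only the standard library; peek_csv is a hypothetical helper, and nothing in this thread confirms the column layout, so check the output against what the notebook expects.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_csv(path, n=3):
    """Return the first n rows of a CSV file as lists of strings."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(islice(csv.reader(f), n))
```

For example, peek_csv("data/yelp/raw_train.csv") returns up to three rows, assuming the file ended up at that path.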