Yelp / dataset-examples

Samples for users of the Yelp Academic Dataset
http://www.yelp.com/academic_dataset

Cannot import the dataset into Python #20

Open tiechengsu opened 8 years ago

tiechengsu commented 8 years ago

```python
with open('yelp_dataset_challenge_academic_dataset', encoding='utf-8') as f:
    jsondata = json.load(f)
```

I tried to import the dataset into Python with the code above, but it failed with the error: `'utf-8' codec can't decode byte 0xb5`. I also tried `encoding='charmap'`, but that didn't work either. Can anyone tell me how to import the data?
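Aside from the encoding error (see the replies below: the download is actually a tar archive), note that the extracted dataset files are newline-delimited JSON, with one object per line, so a single `json.load` on the whole file would fail anyway. A minimal sketch of per-line parsing (the helper name `load_json_lines` and the path are illustrative, not part of the dataset tooling):

```python
import json

def load_json_lines(path):
    """Parse a newline-delimited JSON file: one json.loads call per line."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Called as `load_json_lines('yelp_academic_dataset_business.json')`, this returns a list of dicts, one per business record.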

bngksgl commented 8 years ago

@tiechengsu i am having the same problem, were you able to solve the issue?

tiechengsu commented 8 years ago

@bngksgl No, I used the previous dataset instead, which you can find here https://app.dominodatalab.com/mtldata/yackathon/browse/yelp_dataset_challenge_academic_dataset It's easier to import. The latest data combines several categories together; no idea how to import it.

Hank-JSJ commented 8 years ago

It's a .tar file, just decompress it again
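That also explains the `0xb5` decode error above: the bytes are archive data, not UTF-8 text. A minimal sketch of the double extraction using the standard-library `tarfile` module (the function name and the assumption that any nested archive ends in `.tar` are mine, not part of the dataset tooling):

```python
import os
import tarfile

def extract_dataset(archive_path, dest_dir):
    """Extract a tar archive; if it contains another .tar, extract that too."""
    with tarfile.open(archive_path) as tf:
        tf.extractall(dest_dir)
    # The Yelp download may nest a second tar inside the first one,
    # so scan the output directory for inner archives and extract them.
    for name in os.listdir(dest_dir):
        if name.endswith('.tar'):
            with tarfile.open(os.path.join(dest_dir, name)) as inner:
                inner.extractall(dest_dir)
```

After running this, the individual `.json` files (business, review, etc.) should appear in `dest_dir`.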

HongxuChenUQ commented 7 years ago

> The latest data combines several categories together; no idea how to import it.

Does that mean review.json, business.json, etc. are stored mixed together in the file?

CAVIND46016 commented 7 years ago

Not really sure where you are all facing errors. I have edited the code to accept .json files explicitly and convert them to .csv. I specified the file paths in the main block explicitly instead of using argparse as in the original code. Let me know if this helps.

Reference:

https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py

"""Convert the Yelp Dataset Challenge dataset from json format to csv. import argparse import collections import csv import json def read_and_write_file(json_file_path, csv_file_path, column_names): """Read in the json dataset file and write it out to a csv file, given the column names.""" with open(csv_file_path, 'w') as fout: csv_file = csv.writer(fout) csv_file.writerow(list(column_names)) with open(json_file_path, encoding = 'utf8') as fin: for line in fin: line_contents = json.loads(line) csv_file.writerow(get_row(line_contents, column_names)) def get_superset_of_column_names_from_file(json_file_path): """Read in the json dataset file and return the superset of column names.""" column_names = set() with open(json_file_path, encoding = 'utf8') as fin: for line in fin: line_contents = json.loads(line) column_names.update( set(get_column_names(line_contents).keys()) ) return column_names def get_column_names(line_contents, parent_key=''): """Return a list of flattened key names given a dict. Example: line_contents = { 'a': { 'b': 2, 'c': 3, }, } will return: ['a.b', 'a.c'] These will be the column names for the eventual csv file. """ column_names = [] for k, v in line_contents.items(): column_name = "{0}.{1}".format(parent_key, k) if parent_key else k if isinstance(v, collections.MutableMapping): column_names.extend( get_column_names(v, column_name).items() ) else: column_names.append((column_name, v)) return dict(column_names) def get_nested_value(d, key): """Return a dictionary item given a dictionary d and a flattened key from get_column_names.

Example:
    d = {
        'a': {
            'b': 2,
            'c': 3,
            },
    }
    key = 'a.b'
    will return: 2

"""
if '.' not in key:
    if key not in d:
        return None
    return d[key]
base_key, sub_key = key.split('.', 1)
if base_key not in d:
    return None
sub_dict = d[base_key]
return get_nested_value(sub_dict, sub_key)

def get_row(line_contents, column_names): """Return a csv compatible row given column names and a dict.""" row = [] for column_name in column_names: line_value = get_nested_value( line_contents, column_name, ) if isinstance(line_value, str): row.append('{0}'.format(line_value.encode('utf-8'))) elif line_value is not None: row.append('{0}'.format(line_value)) else: row.append('') return row if(name == 'main'): """Convert a yelp dataset file from json to csv.""" json_file = [] json_file.append('D:\YELP Dataset\yelp_academic_dataset_business.json'); #args.json_file json_file.append('D:\YELP Dataset\yelp_academic_dataset_checkin.json'); json_file.append('D:\YELP Dataset\yelp_academic_dataset_review.json'); json_file.append('D:\YELP Dataset\yelp_academic_dataset_tip.json'); json_file.append('D:\YELP Dataset\yelp_academic_dataset_user.json'); csv_file = [] for i in range(5): csv_file.append('{}.csv'.format((json_file[i])[0:len(json_file[i])-5])) column_names = get_superset_of_column_names_from_file(json_file[i]) read_and_write_file(json_file[i], csv_file[i], column_names) print('{} converted to {} successfully.'.format(json_file[i], csv_file[i]))

HongxuChenUQ commented 7 years ago

YES! SOLVED! Once you have decompressed the *.tar, do it again on the generated file, and then you will see the different json files.

tootrackminded commented 7 years ago

@CAVIND46016 are you able to post your code in a formatted snippet? Using it in my compiler is producing indentation errors. Thank you!

CAVIND46016 commented 7 years ago

@dotdose : Have a look at the code here, this should work better. https://github.com/CAVIND46016/Yelp-Reviews-Dataset-Analysis/blob/master/json_to_csv_converter.py