tiechengsu opened 8 years ago
@tiechengsu I am having the same problem. Were you able to solve the issue?
@bngksgl No, I used the previous dataset instead, which you can find here: https://app.dominodatalab.com/mtldata/yackathon/browse/yelp_dataset_challenge_academic_dataset It's easier to import. The latest dataset combines several categories together; no idea how to import it.
It's a .tar file, just decompress it again
The latest dataset combines several categories together; no idea how to import it.
Does that mean reviews.json, business.json, etc. are stored mixed together in the same file?
Not really sure where you are all facing errors. I have edited the code to accept the .json files explicitly and convert them to .csv. The file paths are specified explicitly in the main block instead of being passed via argparse as in the original code. Let me know if this helps.
"""Convert the Yelp Dataset Challenge dataset from json format to csv.
import argparse
import collections
import csv
import json
def read_and_write_file(json_file_path, csv_file_path, column_names):
"""Read in the json dataset file and write it out to a csv file, given the column names."""
with open(csv_file_path, 'w') as fout:
csv_file = csv.writer(fout)
csv_file.writerow(list(column_names))
with open(json_file_path, encoding = 'utf8') as fin:
for line in fin:
line_contents = json.loads(line)
csv_file.writerow(get_row(line_contents, column_names))
def get_superset_of_column_names_from_file(json_file_path):
"""Read in the json dataset file and return the superset of column names."""
column_names = set()
with open(json_file_path, encoding = 'utf8') as fin:
for line in fin:
line_contents = json.loads(line)
column_names.update(
set(get_column_names(line_contents).keys())
)
return column_names
def get_column_names(line_contents, parent_key=''):
"""Return a list of flattened key names given a dict.
Example:
line_contents = {
'a': {
'b': 2,
'c': 3,
},
}
will return: ['a.b', 'a.c']
These will be the column names for the eventual csv file.
"""
column_names = []
for k, v in line_contents.items():
column_name = "{0}.{1}".format(parent_key, k) if parent_key else k
if isinstance(v, collections.MutableMapping):
column_names.extend(
get_column_names(v, column_name).items()
)
else:
column_names.append((column_name, v))
return dict(column_names)
def get_nested_value(d, key):
"""Return a dictionary item given a dictionary d
and a flattened key from get_column_names
.
Example:
d = {
'a': {
'b': 2,
'c': 3,
},
}
key = 'a.b'
will return: 2
"""
if '.' not in key:
if key not in d:
return None
return d[key]
base_key, sub_key = key.split('.', 1)
if base_key not in d:
return None
sub_dict = d[base_key]
return get_nested_value(sub_dict, sub_key)
def get_row(line_contents, column_names): """Return a csv compatible row given column names and a dict.""" row = [] for column_name in column_names: line_value = get_nested_value( line_contents, column_name, ) if isinstance(line_value, str): row.append('{0}'.format(line_value.encode('utf-8'))) elif line_value is not None: row.append('{0}'.format(line_value)) else: row.append('') return row if(name == 'main'): """Convert a yelp dataset file from json to csv.""" json_file = [] json_file.append('D:\YELP Dataset\yelp_academic_dataset_business.json'); #args.json_file json_file.append('D:\YELP Dataset\yelp_academic_dataset_checkin.json'); json_file.append('D:\YELP Dataset\yelp_academic_dataset_review.json'); json_file.append('D:\YELP Dataset\yelp_academic_dataset_tip.json'); json_file.append('D:\YELP Dataset\yelp_academic_dataset_user.json'); csv_file = [] for i in range(5): csv_file.append('{}.csv'.format((json_file[i])[0:len(json_file[i])-5])) column_names = get_superset_of_column_names_from_file(json_file[i]) read_and_write_file(json_file[i], csv_file[i], column_names) print('{} converted to {} successfully.'.format(json_file[i], csv_file[i]))
YES! SOLVED! Once you have decompressed the *.tar, decompress the generated file again, and then you will see the different json files.
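If you prefer to do the double extraction in Python rather than with an archive tool, here is a minimal sketch (the archive and directory names are assumptions based on the comments above; adjust them to whatever you downloaded):

import tarfile

# First pass: extract the downloaded .tar; this typically yields a single file
# without an extension (e.g. 'yelp_dataset_challenge_academic_dataset').
with tarfile.open('yelp_dataset_challenge_academic_dataset.tar') as outer:
    outer.extractall('.')

# Second pass: that extracted file is itself a tar archive, so extract it again
# to get the individual yelp_academic_dataset_*.json files.
with tarfile.open('yelp_dataset_challenge_academic_dataset') as inner:
    print(inner.getnames())        # lists the separate .json files
    inner.extractall('yelp_data')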
@CAVIND46016 are you able to post your code in a formatted snippet? Using it in my compiler is producing indentation errors. Thank you!
@dotdose : Have a look at the code here, this should work better. https://github.com/CAVIND46016/Yelp-Reviews-Dataset-Analysis/blob/master/json_to_csv_converter.py
with open('yelp_dataset_challenge_academic_dataset', encoding='utf-8') as f:
    jsondata = json.load(f)

I tried to import the dataset into Python with the code above, but it failed. The error is that the 'utf-8' codec can't decode byte 0xb5. I also tried encoding='charmap', but it didn't work either. Can anyone tell me how to import the data?
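That error is expected: the file without an extension is itself a tar archive (binary data), not a JSON text file, so opening it in text mode fails regardless of the encoding you pick. And even after extraction, the .json files are newline-delimited (one JSON object per line), so json.load on a whole file will also fail. A minimal sketch of reading one of the extracted files line by line (the directory and file names are assumptions based on the comments above):

import json

reviews = []
# Each extracted file is newline-delimited JSON: one complete object per line,
# so parse each line with json.loads instead of calling json.load on the file.
with open('yelp_data/yelp_academic_dataset_review.json', encoding='utf-8') as fin:
    for line in fin:
        reviews.append(json.loads(line))

print(len(reviews), 'reviews loaded')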