dbogdanov / ismir2017-discogs

Examples of analysis of editorial metadata from the Discogs database
https://dbogdanov.github.io/ismir2017-discogs

ValueError: Unterminated string starting at: line 1 column 1175 (char 1174) #1

Open loretoparisi opened 6 years ago

loretoparisi commented 6 years ago

Hello, I get this error when running preprocess_releases_json_to_hdf_pandas.py:

Loading json dump into a pandas DataFrame
Processed 500000 releases
Processed 1000000 releases
Processed 1500000 releases
Processed 2000000 releases
Processed 2500000 releases
Processed 3000000 releases
Processed 3500000 releases
Processed 4000000 releases
Processed 4500000 releases
Processed 5000000 releases
Processed 5500000 releases
Processed 6000000 releases
Processed 6500000 releases
Processed 7000000 releases
Processed 7500000 releases
Processed 8000000 releases
Processed 8500000 releases
Processed 9000000 releases
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-4df24afa67c8> in <module>()
----> 1 from preprocess_releases_json_to_hdf_pandas.py import *

/Users/loretoparisi/Documents/Projects/AI/ismir2017-discogs/code/preprocess_releases_json_to_hdf_pandas.py in <module>()
    134 else:
    135     print("Loading json dump into a pandas DataFrame")
--> 136     data = load_releases(ignore_genres=IGNORE_GENRES, part=100)
    137     print("Saving DataFrame to %s" % dump_pandas)
    138     data.to_hdf(dump_pandas, 'w')

/Users/loretoparisi/Documents/Projects/AI/ismir2017-discogs/code/preprocess_releases_json_to_hdf_pandas.py in load_releases(size, part, ignore_genres)
     69             if not i % (100/part):
     70 
---> 71                 release = json.loads(jsonline)
     72 
     73                 # remove some columns that we won't use to save memory

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336             parse_int is None and parse_float is None and
    337             parse_constant is None and object_pairs_hook is None and not kw):
--> 338         return _default_decoder.decode(s)
    339     if cls is None:
    340         cls = JSONDecoder

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    364 
    365         """
--> 366         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    367         end = _w(s, end).end()
    368         if end != len(s):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    380         """
    381         try:
--> 382             obj, end = self.scan_once(s, idx)
    383         except StopIteration:
    384             raise ValueError("No JSON object could be decoded")

ValueError: Unterminated string starting at: line 1 column 1175 (char 1174)

I have updated the config to use the 2018 data dumps here: https://github.com/loretoparisi/ismir2017-discogs/blob/master/code/config.py. Everything worked properly up to this point, so in my data/ folder I have:

ip-192-168-22-127:discogs loretoparisi$ tree -L 1 -h
.
├── [239M]  discogs_20180101_artists.xml.gz
├── [ 39M]  discogs_20180101_labels.xml.gz
├── [152M]  discogs_20180101_masters.xml.gz
├── [9.0G]  discogs_20180101_releases.json.dump
└── [5.1G]  discogs_20180101_releases.xml.gz

0 directories, 5 files

dbogdanov commented 6 years ago

Hi @loretoparisi, I'll have a look and try this new dump next week.
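
An "Unterminated string" error from `json.loads` usually means that one line of the dump holds a truncated or otherwise malformed record, for example if the file was cut short while being written. The sketch below is one way to locate such lines. It assumes the dump stores one JSON object per line, as the `json.loads(jsonline)` call in the traceback suggests. `find_bad_json_lines` is a hypothetical helper, not part of this repository.

```python
import io
import json


def find_bad_json_lines(path, max_report=10):
    """Return a list of (line_number, error_message) for lines that fail to parse.

    Hypothetical diagnostic helper; assumes one JSON object per line.
    Stops after max_report failures so a badly damaged file is not scanned in full.
    """
    bad = []
    # io.open behaves the same on Python 2.7 and Python 3.
    with io.open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                json.loads(line)
            except ValueError as e:
                # On Python 3, json.JSONDecodeError is a subclass of ValueError.
                bad.append((lineno, str(e)))
                if len(bad) >= max_report:
                    break
    return bad
```

Calling `find_bad_json_lines("discogs_20180101_releases.json.dump")` from the data/ folder would list the offending line numbers. If the only bad line is the last one in the file, the dump was probably truncated and may need to be regenerated from the XML.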