I managed to get the script processFiles.py to run by converting the JSONL files to JSON,
but running the script gives errors:
Traceback (most recent call last):
File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
yield len(line), json.loads(line)
^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 19 (char 18)
Error parsing line: "url": "https://www.reddit.com/r/pushshift/comments/1ddyx6a/a_help_regarding_jsonjsonl_files/",
Traceback (most recent call last):
File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
yield len(line), json.loads(line)
^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 10 (char 9)
Error parsing line: "user_reports": [],
Traceback (most recent call last):
File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
yield len(line), json.loads(line)
^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 19 (char 18)
Error parsing line: "view_count": null,
Traceback (most recent call last):
File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
yield len(line), json.loads(line)
^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 17 (char 16)
Error parsing line: "visited": false,
Traceback (most recent call last):
File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
yield len(line), json.loads(line)
^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 14 (char 13)
Error parsing line: "whitelist_status": "all_ads",
Traceback (most recent call last):
File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
yield len(line), json.loads(line)
^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 23 (char 22)
Error parsing line: "wls": 6
In this line, change .json to .jsonl. Then it should work. I should correct that soon.
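For context, "Extra data" errors like the ones in the traceback are what Python's json module reports when a pretty-printed JSON file is parsed one line at a time: each line is only a fragment such as "wls": 6, not a complete object. A minimal sketch that reproduces this, assuming a made-up two-field record (it is not taken from the real data):

```python
import json

# Hypothetical record, used only for illustration.
record = {"author": "example_user", "wls": 6}

# JSONL style: the whole object sits on one line, so per-line parsing works.
jsonlLine = json.dumps(record)
parsed = json.loads(jsonlLine)

# Pretty-printed JSON: the same object is spread over several lines.
prettyLines = json.dumps(record, indent=4).splitlines()

# Parsing each line on its own fails, because no single line is valid JSON.
errors = []
for line in prettyLines:
    try:
        json.loads(line)
    except json.JSONDecodeError as e:
        errors.append(e.msg)

print(errors)
```

The lines that hold a key and a value fail with "Extra data" because the parser reads the key as a complete string and then finds more text after it.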
Ok that worked.
"Add your code to the processRow function"
What does the readme mean by this last point? The function is in processFiles.py, but what do I do next?
I'm extremely sorry if this seems very basic, but this is my first time working with JSON/JSONL files.
For clarification, the .jsonl file extension means that each line of text in the file is one JSON object. One JSON object is one post or comment, depending on the file you are reading.
What you do with that JSON object depends on your actual goal. Examples of fields you can use:
username = row["author"]
subreddit = row["subreddit"]
# comments only
commentText = row["body"]
# posts only
postTitle = row["title"]
postText = row["selftext"]
postUrl = row["url"]
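As a rough illustration of how the fields above could be used, here is a small sketch. The helper name summarizeRow, the counter, and the sample rows are all hypothetical, and the actual signature of processRow in processFiles.py may differ, so treat this as something to adapt rather than paste in as-is:

```python
from collections import Counter

# Running tally of rows per subreddit, kept outside the function so it
# survives across calls.
rowsPerSubreddit = Counter()

def summarizeRow(row):
    """Hypothetical helper: count the row and return a short text summary."""
    rowsPerSubreddit[row["subreddit"]] += 1
    if "body" in row:
        # Comments carry their text in "body".
        kind, text = "comment", row["body"]
    else:
        # Posts carry a "title" (plus "selftext" and "url").
        kind, text = "post", row["title"]
    return f'{kind} by {row["author"]} in r/{row["subreddit"]}: {text[:60]}'

# Made-up sample rows, only to show the shape of the data.
samplePost = {"author": "example_user", "subreddit": "pushshift",
              "title": "Example title", "selftext": "", "url": "https://example.com"}
sampleComment = {"author": "another_user", "subreddit": "pushshift",
                 "body": "Example comment text"}

print(summarizeRow(samplePost))
print(summarizeRow(sampleComment))
print(rowsPerSubreddit)
```

Something like this could be called from inside processRow once per row, with the tally printed after the whole file has been read.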
In general, I would recommend just taking a look at the data you are dealing with to see what kinds of fields there are. Open a JSONL file in a text editor, copy one line, format the JSON object (with an IDE or an online tool), and get a feel for what you are dealing with.
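If the file is too large for a text editor, the same inspection can be done with a few lines of Python. This is only a sketch: formatFirstRow is a hypothetical helper name and the file path in the usage comment is a placeholder.

```python
import json

def formatFirstRow(lines):
    """Return the first non-empty JSONL line as indented JSON text."""
    for line in lines:
        if line.strip():
            # Parse the single-line object, then re-serialize it with
            # indentation and sorted keys so the fields are easy to scan.
            return json.dumps(json.loads(line), indent=2, sort_keys=True)
    return None

# Usage with a real file (the path is a placeholder):
# with open("your_file.jsonl", encoding="utf-8") as f:
#     print(formatFirstRow(f))
```

Because a file object yields one line at a time, only the first row is read, which keeps this fast even on very large dumps.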
Okay thank you so much
As the title says
I tried a few tools, including MS VS Code, but the JSONL files show errors. Is there any way to open them?