ArthurHeitmann / arctic_shift

Making Reddit data accessible to researchers, moderators and everyone else. Interact with the data through large dumps, an API or web interface.
https://arctic-shift.photon-reddit.com

How to open JSONL files downloaded directly from the website #20

Closed: traviss64 closed this issue 4 months ago

traviss64 commented 4 months ago

As the title says

I tried a few tools, including MS VS Code, but the JSONL files show errors. Is there any way to open them?

traviss64 commented 4 months ago

I managed to get the script processFiles.py to run by converting the JSONL files to JSON.

But running the script gives these errors:

Traceback (most recent call last):
  File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
    yield len(line), json.loads(line)
                     ^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 19 (char 18)
Error parsing line:     "url": "https://www.reddit.com/r/pushshift/comments/1ddyx6a/a_help_regarding_jsonjsonl_files/",

Traceback (most recent call last):
  File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
    yield len(line), json.loads(line)
                     ^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 10 (char 9)
Error parsing line:     "user_reports": [],

Traceback (most recent call last):
  File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
    yield len(line), json.loads(line)
                     ^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 19 (char 18)
Error parsing line:     "view_count": null,

Traceback (most recent call last):
  File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
    yield len(line), json.loads(line)
                     ^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 17 (char 16)
Error parsing line:     "visited": false,

Traceback (most recent call last):
  File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
    yield len(line), json.loads(line)
                     ^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 14 (char 13)
Error parsing line:     "whitelist_status": "all_ads",

Traceback (most recent call last):
  File "D:\Software\arctic_shift\scripts\fileStreams.py", line 58, in getJsonFileJsonStream
    yield len(line), json.loads(line)
                     ^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 23 (char 22)
Error parsing line:     "wls": 6

ArthurHeitmann commented 4 months ago

In this line change .json to .jsonl. Then it should work. I should correct that soon.
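
For context on the traceback above: a line-by-line reader calls json.loads on each line, so a pretty-printed .json file (one field per line) hands it fragments instead of complete objects. A minimal sketch of that failure, using a made-up record and not the actual code in fileStreams.py:

```python
import json

# One complete object on a single line: what a JSONL reader expects.
jsonl_line = '{"author": "example_user", "wls": 6}'
print(json.loads(jsonl_line)["wls"])

# One line from a pretty-printed JSON file: only a fragment of an object.
fragment = '    "wls": 6'
try:
    json.loads(fragment)
    error_message = None
except json.JSONDecodeError as err:
    # The parser reads the string "wls" and then finds leftover text after it,
    # which it reports as "Extra data".
    error_message = str(err)
print(error_message)
```

This is why converting the download to formatted JSON makes things worse: the original .jsonl file already has the shape the script reads.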

traviss64 commented 4 months ago

Ok that worked.

"Add your code to the processRow function"

What does the readme mean by this last point? This function is in processFiles.py, but what do I do next?

I'm extremely sorry if this seems very basic, but this is my first time working with JSON/JSONL files.

ArthurHeitmann commented 4 months ago

For clarification, the .jsonl file extension means that each line of text in the file is one JSON object. One JSON object is one post or one comment, depending on the file you are reading.
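
A minimal sketch of reading such a file in plain Python, independent of the repository's scripts. The helper name iter_jsonl, the sample records, and the file name in the comment are assumptions made for illustration:

```python
import json
from typing import Any, Iterable, Iterator

def iter_jsonl(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at the end
            yield json.loads(line)

# Made-up sample lines standing in for a downloaded file.
sample = [
    '{"author": "alice", "subreddit": "pushshift", "body": "a comment"}\n',
    '{"author": "bob", "subreddit": "pushshift", "title": "a post"}\n',
]
rows = list(iter_jsonl(sample))
print(len(rows), rows[0]["author"])

# With a real file (placeholder path), pass the open file object instead:
# with open("posts.jsonl", encoding="utf-8") as f:
#     for row in iter_jsonl(f):
#         ...
```

Because the helper works on any iterable of lines, it streams a large file one line at a time and never loads the whole file into memory.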

What you do with that JSON object depends on what your actual goal is. Here are some examples of fields you can use:

username = row["author"]
subreddit = row["subreddit"]
# comments only
commentText = row["body"]
# posts only
postTitle = row["title"]
postText = row["selftext"]
postUrl = row["url"]
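
As one possible illustration of using these fields, the sketch below counts rows per author and extracts the main text of a row. The function names and sample rows are hypothetical, and the exact signature of processRow in the repository is not shown in this thread, so this is written as standalone code and not as a drop-in replacement:

```python
from collections import Counter
from typing import Any, Iterable

def row_text(row: dict[str, Any]) -> str:
    """Return the main text of a row, whether it is a comment or a post."""
    if "body" in row:  # comments only
        return row["body"]
    # posts only: the title plus the (possibly empty) self text
    return f'{row.get("title", "")}\n{row.get("selftext", "")}'.strip()

def count_authors(rows: Iterable[dict[str, Any]]) -> Counter:
    """Count how many rows each author wrote."""
    counts: Counter = Counter()
    for row in rows:
        counts[row["author"]] += 1
    return counts

# Made-up rows shaped like the fields listed above.
sample_rows = [
    {"author": "alice", "subreddit": "pushshift", "body": "a comment"},
    {"author": "bob", "subreddit": "pushshift", "title": "a post", "selftext": ""},
    {"author": "alice", "subreddit": "pushshift", "title": "another post", "selftext": "details"},
]
print(count_authors(sample_rows).most_common(1))
print(row_text(sample_rows[2]))
```

The same per-row logic could go inside processRow, with any running totals kept in a variable defined outside the function.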

In general, I would recommend just taking a look at the data and seeing what kind of fields there are. Open a JSONL file in a text editor, copy one line, format the JSON object (with an IDE or an online tool), and work out what each field contains.
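
The same inspection can be done in Python if no formatter is at hand. This is a small sketch with a made-up line; describe_line is a hypothetical helper name:

```python
import json

def describe_line(line: str) -> str:
    """Parse one JSONL line and return it as indented, readable JSON."""
    obj = json.loads(line)
    return json.dumps(obj, indent=2, sort_keys=True, ensure_ascii=False)

# Made-up line; with a real file, use the first line read from it instead.
sample_line = '{"subreddit": "pushshift", "author": "alice", "user_reports": []}'
print(describe_line(sample_line))

# Listing only the field names gives a quick overview of what is available.
field_names = sorted(json.loads(sample_line))
print(field_names)
```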

traviss64 commented 4 months ago

Okay thank you so much