JosephLai241 / URS

Universal Reddit Scraper - A comprehensive Reddit scraping/archival command-line tool.
https://josephlai241.github.io/URS/
MIT License

URS v3.2.1 | The Forest, Tons of Metadata, and Eye Candy #24

JosephLai241 closed this 3 years ago

JosephLai241 commented 3 years ago

Overview

Summary

The Forest, Tons of Metadata, and Eye Candy

The Forest

Structured comments get a long-overdue upgrade - comments of all levels are now included in scrape files! Users can export an entire submission's comment tree to JSON in this format. Inspired by PRAW's CommentForest naming convention, I have created a new Forest class and a CommentNode object to make this possible.

I made some changes to the submission comments scraper's UI. URS can now export all comments in either structured or raw format.

The structured comments format is now the default - I believe it is more useful than the raw format. Subsequently, I have added a --raw flag for users who would prefer to export to the raw format instead, as sketched below.
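For example, assuming the Urs.py entry point described in the README (the URL and result count here are illustrative - only the -c and --raw flags come from this release):

# Export a submission's comments in the new structured format (the default).
$ ./Urs.py -c https://www.reddit.com/r/AskReddit/comments/abc123/a_title/ 10

# Pass --raw to export in the flat, raw format instead.
$ ./Urs.py -c https://www.reddit.com/r/AskReddit/comments/abc123/a_title/ 10 --raw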

Tons of Metadata

I have added more metadata for the Subreddit, submission comments, and especially the Redditor scraper. A full list of new attributes is provided in the List All Changes That Have Been Made section as well as in the README.

Eye Candy

Additionally, I have added more eye candy. Halo has been implemented to spice up the output - it is now more informative yet maintains a minimalist style, especially for the Redditor scraper.
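For anyone unfamiliar with Halo, this is roughly what it provides (the spinner text and style here are illustrative, not URS's exact configuration):

from halo import Halo

# Show a spinner while long-running work (like a scrape) is in progress.
spinner = Halo(text="Scraping Redditor data...", spinner="dots")
spinner.start()
# ... scraping work happens here ...
spinner.succeed("Finished scraping.")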

This pull request will deviate a bit from the standard template. I will add a section for the new source code because I want to explain how it works.

Motivation/Context

I am a self-taught software developer who just recently graduated from college, and I am currently looking for my first full-time job. I do not have a computer science degree, so I have had to teach myself a ton of concepts that I would have learned had I pursued the degree. A class I wish I had been able to take in college is data structures and algorithms, because that seems to be all the buzz when it comes to the technical interview - something I unfortunately struggle with greatly due to my lack of experience and practice.

Recently I have been teaching myself DSA. Implementing simple examples of each topic within DSA was not so bad (I am currently working on a study guide/reference repository containing these implementations in both Python and Rust that I will make public soon), but practicing Leetcode problems was and still is a difficult process for me. I will continue to power through the struggle, though, because my livelihood and future career depend on it.

While it has not been a smooth journey, I have come to realize how useful DSA is and am implementing what I have learned in a real-world use case. I do not think I would have been able to figure out a solution to the structured comments scraper's prior shortcomings if I had not studied this area within computer science. I recently implemented my first trie and was fascinated by how abstract data structures worked. I immediately realized I needed to use a tree data structure for the structured comments scraper in order to take it to the next level, which is the purpose of this pull request.

How the Forest Works

I will strip docstring comments from the source code to keep it relatively short.

The CommentNode

I created a class CommentNode to store each comment's metadata and replies:

class CommentNode():
    def __init__(self, metadata):
        # Dynamically set an attribute for each key in the metadata dictionary.
        for key, value in metadata.items():
            self.__setattr__(key, value)

        # Child CommentNodes (replies to this comment) are appended here.
        self.replies = []

I used __setattr__() because the root node differs from the standard comment node schema. By using __setattr__(), CommentNode attributes are dynamically set based on the metadata dictionary that is passed in. self.replies holds additional CommentNodes.
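To illustrate with made-up metadata - the dictionary keys become attributes on the node:

comment_node = CommentNode({
    "id": "qwerty1",
    "body": "A top level comment here.",
})

print(comment_node.id)       # qwerty1
print(comment_node.body)     # A top level comment here.
print(comment_node.replies)  # []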

The Forest

Next, I created a class Forest which holds the root node and includes methods for insertion.

The Root Node

First, let's go over the root node.

class Forest():
    def __init__(self):
        self.root = CommentNode({ "id": "abc123" })

The only key in the dictionary passed into CommentNode is id, so the root CommentNode will only contain the attributes self.id and self.replies. A mock submission ID is shown here. The actual source code pulls the submission's ID based on the URL that was passed into the -c flag and sets the id value accordingly.
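A rough sketch of that step, assuming an authenticated praw.Reddit instance (the credentials and URL are placeholders, not the exact source):

import praw

reddit = praw.Reddit(
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
    user_agent="urs-sketch",
)

# Resolve the submission from the URL passed to the -c flag, then replace
# the mock "abc123" on the root node with the submission's real ID.
submission = reddit.submission(url="https://www.reddit.com/r/Python/comments/abc123/")
forest = Forest()
forest.root.id = submission.id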

Before I get to the insertion methods, I will explain how comments and their replies are linked.

How PRAW Comments Are Linked

PRAW returns all submission comments in level order: all top-level comments are returned first, followed by all second-level replies, then all third-level replies, and so on.

I will create some mock comment objects to demonstrate. Here is a top-level comment corresponding to the mock submission ID. Note that the parent_id contains the submission's id, which is stored in self.root.id:

{
    "author": "u/asdfasdfasdfasdf",
    "body": "A top level comment here.",
    "created_utc": "06-06-2006 06:06:06",
    "distinguished": null,
    "edited": false,
    "id": "qwerty1",
    "is_submitter": false,
    "link_id": "t3_abc123",
    "parent_id": "t3_abc123",
    "score": 666,
    "stickied": false
}

Here is a second-level reply to the top comment. Note the parent_id contains the top comment's id:

{
    "author": "u/hjklhjklhjklhjkl",
    "body": "A reply here.",
    "created_utc": "06-06-2006 18:06:06",
    "distinguished": null,
    "edited": false,
    "id": "hjkl234",
    "is_submitter": true,
    "link_id": "t3_abc123",
    "parent_id": "t1_qwerty1",
    "score": 6,
    "stickied": false
}

This pattern continues all the way down to the deepest level of comments, which makes it easy to link the correct comments together. I do this by calling split("_", 1) on the parent_id and comparing the second item of the resulting list against each node's id. Passing 1 as the maxsplit argument forces a single split, so only the type prefix is stripped.
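Concretely, using the mock reply above:

parent_id = "t1_qwerty1"

# maxsplit=1 strips only the type prefix ("t1_", "t3_", etc.).
print(parent_id.split("_", 1)[1])  # qwerty1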

The Insertion Methods

I then defined the methods for CommentNode insertion.

    def _dfs_insert(self, new_comment):
        # The front of the list serves as the top of the stack, making this
        # an iterative depth-first search.
        stack = []
        stack.append(self.root)

        # Track visited CommentNodes so a node is never pushed onto the
        # stack more than once.
        visited = set()
        visited.add(self.root)

        # PRAW returns comments in level order, so the new comment's parent
        # is guaranteed to already be somewhere in the Forest.
        found = False
        while not found:
            current_comment = stack.pop(0)

            for reply in current_comment.replies:
                # Strip the type prefix from parent_id before comparing IDs.
                if new_comment.parent_id.split("_", 1)[1] == reply.id:
                    reply.replies.append(new_comment)
                    found = True
                else:
                    if reply not in visited:
                        stack.insert(0, reply)
                        visited.add(reply)

    def seed(self, new_comment):
        parent_id = new_comment.parent_id.split("_", 1)[1]

        # Top-level comments attach directly to the root node; everything
        # else is inserted via depth-first search.
        self.root.replies.append(new_comment) \
            if parent_id == getattr(self.root, "id") \
            else self._dfs_insert(new_comment)

I implemented the depth-first search algorithm to find a comment's parent node and insert the comment into the parent node's replies array. I defined a separate visited set to keep track of visited CommentNodes, which avoids an infinite loop caused by pushing CommentNodes that have already been visited back onto the stack. At first I wrote a recursive version of depth-first search, but then opted for an iterative version because recursion would not scale well for submissions containing large numbers of comments - a deep enough comment chain would overflow the call stack.

Within the seed method, I first check whether the CommentNode is a top-level comment by comparing its parent ID to the submission ID. Depth-first search is only triggered if the CommentNode is not a top-level comment.
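Putting it all together, here is a minimal sketch that seeds a Forest with the two mock comments from earlier (only the fields needed for insertion are included):

top_level = CommentNode({
    "id": "qwerty1",
    "parent_id": "t3_abc123",
    "body": "A top level comment here.",
})
reply = CommentNode({
    "id": "hjkl234",
    "parent_id": "t1_qwerty1",
    "body": "A reply here.",
})

forest = Forest()
forest.seed(top_level)  # parent is the submission, so it attaches to the root
forest.seed(reply)      # DFS finds "qwerty1" and nests the reply under it

print(forest.root.replies[0].id)             # qwerty1
print(forest.root.replies[0].replies[0].id)  # hjkl234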

Serializing the Forest

Since Python's built-in JSON module can only serialize built-in types that have a direct JSON equivalent, a custom encoder is necessary to convert the Forest into JSON format. I defined this in Export.py.

from json import JSONEncoder

class EncodeNode(JSONEncoder):
    def default(self, obj):
        # Fall back to the object's __dict__, which turns a CommentNode
        # (and, recursively, the CommentNodes in its replies list) into
        # nested dictionaries.
        return obj.__dict__

The default() method overrides JSONEncoder's default() method and serializes a CommentNode instance by returning its __dict__, a plain dictionary that the JSON module can encode directly:

EncodeNode().encode(comment_node)

This ensures a node can be correctly encoded before I call the seed() method to insert a new CommentNode into the replies array of its respective parent CommentNode.

I can then use this custom JSONEncoder subclass while exporting by specifying it within json.dump() with the cls kwarg:

import json

with open(filename, "w", encoding = "utf-8") as results:
    json.dump(data, results, indent = 4, cls = EncodeNode)
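With the two mock comments from earlier seeded into the Forest, dumping forest.root produces JSON along these lines (abridged to the fields used in the sketch above):

{
    "id": "abc123",
    "replies": [
        {
            "id": "qwerty1",
            "parent_id": "t3_abc123",
            "body": "A top level comment here.",
            "replies": [
                {
                    "id": "hjkl234",
                    "parent_id": "t1_qwerty1",
                    "body": "A reply here.",
                    "replies": []
                }
            ]
        }
    ]
}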

This was how the structured comments export was implemented. I will also paste this short walkthrough into a new markdown file in the docs directory for my own future reference (and for whoever else is curious). Refer to the source code located in urs/praw_scrapers/Comments.py to see more. I hope this was somewhat interesting and/or informative!

New Dependencies

astroid==2.5.1
attrs==20.3.0
certifi==2020.12.5
chardet==4.0.0
colorama==0.4.4
coverage==5.5
halo==0.0.31
idna==2.10
iniconfig==1.1.1
isort==5.8.0
kiwisolver==1.3.1
lazy-object-proxy==1.5.2
log-symbols==0.0.14
more-itertools==8.7.0
numpy==1.20.1
packaging==20.9
Pillow==8.1.2
praw==7.2.0
prawcore==2.0.0
prettytable==2.1.0
py==1.10.0
pylint==2.7.2
pytest==6.2.2
pytest-cov==2.11.1
requests==2.25.1
six==1.15.0
spinners==0.0.24
termcolor==1.1.0
toml==0.10.2
urllib3==1.26.4
wcwidth==0.2.5
websocket-client==0.58.0

Issue Fix or Enhancement Request

Fulfills Richard Larrabee's enhancement request, made over email, to add the author to Subreddit submission metadata.

Type of Change

Breaking Change

N/A

List All Changes That Have Been Made

Added

Changed

Deprecated

How Has This Been Tested?

Test Configuration

Dependencies

astroid==2.5.1
attrs==20.3.0
certifi==2020.12.5
chardet==4.0.0
colorama==0.4.4
coverage==5.5
cycler==0.10.0
halo==0.0.31
idna==2.10
iniconfig==1.1.1
isort==5.8.0
kiwisolver==1.3.1
lazy-object-proxy==1.5.2
log-symbols==0.0.14
matplotlib==3.3.4
mccabe==0.6.1
more-itertools==8.7.0
numpy==1.20.1
packaging==20.9
Pillow==8.1.2
pluggy==0.13.1
praw==7.2.0
prawcore==2.0.0
prettytable==2.1.0
py==1.10.0
pylint==2.7.2
pyparsing==2.4.7
pytest==6.2.2
pytest-cov==2.11.1
python-dateutil==2.8.1
requests==2.25.1
six==1.15.0
spinners==0.0.24
termcolor==1.1.0
toml==0.10.2
update-checker==0.18.0
urllib3==1.26.4
wcwidth==0.2.5
websocket-client==0.58.0
wordcloud==1.8.1
wrapt==1.12.1

Checklist