Overview
Summary
The Forest, Tons of Metadata, and Eye Candy
The Forest
Structured comments get a long-overdue upgrade - comments of all levels are now included in scrape files! Users can export all comments from an entire submission to JSON in this format. Inspired by PRAW's CommentForest naming convention, I have created a new Forest class and CommentNode object to make this possible.
I made some changes to the submission comments scraper's UI. URS can now export all comments in either structured or raw format.
The structured comments format is now the default - I believe it is more useful than the raw format. Accordingly, I have added a --raw flag for users who would prefer to export to the raw format instead.
Tons of Metadata
I have added more metadata for the Subreddit, submission comments, and especially the Redditor scrapers. A full list of new attributes is provided in the List All Changes That Have Been Made section as well as in the README.
Eye Candy
Additionally, I have added more eye candy. Halo has been implemented to spice up the output - it is now more informative yet maintains a minimalist style, especially for the Redditor scraper's output.
This pull request will deviate a bit from the standard template. I will add a section for the new source code because I want to explain how it works.
Motivation/Context
I am a self-taught software developer who recently graduated from college and am currently looking for my first full-time job. I do not have a computer science degree, so I have had to teach myself a ton of concepts that I would have learned if I had pursued the degree. A class I wish I had been able to take in college is data structures and algorithms, because that seems to be all the buzz when it comes to the technical interview, which I unfortunately struggle with greatly due to my lack of experience and practice.
Recently I have been teaching myself DSA. Implementing simple examples of each topic within DSA was not so bad (I am currently working on a study guide/reference repository containing these implementations in both Python and Rust that I will make public soon), but practicing LeetCode problems was, and still is, a difficult process for me. I will continue to power through the struggle, though, because my livelihood and future career depend on it.
While it has not been a smooth journey, I have come to realize how useful DSA is and am implementing what I have learned in a real-world use case. I do not think I would have been able to figure out a solution to the structured comments scraper's prior shortcomings if I had not studied this area within computer science. I recently implemented my first trie and was fascinated by how abstract data structures worked. I immediately realized I needed to use a tree data structure for the structured comments scraper in order to take it to the next level, which is the purpose of this pull request.
How the Forest Works
I will strip docstring comments from the source code to keep it relatively short.
The CommentNode
I created a class CommentNode to store each comment's metadata and replies:
class CommentNode():
    def __init__(self, metadata):
        for key, value in metadata.items():
            self.__setattr__(key, value)

        self.replies = []
I used __setattr__() because the root node differs from the standard comment node schema. By using __setattr__(), CommentNode attributes are dynamically set based on the metadata dictionary that is passed in. self.replies holds additional CommentNodes.
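For example, here is a minimal sketch of how the attributes end up on a node (the metadata values are hypothetical):

comment_node = CommentNode({
    "author": "user_one",
    "body": "This is a top-level comment.",
    "id": "qwerty1",
    "parent_id": "t3_abc123"
})

print(comment_node.author)    # "user_one"
print(comment_node.replies)   # []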
The Forest
Next, I created a class Forest which holds the root node and includes methods for insertion.
The Root Node
First, let's go over the root node.
class Forest():
    def __init__(self):
        self.root = CommentNode({ "id": "abc123" })
The only key in the dictionary passed into CommentNode is id; therefore, the root CommentNode will only contain the attributes self.id and self.replies. A mock submission ID is shown here - the actual source code pulls the submission's ID from the URL passed into the -c flag and sets the id value accordingly.
Before I get to the insertion methods, I will explain how comments and their replies are linked.
How PRAW Comments Are Linked
PRAW returns all submission comments in level order. This means all top-level comments are returned first, followed by all second-level replies, then third-level, and so on.
I will create some mock comment objects to demonstrate. Here is a top-level comment corresponding to the mock submission ID. Note the parent_id contains the submission's id, which is stored in self.root.id:
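(The field values below are illustrative mocks - only id and parent_id matter for linking.)

{
    "author": "user_one",
    "body": "This is a top-level comment.",
    "id": "qwerty1",
    "parent_id": "t3_abc123"
}

Here is a second-level reply to the top comment. Note the parent_id contains the top comment's id:

{
    "author": "user_two",
    "body": "This is a reply to the top-level comment.",
    "id": "zxcvb2",
    "parent_id": "t1_qwerty1"
}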
This pattern continues all the way down to the last level of comments, which makes it very easy to link the correct comments together. I do this by calling split("_", 1) on the parent_id and then comparing the second item in the split list against existing comment IDs. I also specify the maxsplit parameter to force exactly one split.
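For example, using the mock parent_id values above:

print("t3_abc123".split("_", 1))       # ['t3', 'abc123']
print("t1_qwerty1".split("_", 1)[1])   # 'qwerty1'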
The Insertion Methods
I then defined the methods for CommentNode insertion.
def _dfs_insert(self, new_comment):
    stack = []
    stack.append(self.root)

    visited = set()
    visited.add(self.root)

    found = False
    while not found:
        current_comment = stack.pop(0)

        for reply in current_comment.replies:
            if new_comment.parent_id.split("_", 1)[1] == reply.id:
                reply.replies.append(new_comment)
                found = True
            else:
                if reply not in visited:
                    stack.insert(0, reply)
                    visited.add(reply)

def seed(self, new_comment):
    parent_id = new_comment.parent_id.split("_", 1)[1]

    self.root.replies.append(new_comment) \
        if parent_id == getattr(self.root, "id") \
        else self._dfs_insert(new_comment)
I implemented the depth-first search algorithm to find a comment's parent node and insert the comment into the parent node's replies array. I defined a separate visited set to keep track of visited CommentNodes, which prevents an infinite loop of re-inserting CommentNodes that have already been pushed onto the stack. At first I wrote a recursive version of depth-first search, but then opted for an iterative version because recursion would not scale well for submissions that include large numbers of comments, i.e., it could trigger a stack overflow.
Within the seed method, I first check whether the CommentNode is a top-level comment by comparing its parent ID to the submission ID. Depth-first search is triggered only if the CommentNode is not a top-level comment.
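As a quick, hypothetical illustration using the mock IDs from above (this is not the exact URS driver code):

forest = Forest()

top_level = CommentNode({ "id": "qwerty1", "parent_id": "t3_abc123" })
reply = CommentNode({ "id": "zxcvb2", "parent_id": "t1_qwerty1" })

forest.seed(top_level)    # "abc123" matches the root's id, so it is appended to self.root.replies
forest.seed(reply)        # not a top-level comment, so _dfs_insert() finds "qwerty1" and appends it there

print(forest.root.replies[0].id)              # qwerty1
print(forest.root.replies[0].replies[0].id)   # zxcvb2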
Serializing the Forest
Since Python's built-in JSON module can only handle primitive types that have a direct JSON equivalent, a custom encoder is necessary to convert the Forest into JSON format. I defined this in Export.py.
from json import JSONEncoder

class EncodeNode(JSONEncoder):
    def default(self, object):
        return object.__dict__
The default() method overrides JSONEncoder's default() method and serializes a CommentNode instance by converting it into a dictionary (its __dict__), which is a type that has a direct JSON equivalent:
EncodeNode().encode(comment_node)
This ensures the node is correctly encoded before I call the seed() method to insert a new CommentNode into the replies arrays of its respective parent CommentNode.
I can then use this custom JSONEncoder subclass while exporting by specifying it within json.dump() with the cls kwarg:
with open(filename, "w", encoding = "utf-8") as results:
    json.dump(data, results, indent = 4, cls = EncodeNode)
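Putting it together, a rough sketch of an export call might look like this (the "data" dictionary in URS also holds the submission metadata; the filename and structure here are illustrative):

import json

data = {
    "comments": forest.root.replies
}

with open("comments.json", "w", encoding = "utf-8") as results:
    json.dump(data, results, indent = 4, cls = EncodeNode)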
This was how the structured comments export was implemented. I will also paste this short walkthrough into a new markdown file in the docs directory for my own future reference (and for whoever else is curious). Refer to the source code located in urs/praw_scrapers/Comments.py to see more. I hope this was somewhat interesting and/or informative!
Issue Fix or Enhancement Request
Fulfills Richard Larrabee's enhancement request (made over email) to add the author to Subreddit submission metadata.
Type of Change
[x] Code Refactor
[x] New Feature (non-breaking change which adds functionality)
[x] This change requires a documentation update
Breaking Change
N/A
List All Changes That Have Been Made
Added
User interface
Added Halo to spice up the output while maintaining minimalism.
Source code
Created a comment Forest and accompanying CommentNode.
The Forest contains methods for inserting CommentNodes, including a depth-first search algorithm to do so.
Subreddit.py has been refactored and submission metadata has been added to scrape files:
"author"
"created_utc"
"distinguished"
"edited"
"id"
"is_original_content"
"is_self"
"link_flair_text"
"locked"
"name"
"num_comments"
"nsfw"
"permalink"
"score"
"selftext"
"spoiler"
"stickied"
"title"
"upvote_ratio"
"url"
Comments.py has been refactored and submission comments now include the following metadata:
"author"
"body"
"body_html"
"created_utc"
"distinguished"
"edited"
"id"
"is_submitter"
"link_id"
"parent_id"
"score"
"stickied"
Major refactor for Redditor.py on top of adding additional metadata.
Additional Redditor information has been added to scrape files:
"has_verified_email"
"icon_img"
"subreddit"
"trophies"
Additional Redditor comment, submission, and multireddit metadata has been added to scrape files:
subreddit objects are nested within comment and submission objects and contain the following metadata:
"can_assign_link_flair"
"can_assign_user_flair"
"created_utc"
"description"
"description_html"
"display_name"
"id"
"name"
"nsfw"
"public_description"
"spoilers_enabled"
"subscribers"
"user_is_banned"
"user_is_moderator"
"user_is_subscriber"
comment objects will contain the following metadata:
"type"
"body"
"body_html"
"created_utc"
"distinguished"
"edited"
"id"
"is_submitter"
"link_id"
"parent_id"
"score"
"stickied"
"submission" - contains additional metadata
"subreddit_id"
submission objects will contain the following metadata:
"type"
"author"
"created_utc"
"distinguished"
"edited"
"id"
"is_original_content"
"is_self"
"link_flair_text"
"locked"
"name"
"num_comments"
"nsfw"
"permalink"
"score"
"selftext"
"spoiler"
"stickied"
"subreddit" - contains additional metadata
"title"
"upvote_ratio"
"url"
multireddit objects will contain the following metadata:
"can_edit"
"copied_from"
"created_utc"
"description_html"
"description_md"
"display_name"
"name"
"nsfw"
"subreddits"
"visibility"
interactions are now sorted in alphabetical order.
CLI
Flags
--raw - Export comments in raw format instead (structured format is the default)
README
Added new bullet point for The Forest Markdown file.
Tests
Added a new test for the Status class in Global.py.
Repository documents
Added "The Forest".
This Markdown file is just a place where I describe how I implemented the Forest.
Changed
User interface
Submission comments scraping parameters have changed due to the improvements made in this pull request.
Structured comments is now the default format.
Users will have to include the new --raw flag to export to raw format.
Both structured and raw formats can now scrape all comments from a submission.
Source code
The submission comments JSON file's structure has been modified to fit the new submission_metadata dictionary. "data" is now a dictionary that contains the submission metadata dictionary and scraped comments list. Comments are now stored in the "comments" field within "data".
Exporting Redditor or submission comments scrapes to CSV is now forbidden.
URS will ignore the --csv flag if it is present while trying to use either scraper.
The created_utc field for each Subreddit rule is now converted to readable time.
requirements.txt has been updated.
As of v1.20.0, numpy has dropped support for Python 3.6, which means Python 3.7+ is required for URS.
.travis.yml has been modified to exclude Python 3.6. Added Python 3.9 to test configuration.
Note: Older versions of Python can still be used by downgrading to numpy<=1.19.5.
Reddit object validation block has been refactored.
A new reusable module has been defined at the bottom of Validation.py.
README
Updated the Comments section to reflect new changes to comments scraper UI.
Tests
Updated CLI usage and examples tests.
Updated c_fname() test because submission comments scrapes now follow a different naming convention.
Deprecated
User interface
Specifying 0 comments no longer exports all comments to raw format; it now defaults to the structured format.
Source code
Deprecated many global variables defined in Global.py:
eo
options
s_t
analytical_tools
How Has This Been Tested?
Ran pytest on local machine - all tests have passed.
Travis CI tests for Python versions 3.7, 3.8, and 3.9 have passed.
Test Configuration
See .travis.yml for the full test configuration.