NLP-in-the-Social-Sciences/Reddit-Data-Pipeline
Code and data for an ETL pipeline supporting low-SES research
GNU General Public License v3.0
Scaling and data storage
#1 (Closed)
MoRevolution closed 1 year ago
MoRevolution commented 1 year ago
Tasks completed in this commit:
[x] Built demo database using available sets of keywords
[x] Finalized script to be used for future data collection
[x] Cleaned the database to leave only qualified stories
[x] Created the filtered dataset 'Filtered MDF' based on relevant subreddits and paragraph length
Next to-do:
Perform some final cleanups on the database
  Fix the problem with apostrophes being replaced by a garbled symbol
  Remove records with an empty 'selftext'
Improve the relevance of the stories
  Might require consultation and research
Add parallel computation to improve the scaling of both the submission and comment crawling scripts
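A rough sketch of how the cleanup and scaling items above could look in Python. Everything here is an assumption rather than the repository's actual code: it guesses that the "weird symbol" is the usual result of UTF-8 curly quotes being decoded as Windows-1252, that records are plain dicts with a 'selftext' key, and that crawling is I/O-bound so a thread pool is enough. The names `MOJIBAKE`, `fix_apostrophes`, `clean_records`, `crawl_in_parallel`, and the `fetch` callable are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: the garbled symbol comes from UTF-8 curly quotes that were
# decoded as Windows-1252. Only the two single-quote cases are handled.
MOJIBAKE = {
    "\u00e2\u20ac\u2122": "'",  # mis-decoded right single quote (U+2019)
    "\u00e2\u20ac\u02dc": "'",  # mis-decoded left single quote (U+2018)
}


def fix_apostrophes(text):
    """Replace the known mis-decoded quote sequences with a plain apostrophe."""
    for bad, good in MOJIBAKE.items():
        text = text.replace(bad, good)
    return text


def clean_records(records):
    """Drop records whose 'selftext' is missing or blank, then repair apostrophes."""
    cleaned = []
    for rec in records:
        body = (rec.get("selftext") or "").strip()
        if not body:
            continue  # skip empty, whitespace-only, or missing 'selftext'
        cleaned.append({**rec, "selftext": fix_apostrophes(body)})
    return cleaned


def crawl_in_parallel(fetch, targets, max_workers=4):
    """Call the hypothetical fetch(target) for each target on a thread pool.

    Results come back in the same order as targets.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, targets))
```

The same `crawl_in_parallel` helper could wrap either the submission or the comment crawler by passing a different `fetch` function. API rate limits would still cap how far `max_workers` can usefully be raised.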