Scaling and data storage - Githubissues

NLP-in-the-Social-Sciences / Reddit-Data-Pipeline

Code and data for an ETL pipeline supporting low-SES research

GNU General Public License v3.0

Scaling and data storage #1

Closed MoRevolution closed 1 year ago

MoRevolution commented 1 year ago

Tasks completed in this commit:

[x] Built demo database using available sets of keywords
[x] Finalized script to be used for future data collection
[x] Cleaned the database to leave only qualified stories
[x] Created filtered dataset 'Filtered MDF' based on relevant subreddits and paragraph length.
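
For reference, the subreddit and paragraph-length filter behind 'Filtered MDF' could look roughly like the sketch below. This is not the actual script: the subreddit names, the 200-character cutoff, and the helper names `is_qualified` and `build_filtered` are all hypothetical placeholders, and it assumes each record is a dict with `subreddit` and `selftext` fields and that paragraphs are separated by blank lines.

```python
# Hypothetical values: the real subreddit list and length cutoff are not stated in this issue.
RELEVANT_SUBREDDITS = {"povertyfinance", "personalfinance"}
MIN_PARAGRAPH_CHARS = 200


def is_qualified(record, subreddits=RELEVANT_SUBREDDITS, min_chars=MIN_PARAGRAPH_CHARS):
    """Return True if the record is from a relevant subreddit and has a long enough paragraph."""
    if (record.get("subreddit") or "").lower() not in subreddits:
        return False
    # Assumption: paragraphs in 'selftext' are separated by blank lines.
    paragraphs = [p.strip() for p in (record.get("selftext") or "").split("\n\n")]
    return any(len(p) >= min_chars for p in paragraphs)


def build_filtered(records):
    """Keep only the records that pass both filters, preserving their order."""
    return [r for r in records if is_qualified(r)]
```

Passing the subreddit set and cutoff as parameters should make it easy to tune them once the relevance criteria are settled.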

Next to-do:

Perform some final cleanups on the database
- Fix the problem of apostrophes being replaced by an unexpected symbol
- Remove records with empty 'selftext'
Improve relevance of the stories
- Might require consultation and research
Include parallel computation to improve the scaling of both the submission and comment crawling scripts
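
A minimal sketch of how the cleanup and parallelization items might be approached, under stated assumptions. It assumes the "weird symbol" is the usual mojibake produced when the UTF-8 bytes of a curly apostrophe (U+2019) are decoded as cp1252; if the actual symbol is something else, the replacement string would need to change. The names `clean_records` and `crawl_parallel`, and the `fetch` callable standing in for the existing crawling logic, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: this is how U+2019 appears after its UTF-8 bytes are decoded as cp1252.
MOJIBAKE_APOSTROPHE = "\u00e2\u20ac\u2122"


def clean_records(records):
    """Repair garbled apostrophes and drop records whose 'selftext' is empty."""
    cleaned = []
    for record in records:
        text = (record.get("selftext") or "").replace(MOJIBAKE_APOSTROPHE, "'")
        if not text.strip():
            continue  # skip missing, empty, or whitespace-only selftext
        cleaned.append({**record, "selftext": text})
    return cleaned


def crawl_parallel(fetch, keywords, max_workers=4):
    """Call fetch(keyword) for each keyword on a thread pool and flatten the results.

    'fetch' is a placeholder for the existing submission or comment crawler;
    it is expected to return a list of records for one keyword.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as 'keywords'.
        batches = pool.map(fetch, keywords)
        return [record for batch in batches for record in batch]
```

Threads are used here because crawling is mostly network-bound. The worker count would likely need to stay small enough to respect the API's rate limits.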