Pushshift Reddit Benchmark Integration

vGsteiger commented 1 year ago

To better plan the remaining work and keep an overview of the tasks still open for implementing the Pushshift Reddit benchmark, I ask you to go through the following tasks for integrating the benchmark into our testing infrastructure:

[ ] Think through and conceptualise which parts of the infrastructure would have to be adjusted to enable the Pushshift Reddit benchmark. In my opinion the parts that need to be adapted are storage and trainer, but please think through your benchmark in relation to our infrastructure, because additional parts might need changes to accommodate it.
[ ] For each part that needs to be adjusted, create an issue with the labels workflow/backlog and either enhancement or documentation, depending on the nature of the adjustment, and with the milestone Pushshift Reddit Benchmark Integration. Please don't forget to create issues for documenting what you are doing/adding.

I know that, without the underlying architecture fully implemented, these issues will be rather high level, but we can refine them later on without problems. It's just good if you are already thinking about how your contribution fits with the architecture.

If you have any questions please feel free to contact me :)

roxanastiuca commented 1 year ago

The workflow of using Pushshift data with a text classifier is the following:

Query for submissions/comments from Reddit using the Pushshift API or the Python wrapper library psaw, such as in this script. The raw text can be minimally pre-processed (remove punctuation, lowercase) and labeled.
Store the data, either by creating multiple CSV files for different subreddits or time periods, or by saving it directly to a database.
Load samples for training/testing. The text data must be transformed into sequences of indices, one per token/word, based on a static vocabulary and word vectors. Also, because text sequences have varying length, the DataLoader must use a custom collate function to concatenate all samples into a tensor.
Train the PyTorch model.
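
The pre-processing, indexing, and collate steps could be sketched roughly as below. This is a minimal, library-free sketch and not the benchmark's actual code: the vocabulary `VOCAB`, the `<unk>` fallback, and the function names `preprocess`, `encode`, and `collate_batch` are all hypothetical, and the flat "token ids plus offsets" layout is just one assumed way to handle varying sequence lengths.

```python
import string

# Hypothetical static vocabulary; index 0 is reserved for unknown tokens.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "dog": 4}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Minimal pre-processing: strip ASCII punctuation and lowercase."""
    return text.translate(_PUNCT_TABLE).lower()


def encode(text, vocab=VOCAB):
    """Map each whitespace-separated token to its vocabulary index."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in preprocess(text).split()]


def collate_batch(batch, vocab=VOCAB):
    """Concatenate variable-length samples into one flat sequence.

    batch: list of (label, text) pairs.
    Returns (labels, token_ids, offsets), where offsets[i] is the
    position in token_ids at which sample i starts.
    """
    labels, token_ids, offsets = [], [], []
    for label, text in batch:
        offsets.append(len(token_ids))
        token_ids.extend(encode(text, vocab))
        labels.append(label)
    return labels, token_ids, offsets


batch = [(0, "The cat sat."), (1, "Dog!")]
labels, token_ids, offsets = collate_batch(batch)
# labels == [0, 1], token_ids == [1, 2, 3, 4], offsets == [0, 3]
```

In a PyTorch setup, the three lists would typically be wrapped in `torch.tensor(...)` and the function passed as the `collate_fn` argument of the DataLoader; the flat ids with offsets are the input layout that `nn.EmbeddingBag` accepts.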

This breaks down into the following components:

Storage: very flexible for Reddit; any of the implemented data sources should work for this
Dataset & DataLoader: PyTorch
Trainer: PyTorch

I don't see any modifications needed for this integration; adding the custom classes for Dataset & Trainer and configuring the DataLoader correctly should do it. I will document anything I find in the base components that needs adapting.
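
A custom Dataset class of the kind mentioned above might look roughly like this. It is only a sketch under assumptions: the class name `SubredditCsvDataset`, the CSV column names `subreddit` and `text`, and the `label_of` mapping are hypothetical and not taken from the project. It is written without PyTorch so it runs on its own; a real version would subclass `torch.utils.data.Dataset`, which uses the same `__getitem__`/`__len__` protocol.

```python
import csv
import io


class SubredditCsvDataset:
    """Map-style dataset over CSV rows with 'subreddit' and 'text' columns.

    Hypothetical sketch: column names and label mapping are assumptions.
    """

    def __init__(self, csv_file, label_of):
        # csv_file: any open text file object with a header row.
        # label_of: dict mapping subreddit name -> integer class label.
        reader = csv.DictReader(csv_file)
        self.samples = [
            (label_of[row["subreddit"]], row["text"]) for row in reader
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Returns a (label, raw_text) pair; encoding is left to the collate step.
        return self.samples[idx]


# Usage with an in-memory CSV standing in for a stored file.
raw = io.StringIO("subreddit,text\naskscience,Why is the sky blue?\ncooking,Best pasta?\n")
dataset = SubredditCsvDataset(raw, {"askscience": 0, "cooking": 1})
# len(dataset) == 2, dataset[1] == (1, "Best pasta?")
```

Keeping raw text in the dataset and deferring tokenisation to the collate function is one possible split of responsibilities; encoding once at load time would be an equally valid choice.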

vGsteiger commented 1 year ago

Hi @roxanastiuca !

Thanks for your evaluation. Could you still add one or more issues for the custom classes for Dataset and Trainer? Just to keep it nicely packaged up and to have a place to comment/note things down.

Thanks!

roxanastiuca commented 1 year ago

Progress:

Added a Data Source for Pushshift (this queries the Pushshift API directly and saves data to Storage, using the existing Base DataSource): https://github.com/eth-easl/dynamic_datasets_dsl/blob/feature/roxanastiuca/reddit_datasource/modyn/storage/datasource/reddit_data_source.py

Next up:

Integrate the model and the trainer for Subreddit Classification and run an experiment. This looks good so far in terms of compatibility with the base project.
Make a PR with the Reddit Benchmark integration.

Also, if Pushshift is down, we can use the CSV data source with data I already queried and saved.

vGsteiger commented 1 year ago

@roxanastiuca what is the progress here? Did you want to create separate issues similar to Ambarish? For example #86 #87 #88

MaxiBoether commented 1 year ago

It would be great to have a rundown with individual tasks like Ambarish did, indeed!

vGsteiger commented 1 year ago

@MaxiBoether I think we might as well close this issue as well? I think Olga is tracking this in new PRs? And this was meant as a meta issue to create the correct issues which was never really done?

MaxiBoether commented 1 year ago

Is Olga tracking this in new issues? #101 is set to solve this issue here, right?

vGsteiger commented 1 year ago

@OlgaOvcharenko If you're already tracking this please feel free to close this issue.

eth-easl / modyn

Pushshift Reddit Benchmark Integration #28