eth-easl / modyn

Modyn is a research platform for training ML models on growing datasets.
MIT License

Pushshift Reddit Benchmark Integration #28

Open vGsteiger opened 1 year ago

vGsteiger commented 1 year ago

To help plan the work and get an overview of the tasks still open for the Pushshift Reddit benchmark, please go through the following functions with regard to integrating the benchmark into our testing infrastructure:

I know that, without the underlying architecture fully implemented, these issues will be rather high level, but we can easily refine them later. It's just good if you are already thinking about how your contribution fits with the architecture.

If you have any questions please feel free to contact me :)

roxanastiuca commented 1 year ago

The workflow of using Pushshift data with a text classifier is the following:

This breaks down in the components:

I don't see any modifications needed for this integration: adding the custom classes for Dataset & Trainer and configuring the DataLoader correctly should be enough. I will document anything in the base components that turns out to need adapting.
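As a rough illustration of the custom Dataset class mentioned above, here is a minimal sketch. It assumes the PyTorch-style meaning of Dataset and DataLoader, where a map-style dataset only needs `__len__` and `__getitem__`. The class name `PushshiftTextDataset` and the field names `body` and `subreddit` are hypothetical and not taken from Modyn.

```python
from typing import Sequence, Tuple


class PushshiftTextDataset:
    """Hypothetical map-style dataset over already-fetched Reddit posts.

    Implements only __len__ and __getitem__, the protocol a PyTorch-style
    DataLoader relies on for map-style datasets. Field names are assumptions.
    """

    def __init__(
        self,
        posts: Sequence[dict],
        text_field: str = "body",
        label_field: str = "subreddit",
    ) -> None:
        self._posts = list(posts)
        self._text_field = text_field
        self._label_field = label_field
        # Map each distinct label string to a stable integer class id.
        labels = sorted({post[label_field] for post in self._posts})
        self._label_to_id = {name: idx for idx, name in enumerate(labels)}

    def __len__(self) -> int:
        return len(self._posts)

    def __getitem__(self, index: int) -> Tuple[str, int]:
        post = self._posts[index]
        return post[self._text_field], self._label_to_id[post[self._label_field]]


# Tiny in-memory example standing in for real Pushshift results.
example_posts = [
    {"body": "list comprehensions", "subreddit": "python"},
    {"body": "borrow checker", "subreddit": "rust"},
    {"body": "type hints", "subreddit": "python"},
]
dataset = PushshiftTextDataset(example_posts)
```

Tokenization and batching are left out here; in a PyTorch setup they would typically live in the DataLoader's collate function or in the Trainer.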

vGsteiger commented 1 year ago

Hi @roxanastiuca !

Thanks for your evaluation. Could you still add one or more issues for the custom Dataset and Trainer classes? Just to keep it nicely packaged up and to have a place to comment and note things down.

Thanks!

roxanastiuca commented 1 year ago

Progress:

Next up:

Also, if Pushshift is down, we can use the CSV data source with data I have already queried and saved.
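A minimal sketch of that fallback, assuming the saved data is a CSV file with a header row. The function names, the `fetch_live` callable, and the column names are hypothetical and only illustrate the idea of trying the live source first and reading the saved file when it is unreachable.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, TextIO


def read_posts_csv(handle: TextIO) -> List[Dict[str, str]]:
    """Read previously saved posts from an open CSV file with a header row."""
    return list(csv.DictReader(handle))


def load_posts(
    fetch_live: Callable[[], Iterable[Dict[str, str]]],
    fallback: TextIO,
) -> List[Dict[str, str]]:
    """Try the live source first; use the saved CSV if it is unreachable."""
    try:
        return list(fetch_live())
    except (ConnectionError, TimeoutError):
        return read_posts_csv(fallback)


def _unavailable() -> Iterable[Dict[str, str]]:
    # Stand-in for a live query while the service is down.
    raise ConnectionError("live source unreachable")


# In-memory CSV standing in for the saved file.
saved = io.StringIO("body,subreddit\nhello world,python\n")
posts = load_posts(_unavailable, saved)
```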

vGsteiger commented 1 year ago

@roxanastiuca what is the progress here? Did you want to create separate issues similar to Ambarish? For example #86 #87 #88

MaxiBoether commented 1 year ago

It would be great to have a rundown with individual tasks like Ambarish did, indeed!

vGsteiger commented 1 year ago

@MaxiBoether I think we might as well close this issue? I think Olga is tracking this in new PRs, and this was meant as a meta issue for creating the correct issues, which was never really done.

MaxiBoether commented 1 year ago

Is Olga tracking this in new issues? #101 is set to solve this issue here, right?

vGsteiger commented 1 year ago

@OlgaOvcharenko If you're already tracking this please feel free to close this issue.