Closed: guidopetri closed this issue 4 years ago
@charlesoblack I can do it either way (or both ways). Since I haven't run the vectorizers yet, I'll try it pre-vectorization first. This means that the features Aren and I engineered will need to be re-implemented on the downsampled training set. Maybe we should both just function-ize our feature engineering now? Thoughts @arendakessian?
I'll also write a Piazza post to confirm with the instructors that there isn't any issue with downsampling before vectorizing.
Tasks (for Kelsey):
1) Function created with commit e3a19a7c65700285b440552104f12c1626db3083
I'm confused about how best to implement 2 and 3. We now have the features I created (number of caps and exclamations), Aren's feature (reviews to date), the downsampling function, and the vectorizing, each in a different notebook or .py. I think we should function-ize each of these and then have one notebook that calls them all in - a rough sketch of what that could look like is below. What do you guys think @arendakessian @charlesoblack?
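For concreteness, here's a minimal sketch of what function-izing the two existing features might look like - all function, column, and parameter names below are hypothetical, not what's actually in our notebooks:

```python
import pandas as pd

def add_caps_and_exclamations(df: pd.DataFrame, text_col: str = "review") -> pd.DataFrame:
    """Add counts of ALL-CAPS words and exclamation marks per review."""
    out = df.copy()
    out["n_caps_words"] = out[text_col].str.findall(r"\b[A-Z]{2,}\b").str.len()
    out["n_exclamations"] = out[text_col].str.count("!")
    return out

def add_reviews_to_date(df: pd.DataFrame, user_col: str = "user_id",
                        date_col: str = "date") -> pd.DataFrame:
    """Add a running count of each user's prior reviews.
    Note: this one depends on the full dataset, so it has to run
    before any downsampling."""
    out = df.sort_values(date_col).copy()
    out["reviews_to_date"] = out.groupby(user_col).cumcount()
    return out.sort_index()
```

Each function takes a DataFrame and returns it with the new columns added, so a single notebook could just chain them one after another.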
Basically the idea is to have something in the README.md that explains how the repo is supposed to be run. E.g.:

1. downsample.ipynb
2. vectorize.ipynb
3. create_a_bunch_of_features.ipynb
4. put_everything_together.ipynb
5. train_model.ipynb

Since I don't know what notebooks we'll have yet, I've been holding off on creating a file like that - but if you want, I can do that right away and update it as we go along. Also, some things can probably be run simultaneously (e.g. all the feature creation), so those will be in a random order. But yeah, that's my plan - that way we don't need to write any functions or any import code (and lose our minds), just follow steps 1-5. (A sketch of what step 4 might look like is below.)
(The above being said - maybe this should be a separate issue? if so, I don't mind creating a new issue for this)
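For what it's worth, here's roughly what the put_everything_together step could look like - just a sketch assuming the vectorizer hands us a CSR matrix and the engineered features are numeric DataFrame columns (all names below are made up):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-ins for the real artifacts:
X_text = sparse.random(5, 10, density=0.3, format="csr")        # vectorizer output
feature_df = pd.DataFrame({"n_caps_words": [0, 2, 1, 0, 3],
                           "n_exclamations": [1, 0, 0, 4, 2]})  # engineered features

# Stack the text matrix side-by-side with the dense features
X_feats = sparse.csr_matrix(feature_df.to_numpy(dtype=np.float64))
X_all = sparse.hstack([X_text, X_feats], format="csr")          # shape (5, 12)
```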
For the downsampling: It seems your .py code relies on the data being in a dataframe format... but what we have are CSR matrices.
Basically the problem is that if we downsample after vectorizing, or downsample before creating the feature from #3, then we have issues putting the dataset together (the former because of CSR matrices rather than pandas DataFrames, the latter because the feature depends on the entire dataset). I think it might actually just be easier if you output a bunch of indices for the examples to keep, so that those are used in #4. What do you think? (Toy illustration of the index approach below.)
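To illustrate why indices sidestep the format mismatch - the same integer array can slice both a CSR matrix and a DataFrame (toy data, made-up names):

```python
import numpy as np
import pandas as pd
from scipy import sparse

keep_idx = np.array([0, 2, 5])  # pretend this came out of the downsampler

X = sparse.random(6, 4, density=0.5, format="csr")  # stand-in for vectorizer output
df = pd.DataFrame({"n_exclamations": range(6)})     # stand-in for a feature column

X_down = X[keep_idx]         # CSR matrices support integer-array row indexing
df_down = df.iloc[keep_idx]  # positional indexing does the same for DataFrames
```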
Got it, yeah I was working under the assumption that we were downsampling first (as discussed earlier in the issue). I can create another function that outputs downsampled indices so that it can be applied after vectorizing.
Once you're done with the notebook, can you also edit the README.md filename for the notebook? I just called it downsample.ipynb but feel free to edit it to something else. Thank you!
I kept the name the same, so no update needed to the README.
downsample.ipynb created with commit 3cb71ba
This notebook reads in the training CSV and outputs indices that give a 50/50 split between the classes. If we later want to test other downsampling ratios, we can rerun this notebook with a different percentage. (A sketch of the core logic is below.)
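For reference, the core logic would be something along these lines - a minimal sketch assuming a 0/1 label column; the function and parameter names here are mine, not necessarily the notebook's:

```python
import numpy as np
import pandas as pd

def downsample_indices(labels: pd.Series, pos_frac: float = 0.5,
                       seed: int = 0) -> np.ndarray:
    """Return row indices that keep all positives and enough negatives
    so that positives make up `pos_frac` of the result."""
    rng = np.random.default_rng(seed)
    y = labels.to_numpy()
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_neg = int(len(pos_idx) * (1 - pos_frac) / pos_frac)
    keep_neg = rng.choice(neg_idx, size=min(n_neg, len(neg_idx)), replace=False)
    return np.sort(np.concatenate([pos_idx, keep_neg]))
```

With pos_frac=0.5 this gives the 50/50 split; rerunning with a different pos_frac covers the other ratios.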
She also responded and said that she didn't see any issue downsampling before OR after vectorization.
We should probably downsample the data we have so that we have approximately equal proportions of the majority/minority classes. The raw data is approximately 11% positive class, and we've got approx 250k rows of data (so roughly 27-28k positive examples; a 50/50 downsample would leave about 55k rows) - downsampling seems to be the way to go.
@kelseymarkey - are you going to want to run this before or after vectorizing is done?