arendakessian / spring2020-ml-project

fake review detection system

.ipynb for downsampling #2

Closed guidopetri closed 4 years ago

guidopetri commented 4 years ago

We should probably downsample the data we have so that we have approximately equal proportions of the majority/minority class. The raw data is approximately 11% positive class, and we've got approximately 250k rows of data - so downsampling seems to be the way to go.

@kelseymarkey - are you going to want to run this before or after vectorizing is done?
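A minimal sketch of that kind of downsampling with pandas - the column name `label` and the toy data are assumptions, not the project's actual schema:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ~250k-row dataset: roughly 11% positive class.
# The column name "label" is an assumption.
rng = np.random.default_rng(0)
df = pd.DataFrame({"label": rng.random(1000) < 0.11})

# Keep every minority-class row, sample an equal number of majority rows.
minority = df[df["label"]]
majority = df[~df["label"]].sample(n=len(minority), random_state=0)

# Concatenate and shuffle so the classes are interleaved.
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)
print(balanced["label"].mean())  # 0.5 by construction
```

This keeps all of the rare positive examples and throws away most of the negatives, which is the usual trade-off when the majority class is this large.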

kelseymarkey commented 4 years ago

@charlesoblack I can do either way (or both ways). Since I haven't run vectorizers yet I'll first try it pre-vectorization. This means that features engineered by Aren and I will need to be re-implemented with the downsampled training set. Maybe we should both just function-ize our feature engineering now? Thoughts @arendakessian?

I'll also write a Piazza post confirming from instructors that there isn't any issue downsampling before vectorizing.

Tasks (for Kelsey):

kelseymarkey commented 4 years ago

1) Function created with commit e3a19a7c65700285b440552104f12c1626db3083

I'm confused about how to best implement 2 and 3. We now have the features I created (number of caps and exclamations), Aren's feature (reviews to date), the downsampling function, and vectorizing, each in a different notebook or .py. I think we should function-ize each of these things and then have one notebook that calls each of them. What do you guys think @arendakessian @charlesoblack ?

guidopetri commented 4 years ago

Basically the idea is to have something in the README.md that explains how the repo is supposed to be run. E.g.:

  1. Run downsample.ipynb.
  2. Run vectorize.ipynb.
  3. Run create_a_bunch_of_features.ipynb.
  4. Run put_everything_together.ipynb.
  5. Run train_model.ipynb.

Since I don't know what notebooks we'll have yet, I've been holding off on creating a file like that - but if you want, I can do that right away and update as we go along. Also, some things can probably be run simultaneously (e.g. all the feature creation), so the order of those steps is arbitrary. But yeah, that's my plan - that way we don't need to write any functions nor any import code (and lose our minds), just follow steps 1-5.

guidopetri commented 4 years ago

(The above being said - maybe this should be a separate issue? if so, I don't mind creating a new issue for this)

guidopetri commented 4 years ago

For the downsampling: It seems your .py code relies on the data being in a dataframe format... but what we have are CSR matrices.

Basically the problem is that - if we downsample after vectorizing, or downsample before creating the feature from #3 - then we have issues putting the dataset together (the former because of CSR matrices and not pandas DataFrames, the latter because the feature depends on the entire dataset). I think it might actually just be easier if you output a bunch of indices for the examples to keep so that that's called in #4 . What do you think?

kelseymarkey commented 4 years ago

> For the downsampling: It seems your .py code relies on the data being in a dataframe format... but what we have are CSR matrices.
>
> Basically the problem is that - if we downsample after vectorizing, or downsample before creating the feature from #3 - then we have issues putting the dataset together (the former because of CSR matrices and not pandas DataFrames, the latter because the feature depends on the entire dataset). I think it might actually just be easier if you output a bunch of indices for the examples to keep so that that's called in #4 . What do you think?

Got it, yeah I was working under the assumption that we were downsampling first (as discussed earlier in the issue). I can create another function that outputs downsampled indices so that it can be applied after vectorizing.
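One way to sketch that indices approach (function name and toy data are hypothetical; it assumes the label array is row-aligned with the CSR matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

def downsampled_indices(labels, random_state=0):
    """Return row indices giving a 50/50 class split,
    keeping all positives and a matching number of negatives."""
    rng = np.random.default_rng(random_state)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    rng.shuffle(idx)
    return idx

# CSR matrices support row indexing, so the same index array works
# on the vectorized features and on the label vector:
labels = np.array([0, 0, 0, 0, 1, 1])
X = csr_matrix(np.eye(6))
idx = downsampled_indices(labels)
X_down, y_down = X[idx], labels[idx]
print(y_down.mean())  # 0.5
```

Because the output is just an index array, it can be applied after vectorizing without ever converting the CSR matrix back to a DataFrame.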

guidopetri commented 4 years ago

Once you're done with the notebook, can you also edit the README.md filename for the notebook? I just called it downsample.ipynb but feel free to edit to something else. Thank you!

kelseymarkey commented 4 years ago

> Once you're done with the notebook, can you also edit the README.md filename for the notebook? I just called it downsample.ipynb but feel free to edit to something else. Thank you!

I kept the name the same, so no update needed to the README.md.

kelseymarkey commented 4 years ago

downsample.ipynb created with commit 3cb71ba

This notebook reads in the training csv and outputs a set of indices that yields a 50/50 class split. If we later want to test other downsampling ratios, we can rerun this notebook with a different percentage.
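A sketch of how such a percentage parameter might look (the function name, parameter, and toy labels are assumptions; downsample.ipynb itself may differ):

```python
import numpy as np

def downsample_indices(labels, pos_fraction=0.5, random_state=0):
    """Indices keeping all positives plus enough negatives so that
    positives make up pos_fraction of the result."""
    rng = np.random.default_rng(random_state)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    # Solve for the negative count implied by the requested fraction.
    n_neg = int(len(pos) * (1 - pos_fraction) / pos_fraction)
    neg = rng.choice(np.flatnonzero(labels == 0), size=n_neg, replace=False)
    idx = np.concatenate([pos, neg])
    rng.shuffle(idx)
    return idx

# Rerunning with a different ratio only means changing pos_fraction:
labels = np.array([1] * 10 + [0] * 90)
idx = downsample_indices(labels, pos_fraction=0.25)
print(labels[idx].mean())  # 0.25
```

The default of 0.5 reproduces the 50/50 split described above, while other fractions fall out of the same function.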

The instructor also responded and said she didn't see any issue downsampling before OR after vectorization.