Closed: guidopetri closed this issue 4 years ago
@charlesoblack I can do it either way (or both ways). Since I haven't run the vectorizers yet, I'll try it pre-vectorization first. This means that the features Aren and I engineered will need to be re-implemented on the downsampled training set. Maybe we should both just function-ize our feature engineering now? Thoughts @arendakessian?
I'll also write a Piazza post to confirm with the instructors that there isn't any issue with downsampling before vectorizing.
Tasks (for Kelsey):
1) Function created with commit e3a19a7c65700285b440552104f12c1626db3083
I'm confused about how best to implement 2 and 3. We now have the features I created (number of caps and exclamations), Aren's feature (reviews to date), the downsampling function, and the vectorizing, each in a different notebook or .py. I think we should function-ize each of these and then have one notebook that calls them all in - a rough sketch of what that could look like is below. What do you guys think @arendakessian @charlesoblack?
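For concreteness, here's a minimal sketch of what function-izing the two existing features might look like - all function, column, and parameter names below are hypothetical, not what's actually in our notebooks:

```python
import pandas as pd

def add_caps_and_exclamations(df: pd.DataFrame, text_col: str = "review") -> pd.DataFrame:
    """Add counts of ALL-CAPS words and exclamation marks per review."""
    out = df.copy()
    out["n_caps_words"] = out[text_col].str.findall(r"\b[A-Z]{2,}\b").str.len()
    out["n_exclamations"] = out[text_col].str.count("!")
    return out

def add_reviews_to_date(df: pd.DataFrame, user_col: str = "user_id",
                        date_col: str = "date") -> pd.DataFrame:
    """Add a running count of each user's prior reviews.
    Note: this one depends on the full dataset, so it has to run
    before any downsampling."""
    out = df.sort_values(date_col).copy()
    out["reviews_to_date"] = out.groupby(user_col).cumcount()
    return out.sort_index()
```

Each function takes a DataFrame and returns it with the new columns added, so a single notebook could just chain them one after another.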
Basically the idea is to have something in the README.md that explains how the repo is supposed to be run. E.g.:

1. downsample.ipynb
2. vectorize.ipynb
3. create_a_bunch_of_features.ipynb
4. put_everything_together.ipynb
5. train_model.ipynb

Since I don't know what notebooks we'll have yet, I've been holding off on creating a file like that - but if you want, I can do that right away and update it as we go along. Also, some things can probably be run simultaneously (e.g. all the feature creation), so those will be in a random order. But yeah, that's my plan - that way we don't need to write any functions or any import code (and lose our minds), just follow steps 1-5. (A sketch of what step 4 might look like is below.)
(The above being said - maybe this should be a separate issue? if so, I don't mind creating a new issue for this)
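For what it's worth, here's roughly what the put_everything_together step could look like - just a sketch assuming the vectorizer hands us a CSR matrix and the engineered features are numeric DataFrame columns (all names below are made up):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Stand-ins for the real artifacts:
X_text = sparse.random(5, 10, density=0.3, format="csr")        # vectorizer output
feature_df = pd.DataFrame({"n_caps_words": [0, 2, 1, 0, 3],
                           "n_exclamations": [1, 0, 0, 4, 2]})  # engineered features

# Stack the text matrix side-by-side with the dense features
X_feats = sparse.csr_matrix(feature_df.to_numpy(dtype=np.float64))
X_all = sparse.hstack([X_text, X_feats], format="csr")          # shape (5, 12)
```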
For the downsampling: It seems your .py code relies on the data being in a dataframe format... but what we have are CSR matrices.
Basically the problem is that if we downsample after vectorizing, or downsample before creating the feature from #3, then we have issues putting the dataset together (the former because of CSR matrices rather than pandas DataFrames, the latter because the feature depends on the entire dataset). I think it might actually just be easier if you output a bunch of indices for the examples to keep, so that those are used in #4. What do you think? (Toy illustration of the index approach below.)
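To illustrate why indices sidestep the format mismatch - the same integer array can slice both a CSR matrix and a DataFrame (toy data, made-up names):

```python
import numpy as np
import pandas as pd
from scipy import sparse

keep_idx = np.array([0, 2, 5])  # pretend this came out of the downsampler

X = sparse.random(6, 4, density=0.5, format="csr")  # stand-in for vectorizer output
df = pd.DataFrame({"n_exclamations": range(6)})     # stand-in for a feature column

X_down = X[keep_idx]         # CSR matrices support integer-array row indexing
df_down = df.iloc[keep_idx]  # positional indexing does the same for DataFrames
```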
Got it, yeah I was working under the assumption that we were downsampling first (as discussed earlier in the issue). I can create another function that outputs downsampled indices so that it can be applied after vectorizing.
Once you're done with the notebook, can you also edit the README.md filename for the notebook? I just called it downsample.ipynb but feel free to edit it to something else. Thank you!
I kept the name the same, so no update needed to the README.
downsample.ipynb created with commit 3cb71ba
This notebook reads in the training CSV and outputs indices that give a 50/50 split between the classes. If we later want to test other downsampling ratios, we can rerun this notebook with a different percentage. (A sketch of the core logic is below.)
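For reference, the core logic would be something along these lines - a minimal sketch assuming a 0/1 label column; the function and parameter names here are mine, not necessarily the notebook's:

```python
import numpy as np
import pandas as pd

def downsample_indices(labels: pd.Series, pos_frac: float = 0.5,
                       seed: int = 0) -> np.ndarray:
    """Return row indices that keep all positives and enough negatives
    so that positives make up `pos_frac` of the result."""
    rng = np.random.default_rng(seed)
    y = labels.to_numpy()
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_neg = int(len(pos_idx) * (1 - pos_frac) / pos_frac)
    keep_neg = rng.choice(neg_idx, size=min(n_neg, len(neg_idx)), replace=False)
    return np.sort(np.concatenate([pos_idx, keep_neg]))
```

With pos_frac=0.5 this gives the 50/50 split; rerunning with a different pos_frac covers the other ratios.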
She also responded and said that she didn't see any issue downsampling before OR after vectorization.
We should probably downsample the data we have so that we have approximately equal proportions of the majority/minority classes. The raw data is approximately 11% positive class, and we've got approx 250k rows of data (so roughly 27-28k positive examples; a 50/50 downsample would leave about 55k rows) - downsampling seems to be the way to go.
@kelseymarkey - are you going to want to run this before or after vectorizing is done?