Closed guidopetri closed 4 years ago
Looks like this is possible, albeit both need to be the same type.
@charlesoblack I think this notebook needs to be amended so that it takes the downsampled indices from downsample.ipynb and exports full_data with only downsampled indices.
Reopening this issue. Do you want me to add that to this notebook, or do you have it?
EDIT: Or were you thinking to keep both the full data and the downsampled data? Is there any reason to do baseline on full data instead of downsampled set?
I'll take care of it. I'm probably going to rename it as downsampled_data, though.
We should probably do the baseline on the downsampled set, as well. I'll do that and update the other issue.
Alright, downsampling (post-vectorization) is done. This is kind of a tentative close because I'm not really sure if we are gonna be able to run our analyses on the data - when I was experimenting, it seemed that we had too many features for my PC to run it at any reasonable speed. This means we might have to downsample, then vectorize. If that's the case, I'll edit the vectorization notebook to do that instead.
Ah, of course... I need to downsample the labels, too.
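For reference, a minimal sketch of keeping features and labels aligned when downsampling. The toy data and the names features, labels and keep_idx are made up for illustration; in the project the indices would come from downsample.ipynb. The idea is just to index both objects with the same array of row indices.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Toy stand-ins: 10 reviews x 6 vectorized features, plus one label per review.
features = sparse.csr_matrix(rng.integers(0, 3, size=(10, 6)))
labels = np.arange(10) % 2

# Hypothetical downsampled indices (the real ones would come from downsample.ipynb).
keep_idx = np.sort(rng.choice(features.shape[0], size=4, replace=False))

# Index the CSR rows and the label array with the same indices so they stay aligned.
downsampled_features = features[keep_idx]
downsampled_labels = labels[keep_idx]
```

Row indexing a CSR matrix with an index array returns another sparse matrix, so nothing has to be converted to dense along the way.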
All the features we're engineering are being put into different .pckl files under data/. We need an ipynb that combines all these features together into one coherent dataset so that we can run it through any ML models. This should also be exported as a .pckl in the same folder.

The tricky part here is that the vectorized review texts come as SciPy CSR matrices, whereas the other features will probably be NumPy arrays. Is it possible to combine these two directly? Or do we have to "dense-ify" the CSR matrix and then concat the arrays?
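A possible way to combine the two without dense-ifying the CSR matrix is sketched below, assuming scipy.sparse.hstack fits our case. The arrays, their shapes and the names text_features and other_features are illustrative only, not the project's actual data. The dense block is converted to CSR first so both blocks are the same type, and the result is round-tripped through pickle in memory rather than written to a real file.

```python
import pickle

import numpy as np
from scipy import sparse

# Toy stand-ins for the real features (names and shapes are illustrative only).
text_features = sparse.csr_matrix(np.array([[0, 2, 0], [1, 0, 0], [0, 0, 3]]))
other_features = np.array([[4.5, 1.0], [3.0, 0.0], [5.0, 2.0]])

# Convert the dense block to CSR so both blocks are the same type,
# then stack them column-wise into a single sparse matrix.
combined = sparse.hstack(
    [text_features, sparse.csr_matrix(other_features)], format="csr"
)

# Serialize and restore in memory; the project would write to a .pckl under data/.
payload = pickle.dumps(combined)
restored = pickle.loads(payload)
```

Both blocks need the same number of rows, in the same order, for the column-wise stack to line up each review with its other features.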