.ipynb for concatenating all our features

arendakessian / spring2020-ml-project

fake review detection system


.ipynb for concatenating all our features #4

Closed guidopetri closed 4 years ago

guidopetri commented 4 years ago

All the features we're engineering are being put into different .pckl files under data/. We need an ipynb that combines all these features together into one coherent dataset so that we can run it through any ML models. This should also be exported as a .pckl in the same folder.

The tricky part here is that the vectorized review texts come as SciPy CSR matrices, whereas the other features will probably be NumPy arrays. Is it possible to combine these two directly? Or do we have to "dense-ify" the CSR matrix and then concat the arrays?
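For a rough sense of what "dense-ifying" would cost, here is a minimal sketch that compares the memory a CSR matrix holds against what a dense array of the same shape would need. The shape and density are made-up stand-ins, not measurements from our data, and the dense array is never actually allocated.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the vectorized review texts:
# 1,000 reviews x 50,000 terms, about 0.1% of entries non-zero.
text_features = sparse.random(
    1000, 50000, density=0.001, format="csr", dtype=np.float64, random_state=0
)

# Memory held by the CSR matrix: stored values + column indices + row pointers.
sparse_bytes = (
    text_features.data.nbytes
    + text_features.indices.nbytes
    + text_features.indptr.nbytes
)

# Memory a dense array of the same shape and dtype would need (computed, not allocated).
n_rows, n_cols = text_features.shape
dense_bytes = n_rows * n_cols * text_features.dtype.itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e6:.1f} MB")
```

If the real matrix is anywhere near this sparse, keeping everything in sparse form is likely the safer option.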

guidopetri commented 4 years ago

Looks like this is possible, although both need to be the same type.
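One way this could look, as a sketch rather than what the notebook actually does: convert the dense features to CSR with a matching dtype, stack column-wise with `scipy.sparse.hstack`, and pickle the result. The toy arrays, the `other_features` name, and the `full_data.pckl` file name are assumptions for illustration, and a temporary directory stands in for `data/`.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from scipy import sparse

# Hypothetical inputs: 4 reviews, 6 vectorized text features (sparse)
# and 2 other engineered features (dense).
text_features = sparse.csr_matrix(
    np.array(
        [
            [0, 1, 0, 0, 2, 0],
            [0, 0, 0, 3, 0, 0],
            [1, 0, 0, 0, 0, 0],
            [0, 0, 4, 0, 0, 1],
        ],
        dtype=np.float64,
    )
)
other_features = np.array([[5.0, 0.2], [1.0, 0.9], [3.0, 0.5], [2.0, 0.1]])

# Bring the dense block to the same container type and dtype, then stack column-wise.
other_sparse = sparse.csr_matrix(other_features.astype(text_features.dtype))
full_data = sparse.hstack([text_features, other_sparse], format="csr")

# Export as a pickle and read it back to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "full_data.pckl"  # hypothetical file name
    with open(out_path, "wb") as f:
        pickle.dump(full_data, f)
    with open(out_path, "rb") as f:
        reloaded = pickle.load(f)

print(full_data.shape)
```

The result stays sparse, so nothing has to be converted to a dense array before modelling.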

kelseymarkey commented 4 years ago

@charlesoblack I think this notebook needs to be amended so that it takes the downsampled indices from downsample.ipynb and exports full_data with only downsampled indices.

Reopening this issue. Do you want me to add that to this notebook, or do you have it?

EDIT: Or were you thinking of keeping both the full data and the downsampled data? Is there any reason to run the baseline on the full data instead of the downsampled set?

guidopetri commented 4 years ago

I'll take care of it. I'm probably going to rename it as downsampled_data though.

We should probably do the baseline on the downsampled set, as well. I'll do that and update the other issue.

guidopetri commented 4 years ago

Alright, downsampling (post-vectorization) is done. This is kind of a tentative close because I'm not really sure if we are gonna be able to run our analyses on the data: when I was experimenting, it seemed that we had too many features for my PC to run them at any reasonable speed. This means we might have to downsample first, then vectorize. If that's the case, I'll edit the vectorization notebook to do that instead.

guidopetri commented 4 years ago

Ah, of course... I need to downsample the labels, too.
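A minimal sketch of that fix, assuming the downsampled indices are row positions into the full dataset: apply the same index array to both the feature matrix and the labels so the rows stay aligned. The toy data and the names `labels`, `downsampled_idx`, and `downsampled_labels` are hypothetical.

```python
import numpy as np
from scipy import sparse

# Hypothetical data: 6 reviews x 3 features, plus one label per review.
full_data = sparse.csr_matrix(np.arange(18, dtype=np.float64).reshape(6, 3))
labels = np.array([0, 0, 0, 0, 1, 1])

# Hypothetical row indices kept by the downsampling step.
downsampled_idx = np.array([0, 3, 4, 5])

# Index features and labels with the same array so row i of the data
# still corresponds to label i.
downsampled_data = full_data[downsampled_idx]
downsampled_labels = labels[downsampled_idx]

print(downsampled_data.shape, downsampled_labels.shape)
```

CSR matrices support this kind of row indexing directly, so the features never need to be converted to a dense array for the downsampling step.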