CPSSD / LUCAS

The repository for the LUCAS/Lucify project
MIT License

Obtain results for neural benchmarks on Grove #144

Closed · StefanKennedy closed this issue 5 years ago

StefanKennedy commented 5 years ago

Acceptance criteria (ACs):

Possible reasons for the current blocker (performance):

StefanKennedy commented 5 years ago

It seems that the "Illegal Instruction" message is actually caused by running the job on a node that did not have the dependencies installed on it. Could be related to this: https://stackoverflow.com/questions/29338016/import-theano-gets-illegal-instruction

StefanKennedy commented 5 years ago

It seems that when training starts, Keras ramps up from 4-5 cores to over 10:

On 10,000 data entries:

StefanKennedy commented 5 years ago

Could we need to use some kind of Keras input pipeline? Reading all of the input data into memory could be the cause of the memory errors.
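
For reference, a minimal sketch of one standard way to avoid reading everything into memory: a `keras.utils.Sequence` that serves one batch at a time. This is an assumption about the fix, not the project's actual code, and `load_batch` is a hypothetical helper that reads one preprocessed batch from disk.

```python
import numpy as np
from keras.utils import Sequence

class ReviewBatches(Sequence):
    """Serves one batch at a time so the full dataset never sits in memory."""

    def __init__(self, n_samples, batch_size):
        self.n_samples = n_samples
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(self.n_samples / self.batch_size))

    def __getitem__(self, i):
        # load_batch is a hypothetical helper that reads (X, y) for batch i from disk
        return load_batch(i)

# Keras of that era consumes this via model.fit_generator(ReviewBatches(152970, 32))
```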

StefanKennedy commented 5 years ago

School of Computing SSH seems to be down 🙄

StefanKennedy commented 5 years ago

Using venv instead of udocker is speeding things up. Jupyter Notebook can now be used; I've added a bit about it to 'Grove Groovines' (though it's not fully worked out yet).

StefanKennedy commented 5 years ago

Fixed a problem with how I was providing data to the embedding layer; epochs now take only 20s (on 10,000 total entries). The PR with the notebook will show what I'm providing as input (I had misunderstood the layers before).
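
The thread doesn't spell out the misunderstanding, but a common one with `Embedding` (and my assumption about this one) is feeding the layer pre-computed vectors instead of integer word indices. A minimal sketch of the expected input format, with toy data:

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = ["great product would buy again", "total scam do not buy"]  # toy stand-ins
labels = [0, 1]

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
# The Embedding layer expects padded sequences of integer word indices
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=300)

model = Sequential([
    Embedding(input_dim=10000, output_dim=100, input_length=300),
    Flatten(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```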

StefanKennedy commented 5 years ago

For some reason, even on 50,000 entries I'm getting 100% validation and test accuracy. Shuffling the dataset resolves this: the first 50,000 entries are all genuine.
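
For the record, the shuffle has to use one permutation for both the data and the labels, applied before the split. A minimal sketch, assuming `X` and `y` are NumPy arrays:

```python
import numpy as np

rng = np.random.RandomState(42)   # fixed seed so splits are reproducible
perm = rng.permutation(len(X))    # one permutation shared by data and labels
X, y = X[perm], y[perm]           # shuffle *before* the train/val/test split
```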

StefanKennedy commented 5 years ago

Accuracy is not much higher than 60% with the FFNN. It could be a good idea to try more features, such as reviewer embeddings and reviewed-product embeddings. Check the paper 'Learning to Represent Review with Tensor Decomposition for Spam Detection'.
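
One way to wire in those extra features is a multi-input model that concatenates the embeddings. A sketch using the Keras functional API; all sizes are placeholders, not values from the paper or the repo:

```python
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Concatenate, Dense

n_reviewers, n_products, max_len, vocab = 10000, 5000, 300, 20000  # placeholders

text_in = Input(shape=(max_len,))
reviewer_in = Input(shape=(1,))
product_in = Input(shape=(1,))

# One embedding table per input, flattened and concatenated into one feature vector
text_emb = Flatten()(Embedding(vocab, 100)(text_in))
reviewer_emb = Flatten()(Embedding(n_reviewers, 16)(reviewer_in))
product_emb = Flatten()(Embedding(n_products, 16)(product_in))

x = Concatenate()([text_emb, reviewer_emb, product_emb])
x = Dense(64, activation="relu")(x)
out = Dense(1, activation="sigmoid")(x)

model = Model([text_in, reviewer_in, product_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```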

StefanKennedy commented 5 years ago

The FFNN is running on the full dataset. CNNs need to be run on GPUs because of the TensorFlow build that was recommended for use on Grove. ~~Going to see if I can use a GPU~~

It was simply that, when running on a GPU, the input to a convolutional network must be in NHWC format (channels last).
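
A sketch of the channels-last fix, assuming the word2vec "images" come out of preprocessing as (batch, 300, 300):

```python
import numpy as np

x = np.random.rand(8, 300, 300)  # stand-in for a batch of word2vec matrices

# Add a trailing channel axis: (batch, 300, 300) -> (batch, 300, 300, 1), i.e. NHWC
x = np.expand_dims(x, axis=-1)

# If the data were NCHW (channels first) instead, transpose it:
# x = np.transpose(x, (0, 2, 3, 1))
```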

StefanKennedy commented 5 years ago

CNNs crash with a MemoryError, even after filtering out all reviews longer than 300.

With a shape of (152970, 300, 300, 1) we're using 152970 × 300 × 300 × 16 bytes ≈ 220 GB. It should fit on g105, but loading word2vec is probably putting it over the edge. Writing out only the embeddings we actually use should save some memory.
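
A sketch of that idea with gensim; the paths are made up, and `word_index` is assumed to be the tokenizer's word-to-index mapping:

```python
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Keep only the rows we will actually index, then drop the full model
embedding_matrix = np.zeros((len(word_index) + 1, 300), dtype=np.float32)
for word, i in word_index.items():
    if word in w2v:
        embedding_matrix[i] = w2v[word]

del w2v  # release the multi-gigabyte full model
np.save("embedding_matrix.npy", embedding_matrix)
```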


Deniall commented 5 years ago

Chunk the training data into batches of 50 and just do mini-batch training to solve the memory issue.
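
A minimal sketch of that approach; `load_chunk`, `n_chunks`, and `n_epochs` are hypothetical, and Keras's `train_on_batch` does the per-chunk update:

```python
# Stream the training data chunk by chunk instead of as one giant array
for epoch in range(n_epochs):
    for i in range(n_chunks):
        X_chunk, y_chunk = load_chunk(i)  # hypothetical loader reading chunk i from disk
        model.train_on_batch(X_chunk, y_chunk)
```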

StefanKennedy commented 5 years ago

A good way to free up memory after running a model is to manually trigger Python's garbage collector:

```python
import gc

gc.collect()  # force a collection pass to reclaim unreferenced objects
```
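
Note that `gc.collect()` can only reclaim objects with no remaining references, so drop the reference first (e.g. `del model`).
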
StefanKennedy commented 5 years ago

Convolutional networks now take longer to overfit with more data. This is partly because I switched to a GlobalMaxPooling layer. Currently getting 66.7% accuracy with CNNs on word2vec; it's definitely worth doing some tweaking. It takes ~6 epochs to overfit, better than the almost immediate overfitting with the small dataset.
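
For context, the usual shape of such a model; an illustrative sketch with placeholder hyperparameters, not the notebook's actual configuration:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(20000, 300, input_length=300),  # word2vec-sized word vectors
    Conv1D(128, 5, activation="relu"),        # n-gram-like filters over the sequence
    GlobalMaxPooling1D(),                     # one value per filter, regardless of length
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```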

StefanKennedy commented 5 years ago

Three benchmark notebooks are now ready, apart from a few missing pieces that I need access to Grove to fix.

StefanKennedy commented 5 years ago

I've set something up so I can run each fold of cross-validation in parallel 😄 It also calculates the F1 score and AUROC.
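
The comment doesn't show the setup, but for reference scikit-learn's `cross_validate` does the same thing: folds run in parallel via `n_jobs`, and multiple metrics come from `scoring`. This assumes an estimator `clf` and arrays `X`, `y`:

```python
from sklearn.model_selection import cross_validate

# n_jobs=-1 runs the folds in parallel across all available cores
scores = cross_validate(clf, X, y, cv=5, n_jobs=-1, scoring=("f1", "roc_auc"))
print(scores["test_f1"].mean(), scores["test_roc_auc"].mean())
```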