CPSSD / LUCAS

The repository for the LUCAS/Lucify project
MIT License

Add method of limiting dataset to reviews with a max word length #205

Closed: StefanKennedy closed this 5 years ago

StefanKennedy commented 5 years ago

Some of our models require a max review length due to memory constraints, so this PR adds a function to the feature_extraction.py script that lets us specify the max sequence length of the dataset we read. We should obtain all benchmarks on the same dataset, so we should filter out the same reviews even for experiments that do not have memory constraints.
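For illustration, here is a minimal sketch of what such a filter might look like. It is not the code from this PR: the function name `limit_to_max_words` and its parameters are hypothetical, and it assumes reviews are plain strings whose length is measured in whitespace-separated words.

```python
from typing import Iterable, List


def limit_to_max_words(reviews: Iterable[str], max_words: int) -> List[str]:
    """Keep only reviews with at most `max_words` whitespace-separated words.

    Hypothetical helper for illustration; not the function added to
    feature_extraction.py in this PR.
    """
    if max_words < 0:
        raise ValueError("max_words must be non-negative")
    # str.split() with no arguments splits on any run of whitespace,
    # so repeated spaces and newlines do not inflate the word count.
    return [review for review in reviews if len(review.split()) <= max_words]
```

Running the same filter with the same `max_words` before every experiment, including those without memory constraints, would leave each benchmark with an identical subset of reviews.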

Deniall commented 5 years ago

Remember, the 320 limit is there because BERT has a limit of 320 characters, not words. However, this seems to count words?

StefanKennedy commented 5 years ago

That is actually news to me. There's no BERT on this dataset anyway; this limit is more a limit of the CNNs.