thesis
The classification of municipality documents
Finished as of 23/4
- Loading in new data from zoek.officielebekendmakingen.nl
- Loading in data from api.openraadsinformatie.nl
- Standard pre-processing of the data
- Layout that gives a correct overview of the data
- Import of pre-trained Word2Vec embeddings
- Implementation of Multinomial Naive Bayes, Stochastic Gradient Descent, Random Forest, Logistic Regression
- Implementation of CNN
a. Standard deep implementation with constant filter size
b. Deep implementation with a mix of filter sizes
c. Cutting up sentences into smaller parts and classifying each individual part, then aggregating that into one prediction. This can be combined with a and b
- Set of evaluation criteria: recall, accuracy, precision and F1, plus the option to show performance per epoch and per amount of training data.
- Written introduction
- Classified a handful of documents within the openraadsinformatie test set
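Item c above (cut a text into smaller parts, classify each part, aggregate into one prediction) can be sketched roughly as follows. This is a minimal illustration and not the thesis code: the names `split_into_chunks`, `aggregate_mean`, `aggregate_vote` and `classify_document` are hypothetical, the chunk size is arbitrary, and `toy_classifier` is only a stand-in for a trained CNN that returns class probabilities per chunk.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Probs = Dict[str, float]  # class label -> probability


def split_into_chunks(tokens: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Cut a token sequence into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def aggregate_mean(chunk_probs: Sequence[Probs]) -> Probs:
    """Average the class probabilities over all chunks."""
    totals: Dict[str, float] = {}
    for probs in chunk_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    return {label: total / len(chunk_probs) for label, total in totals.items()}


def aggregate_vote(chunk_probs: Sequence[Probs]) -> Probs:
    """Majority vote: each chunk votes for its most probable label."""
    votes = Counter(max(probs, key=probs.get) for probs in chunk_probs)
    return {label: count / len(chunk_probs) for label, count in votes.items()}


def classify_document(
    tokens: Sequence[str],
    classify_chunk: Callable[[List[str]], Probs],
    chunk_size: int = 50,
    aggregate: Callable[[Sequence[Probs]], Probs] = aggregate_mean,
) -> str:
    """Classify every chunk separately and combine the results into one label."""
    chunks = split_into_chunks(tokens, chunk_size)
    if not chunks:
        raise ValueError("document has no tokens")
    combined = aggregate([classify_chunk(chunk) for chunk in chunks])
    return max(combined, key=combined.get)


def toy_classifier(chunk: List[str]) -> Probs:
    """Stand-in for a trained model (assumption): scores a chunk by one keyword."""
    hits = sum(1 for token in chunk if token == "vergunning")
    p = min(1.0, 0.2 + 0.4 * hits)
    return {"permit": p, "other": 1.0 - p}


if __name__ == "__main__":
    doc = ["vergunning", "voor", "bouw", "aan", "de", "straat"]
    print(classify_document(doc, toy_classifier, chunk_size=2))
    print(classify_document(doc, toy_classifier, chunk_size=2, aggregate=aggregate_vote))
```

Passing the aggregation function as an argument keeps the chunking independent of the model, so the same wrapper could in principle sit around implementation a or b, and different aggregation methods can be compared without retraining.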
Finished as of 31/5
- Loading in enough data from 2000 until now.
- Retraining of all baselines with parameter optimization
- Retraining of CNNs without parameter optimization
Short term goals
- Writing literature review and methodology
- Check performance on the openraadsinformatie set, and perhaps categorize the cases in which a prediction should be attempted
Longer term goals
- Alternative implementation of CNNs with par2vec for long sentences
- Parameter optimization, such as filter sizes, aggregation methods, activation functions, loss functions
- Own Word2Vec implementation trained on a specialized corpus, plus a check of how it affects the metrics
- Contribute to OpenState implementation
- Writing results, discussion and conclusion
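The parameter optimization goal above could be organized as a plain grid search over the listed dimensions. The sketch below is only an assumption about how that might look: the values in `SEARCH_SPACE` are illustrative and not taken from any experiment, and `evaluate` is a hypothetical callable that would train a CNN with the given configuration and return a validation score such as F1.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Tuple

Config = Dict[str, object]

# Illustrative search space (assumption): one entry per dimension named in the goals.
SEARCH_SPACE: Dict[str, List[object]] = {
    "filter_sizes": [(3,), (3, 4, 5), (2, 3, 4, 5)],
    "aggregation": ["mean", "max", "vote"],
    "activation": ["relu", "tanh"],
    "loss": ["categorical_crossentropy", "hinge"],
}


def iter_configs(space: Dict[str, List[object]]) -> Iterator[Config]:
    """Yield every combination of parameter values as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(
    space: Dict[str, List[object]],
    evaluate: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Return the configuration with the highest score and that score."""
    best_config: Config = {}
    best_score = float("-inf")
    for config in iter_configs(space):
        score = evaluate(config)
        if score > best_score:  # ties keep the first configuration found
            best_config, best_score = config, score
    return best_config, best_score


def dummy_evaluate(config: Config) -> float:
    """Placeholder score (assumption); a real run would train and validate a model."""
    bonus = 1.0 if config["activation"] == "relu" else 0.0
    return float(len(config["filter_sizes"])) + bonus


if __name__ == "__main__":
    print(grid_search(SEARCH_SPACE, dummy_evaluate))
```

A full grid grows multiplicatively with every added dimension, so in practice the score should come from a validation split rather than the test set, and the grid may need to be narrowed to the most promising options first.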
Timeline
| Week | Finish |
| --- | --- |
| 23/4-29/4 | Load in all data, create an overview of that data, and save it in pre-processed, final form |
| 30/4-6/5 | Re-test baselines with all data, write the results down in tables, never touch them again |
| | Re-test CNNs and re-evaluate priorities and goals: is beating the baselines possible? |
| | Writing literature review and methodology |
| 7/5-13/5 | Par2vec and pre-pooling as solutions for long documents |
| 14/5-20/5 | First test of par2vec and pre-pooling |
| | Overview of results, an indication of where most improvements can be achieved, and planning for the most promising tests |
| | Mid-term progress and evaluation report |
| 21/5-27/5 | Testing and changing small pieces |
| 28/5-3/6 | Testing and changing small pieces |
| | Implement Word2Vec on most promising algorithms |
| 4/6-10/6 | Final checks on test set of municipalities |
| | Assist with final implementation for OSF |
| | Results section |
| 11/6-17/6 | Discussion and future work section |
| | Spell-check, layout and abstract section |
| 18/6-24/6 | Thesis |
| 25/6-1/7 | Defence of thesis |