For the moment, my test script in unsupervised-play uses MeanShift from the Python library scikit-learn: http://scikit-learn.org/stable/modules/clustering.html
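For reference, a minimal sketch of that MeanShift step with scikit-learn; the sentence-vector matrix and the bandwidth quantile below are placeholders, not the settings actually used in unsupervised-play:

```python
# Minimal sketch of MeanShift clustering with scikit-learn.
# `sentence_vectors` stands in for whatever embedding matrix the script builds.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

sentence_vectors = np.random.rand(200, 500)  # placeholder: one 500-d vector per sentence

# estimate_bandwidth picks a kernel width from the data; the quantile is a guess
bandwidth = estimate_bandwidth(sentence_vectors, quantile=0.2)
labels = MeanShift(bandwidth=bandwidth).fit_predict(sentence_vectors)

print("clusters found:", len(set(labels)))
```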
Here is what has been done so far:
What can be improved:
Group ambiguous, dispute-relevant terms into named entities. Preprocessing named entities will allow similar sentences to cluster more tightly by reducing the cosine distance between their sentence vectors.
Date: "November" "12 November 2017" "The first of April" "In 2 weeks"
Time: "15h01" "Evening"
Time frequency: "at the beginning of each month" "Paid weekly"
Money: "500$" "five hundred dollars"
Technology: http://brat.nlplab.org/
French datasets are sparser than English ones. This tool allows the user to quickly annotate text in order to give it meaning in a machine-readable format. In particular, it can be applied to precisely the dataset we are using for this project, which will increase our confidence.
For the preprocessing phase, it should be enough to replace the matching terms in the string with "temps", "temps récurrent", "argent" and "date".
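A minimal sketch of that replacement step, assuming simple regular expressions; the patterns below are illustrative stand-ins, not the project's real matchers:

```python
# Illustrative regex-based replacement of matched terms with entity tokens.
# The patterns are simplified examples, not the actual ones used.
import re

ENTITY_PATTERNS = [
    (re.compile(r"\b\d{1,2}\s+novembre\s+\d{4}\b", re.IGNORECASE), "date"),
    (re.compile(r"\b\d{1,2}h\d{2}\b"), "temps"),
    (re.compile(r"\bchaque\s+mois\b", re.IGNORECASE), "temps récurrent"),
    (re.compile(r"\b\d+(?:\s\d{3})*\s?\$"), "argent"),
]

def replace_entities(sentence):
    for pattern, token in ENTITY_PATTERNS:
        sentence = pattern.sub(token, sentence)
    return sentence

print(replace_entities("allègue un retard de paiement pour une somme de 2 301 $"))
# -> "allègue un retard de paiement pour une somme de argent"
```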
I've changed to DBSCAN as our initial clustering implementation, as it produces better clustering output. However, multiple fact clusters are not dense enough to form their own category. We must come up with a solution to this.
My preprocessing catches money, time frequency, time, date, and relative time with no errors so far.
Example: 12 novembre 2017 --> date date date --> date
If a noun phrase has several words describing the same named entity, I reduce it to a single token so that this one meaning does not get too much weight.
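A small sketch of that reduction, assuming the sentence has already been mapped to entity tokens; `collapse_duplicates` is a hypothetical helper name:

```python
# Collapse consecutive duplicate entity tokens ("date date date" -> "date")
# so that a multi-word entity does not get extra weight in the sentence vector.
from itertools import groupby

def collapse_duplicates(tokens):
    return [token for token, _ in groupby(tokens)]

print(collapse_duplicates(["date", "date", "date"]))  # -> ['date']
```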
NER MATRIX
The named entity model was trained using n-grams with a window size of 1; a window size of 2 gave inaccurate results.
['moment', 'dépôt', 'demande', 'date'] au moment du dépôt de sa demande le 17 juin 2015
['allègue', 'retard', 'paiement', 'somme', 'argent'] allègue un retard de paiement pour une somme de 2 301 $
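For context, a hypothetical illustration of what window size = 1 means for a token classifier's features; this is a generic sketch of the idea, not the actual feature extraction used to train the model:

```python
# Generic illustration of context features with window size = 1:
# each token is described by itself plus one neighbour on each side.
def window_features(tokens, i, window=1):
    feats = {"token": tokens[i].lower()}
    for offset in range(1, window + 1):
        feats[f"prev_{offset}"] = tokens[i - offset].lower() if i - offset >= 0 else "<pad>"
        feats[f"next_{offset}"] = tokens[i + offset].lower() if i + offset < len(tokens) else "<pad>"
    return feats

tokens = "le 17 juin 2015".split()
print(window_features(tokens, 2))
# {'token': 'juin', 'prev_1': '17', 'next_1': '2015'}
```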
I'm missing corpus annotation concerning the 'Time' named entity. I will work on that by the end of the week/weekend. Shouldn't take more than 1-2 hours.
I'm using solution 2 from here: https://stackoverflow.com/questions/29760935/how-to-get-vector-for-a-sentence-from-the-word2vec-of-tokens-in-sentence
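That solution boils down to averaging the word2vec vectors of the tokens in the sentence; a minimal sketch with gensim, where the model path is a placeholder:

```python
# Average the word2vec vectors of a sentence's tokens to get a sentence vector.
# Tokens missing from the vocabulary are skipped; the model path is a placeholder.
import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("french_vectors.bin", binary=True)

def sentence_vector(sentence):
    vectors = [model[word] for word in sentence.lower().split() if word in model]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)
```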
My model is working now. It requires more annotation for all the entities to be properly recognized; this should be achievable with 1-2 more hours of work.
Arek suggested considering 12 novembre 2017 et 21 decembre --> date date, as opposed to 12 novembre 2017 et 21 decembre --> date date --> date.
So I will include both so we can play with it. In my opinion, a sentence such as 'J'etais au magasin le 1er septembre et le 2 novembre' should map to <j'etais au magasin date>. I do not believe mapping it to <j'etais au magasin date date> brings any additional value to the topics found in the sentence; furthermore, this method would add unwanted weight to the sentence vector.
The code:
1. Separates facts from decisions
2. Maps named entities
3. Stores both the original sentence and the ner_sentence
4. *Will allow for configurability in terms of grouping named entities together, as mentioned above
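A rough, self-contained sketch of that flow; the helper functions are simplified stand-ins for the real fact/decision splitter and entity mapper:

```python
# Illustrative outline of the steps above; the helpers are simplified stand-ins.
def is_decision(sentence):
    # placeholder heuristic for the fact/decision split
    return sentence.lower().startswith(("accueille", "rejette"))

def map_entities(sentence):
    # placeholder for the named-entity mapping step
    return sentence.replace("17 juin 2015", "date")

def group_entities(sentence):
    # optional grouping of repeated entity tokens ("date date" -> "date")
    grouped = []
    for token in sentence.split():
        if not grouped or grouped[-1] != token:
            grouped.append(token)
    return " ".join(grouped)

def preprocess(sentences, group_consecutive_entities=True):
    facts, decisions = [], []
    for sentence in sentences:
        ner_sentence = map_entities(sentence)                 # 2. map named entities
        if group_consecutive_entities:                        # 4. configurable grouping
            ner_sentence = group_entities(ner_sentence)
        record = {"sentence": sentence,                       # 3. keep both versions
                  "ner_sentence": ner_sentence}
        (decisions if is_decision(sentence) else facts).append(record)  # 1. split
    return facts, decisions
```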
Looking at either affinity propagation or HDBSCAN.
I think affinity propagation is already similar to k-means, so I will go with HDBSCAN since it seems a little better than DBSCAN:
http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
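A minimal sketch of the HDBSCAN run and the cluster/noise counts reported below; min_cluster_size and the embedding matrix are placeholders, not the actual settings:

```python
# Cluster sentence vectors with HDBSCAN and count clusters and noise points.
# min_cluster_size here is a placeholder value, not the project's setting.
import numpy as np
import hdbscan

sentence_vectors = np.random.rand(1000, 500)  # placeholder embedding matrix

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(sentence_vectors)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Number_of_clusters: {n_clusters} Noise: {n_noise}")
```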
| Files used | Clusters | Noise |
|---|---|---|
| 10 | 6 | 7 |
| 100 | 26 | 33 |
| 500 | 84 | 132 |
| 1000 | 144 | 268 |
| 2000 | 269 | 434 |
| 4000 | 465 | 800 |
| 8000 | 794 | 1381 |
| 16000 | 1407 | 2485 |
| 32000 | 2470 | 4322 |

| Files used | Clusters | Noise |
|---|---|---|
| 10 | 0 | 49 |
| 100 | 9 | 140 |
| 500 | 35 | 380 |
| 1000 | 54 | 684 |
| 2000 | 82 | 1170 |
| 4000 | 122 | 1948 |
| 8000 | 197 | 3215 |
| 16000 | 325 | 5518 |

| Files used | Clusters | Noise |
|---|---|---|
| 10 | 31 | 54 |
| 100 | 279 | 665 |
| 500 | 1048 | 3353 |
| 1000 | 1977 | 6190 |
| 2000 | 3826 | 12453 |
| 4000 | 6869 | 22380 |

| Files used | Clusters | Noise |
|---|---|---|
| 10 | 8 | 12 |
| 100 | 35 | 44 |
| 500 | 111 | 140 |
| 1000 | 184 | 268 |
| 2000 | 306 | 449 |
| 4000 | 530 | 781 |
| 8000 | 863 | 1313 |
| 16000 | 1459 | 2308 |
| 32000 | 2494 | 3889 |
Evaluated the appropriateness of different word2vec pre-trained models
| Number of missing words | Vector model name |
|---|---|
| 22049 | frWac_no_postag_no_phrase_500_cbow_cut100 (Stemmed words) |
| 16923 | frWac_no_postag_no_phrase_500_cbow_cut100 (Non stemmed words) |
| 22049 | frWac_no_postag_no_phrase_500_skip_cut100 (Stemmed words) |
| 16923 | frWac_no_postag_no_phrase_500_skip_cut100 (Non stemmed words) |
| 19542 | frWac_no_postag_no_phrase_700_skip_cut50 (Stemmed words) |
| 14933 | frWac_no_postag_no_phrase_700_skip_cut50 (Non stemmed words) |
| 28455 | frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100 (Stemmed words) |
| 15129 | frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100 (Non stemmed words) |
| 28455 | frWiki_no_lem_no_postag_no_phrase_1000_skip_cut100 (Stemmed words) |
| 15129 | frWiki_no_lem_no_postag_no_phrase_1000_skip_cut100 (Non stemmed words) |
| 30010 | frWiki_no_phrase_no_postag_1000_skip_cut100 (Stemmed words) |
| 23630 | frWiki_no_phrase_no_postag_1000_skip_cut100 (Non stemmed words) |
| 30010 | frWiki_no_phrase_no_postag_700_cbow_cut100 (Stemmed words) |
| 23630 | frWiki_no_phrase_no_postag_700_cbow_cut100 (Non stemmed words) |
| 21708 | frWac_non_lem_no_postag_no_phrase_500_skip_cut100 (Stemmed words) |
| 9066 | frWac_non_lem_no_postag_no_phrase_500_skip_cut100 (Non stemmed words) |
The most appropriate vector model for our dataset is frWac_non_lem_no_postag_no_phrase_500_skip_cut100, found here: http://embeddings.org/frWac_non_lem_no_postag_no_phrase_500_skip_cut100.bin
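The comparison above amounts to counting how many corpus tokens fall outside each model's vocabulary; a sketch of that check with gensim, where the token set is a placeholder for the real corpus:

```python
# Count how many corpus tokens are missing from a pre-trained word2vec vocabulary.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "frWac_non_lem_no_postag_no_phrase_500_skip_cut100.bin", binary=True)

corpus_tokens = {"allègue", "retard", "paiement", "somme", "argent"}  # placeholder corpus vocabulary
missing = [token for token in corpus_tokens if token not in model]
print(len(missing), "|", "frWac_non_lem_no_postag_no_phrase_500_skip_cut100")
```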
The bad: the cluster size parameter needs to be tuned to get a better result, which is time-consuming, and the elbow method isn't very useful here.
The good: the majority of the clusters are good, although there is obviously some noise as well.
All runs used 1000 files.

| Learning rate | Perplexity | Clusters | Noise |
|---|---|---|---|
| 200 | 30 | 288 | 437 |
| 200 | 28 | 268 | 448 |
| 200 | 26 | 271 | 464 |
| 200 | 24 | 265 | 449 |
| 200 | 22 | 254 | 406 |
| 200 | 20 | 255 | 342 |
| 200 | 30 | 283 | 484 |
| 300 | 30 | 272 | 540 |
| 400 | 30 | 262 | 519 |
| 500 | 30 | 270 | 486 |
| 600 | 30 | 271 | 443 |
| 700 | 30 | 265 | 418 |
| 200 | 20 | 272 | 342 |
| 300 | 20 | 259 | 376 |
| 400 | 20 | 254 | 352 |
| 500 | 20 | 261 | 362 |
| 600 | 20 | 268 | 384 |
| 700 | 20 | 314 | 401 |
| 700 | 30 | 257 | 410 |
| 700 | 28 | 263 | 392 |
| 700 | 26 | 258 | 405 |
| 700 | 24 | 274 | 369 |
| 700 | 22 | 270 | 341 |
| 700 | 20 | 284 | 528 |
| 700 | 30 | 265 | 458 |
| 600 | 28 | 264 | 472 |
| 500 | 26 | 275 | 448 |
| 400 | 24 | 272 | 443 |
| 300 | 22 | 266 | 372 |
| 200 | 20 | 275 | 391 |
| 200 | 30 | 276 | 467 |
| 300 | 28 | 272 | 477 |
| 400 | 26 | 267 | 427 |
| 500 | 24 | 273 | 391 |
| 600 | 22 | 247 | 403 |
| 700 | 20 | 291 | 430 |
| 200 | 30 | 284 | 482 |
All runs used 1000 files.

| Learning rate | Perplexity | Clusters | Noise |
|---|---|---|---|
| 200 | 32 | 272 | 455 |
| 200 | 34 | 285 | 421 |
| 200 | 36 | 283 | 495 |
| 200 | 38 | 293 | 420 |
| 200 | 40 | 291 | 465 |
| 200 | 5 | 209 | 763 |
| 200 | 10 | 198 | 217 |
| 200 | 12 | 201 | 250 |
| 200 | 14 | 215 | 271 |
| 200 | 16 | 238 | 241 |
| 200 | 18 | 255 | 345 |
Reducing perplexity reduces noise; the learning rate barely has any impact.
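A sketch of the sweep behind these numbers, assuming the perplexity and learning rate refer to a t-SNE projection (scikit-learn's TSNE) followed by DBSCAN on the 2-D embedding; the eps/min_samples values and the grid are placeholders:

```python
# Sweep t-SNE perplexity and learning rate, then cluster the 2-D embedding with
# DBSCAN and count clusters/noise. eps, min_samples and the grid are placeholders.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

sentence_vectors = np.random.rand(1000, 500)  # placeholder embedding matrix

for learning_rate in (200, 700):
    for perplexity in (20, 30):
        embedded = TSNE(n_components=2, perplexity=perplexity,
                        learning_rate=learning_rate).fit_transform(sentence_vectors)
        labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedded)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        print(f"Number_of_clusters: {n_clusters} Noise: {n_noise} "
              f"Learning rate: {learning_rate} Perplexity: {perplexity}")
```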
182 categories, 5534 seconds clustering time, 66 seconds vectorization time, 29779/59317 noise
182 categories, 5994 seconds clustering time, 89 seconds vectorization time, 39220/72805 noise
105 clusters 5700 / 6560 noise
125 clusters 3700 / 6560 noise
48 clusters 4000 / 5059 noise
79 clusters 4500 / 6560 noise
86 clusters 4700 / 6560 noise
167 clusters 3400 / 5059 noise
418 clusters, invalid clusters (dates only, weird time things, etc.), 6646 / 10294 noise
262 clusters 5400 / 6560 noise
184 clusters 5800 / 6560 noise
106 clusters 6100 / 6560 noise
270 clusters 4300 / 6560 noise
278 clusters 4150 / 6560 noise
Sample size: 10,000
It seems fairly evident that an epsilon value of 0.4 is needed to capture our data.
Furthermore, the library does not seem to handle floating-point numbers well for plotting. Reading the data manually indicates that 1.1 is a better epsilon value.
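One common way to read off a candidate epsilon is a k-nearest-neighbour distance plot and its elbow; a sketch assuming scikit-learn and matplotlib, with k and the data matrix as placeholders:

```python
# k-distance plot for picking a DBSCAN epsilon: sort each point's distance to
# its k-th nearest neighbour and look for the elbow. k is a placeholder here.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

sentence_vectors = np.random.rand(10000, 2)  # placeholder for the projected data

k = 5
distances, _ = NearestNeighbors(n_neighbors=k).fit(sentence_vectors).kneighbors(sentence_vectors)
k_distances = np.sort(distances[:, -1])

plt.plot(k_distances)
plt.ylabel(f"distance to {k}th nearest neighbour")
plt.xlabel("points sorted by distance")
plt.show()
```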
Wow, craziest thing I have seen in my life. Trying with cluster size = 1500.
Using actual math values
Received verbal confirmation during the meeting that this story can be closed. Pinging @naregeff to leave a comment.
Signing off, it works!
Description
As a user, I would like the system to determine categories of facts based on precedent data.
Major work will be conducted on #105. The same approach will then be used for #106.
Scope of Work
Story Points
Priority
Risk
Acceptance Criteria