Cyberjusticelab / JusticeAI

JusticeAI (ProceZeus) is a web chat bot that aims to facilitate access to judicial proceedings involving Quebec tenant/landlord law
https://cyberjusticelab.github.io/JusticeAI/docs/rendered/
MIT License

Fact clustering #105

Closed · arekmano closed this 6 years ago

arekmano commented 7 years ago

Description

As a user, I would like the system to determine categories of facts based on precedent data.

Major work will be conducted on #105. The same approach will then be used for #106.

Scope of Work

Story Points

Priority

Risk

Acceptance Criteria

arekmano commented 7 years ago

For the moment, my test script in unsupervised-play uses MeanShift from the Python library scikit-learn: http://scikit-learn.org/stable/modules/clustering.html

Here is what has been done so far (a rough sketch of the pipeline follows the list):

  1. Parse facts from each Precedent (needs major improvement; it currently truncates some facts half-way).
  2. Vectorize each fact with word2vec, averaging the word vectors over all words in the sentence (https://www.youtube.com/watch?v=ERibwqs9p38).
  3. Cluster the sentence vectors with MeanShift (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html#sklearn.cluster.MeanShift).
  4. Output all the sentences into files.
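A minimal sketch of steps 2-4, assuming gensim for the pre-trained word2vec model and scikit-learn's MeanShift; the model path and the fact strings are placeholders, and the fact-parsing step (1) is omitted.

```python
# Sketch of steps 2-4: average word2vec vectors per fact, cluster with MeanShift.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import MeanShift

# Placeholder path to a pre-trained French word2vec model in binary format.
model = KeyedVectors.load_word2vec_format("french_word2vec.bin", binary=True)

facts = [
    "le locataire doit 500 $ de loyer",
    "la locatrice demande la résiliation du bail",
    "le locataire a quitté le logement en novembre",
]  # placeholder facts; in the real pipeline these come from parsed precedents

def sentence_vector(sentence):
    """Average the word2vec vectors of the in-vocabulary words of a sentence."""
    words = [w for w in sentence.lower().split() if w in model]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model[w] for w in words], axis=0)

X = np.array([sentence_vector(f) for f in facts])
labels = MeanShift().fit_predict(X)

# Write each cluster's sentences out, one file per cluster (step 4).
for label in set(labels):
    with open(f"cluster_{label}.txt", "w") as f:
        f.write("\n".join(fact for fact, l in zip(facts, labels) if l == label))
```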
arekmano commented 7 years ago

What can be improved:

Samuel-Campbell commented 7 years ago

Preprocessing

Group dispute-relevant but ambiguous terms into named entities. Preprocessing named entities this way will let similar terms cluster together more readily by reducing the cosine distance between their sentence vectors.

Date: "November" "12 November 2017" "The first of April" "In 2 weeks"

Time: "15h01" "Evening"

Time frequency: "at the beginning of each month" "Paid weekly"

Money: "500$" "five hundred dollars"

Samuel-Campbell commented 7 years ago

Creating a corpus

Technology: http://brat.nlplab.org/

French datasets are sparser than English ones. This tool lets the user quickly annotate text to give it meaning in a machine-readable format. In particular, it can be applied directly to the dataset we are using for this project, which will increase confidence in the results.

arekmano commented 7 years ago

For the preprocessing phase, it should be enough to replace the matching terms in the string with "temps", "temps récurrent", "argent" and "date".
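A rough regex-based sketch of that substitution, covering the money/date/time/frequency categories from the earlier comment; the real pipeline uses the trained NER model, and these patterns are illustrative only, not exhaustive.

```python
# Illustrative regex substitution of matched terms by placeholder tokens.
# The actual system uses a trained NER model; these patterns only cover
# the simplest surface forms of each category.
import re

REPLACEMENTS = [
    (re.compile(r"\bau début de chaque mois\b", re.IGNORECASE), "temps récurrent"),
    (re.compile(r"\b\d{1,2}h\d{2}\b"), "temps"),                 # e.g. "15h01"
    (re.compile(r"\b\d+(?:\s\d{3})*\s?\$"), "argent"),           # e.g. "2 301 $"
    (re.compile(r"\b\d{1,2}\s\w+\s\d{4}\b"), "date"),            # e.g. "12 novembre 2017"
]

def preprocess(sentence):
    for pattern, token in REPLACEMENTS:
        sentence = pattern.sub(token, sentence)
    return sentence

print(preprocess("allègue un retard de paiement pour une somme de 2 301 $ le 12 novembre 2017"))
# -> allègue un retard de paiement pour une somme de argent le date
```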

arekmano commented 7 years ago

I've changed to using DBSCAN as our initial clustering implementation, as it produces better clustering output. However, there are multiple fact clusters that are not dense enough to have their own category. We must come up with a solution to this.
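A minimal sketch of the swap to DBSCAN, counting how many facts end up as noise (label -1); the eps/min_samples values and the vector file are placeholders, not tuned settings.

```python
# Sketch: cluster sentence vectors with DBSCAN and count the noise points.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.load("sentence_vectors.npy")  # placeholder file of averaged word2vec vectors

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"{n_clusters} clusters, {n_noise}/{len(labels)} facts labelled as noise")
```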

Samuel-Campbell commented 7 years ago

My preprocessing is catching money, time frequency, time, date, and relative time with no errors so far.

Example: 12 novembre 2017 --> date date date --> date

If a noun phrase has several words describing the same named entity, I reduce them to a single word so that this one meaning does not get too much weight.


NER MATRIX

The named-entity model was trained using n-grams with a window size of 1; a window size of 2 gave inaccurate results.
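For clarity, a rough illustration of what a token window of size 1 means here (each token is described by itself plus its immediate neighbours); the actual NER training code is not shown in this thread, so this feature layout is only an assumption.

```python
# Illustration of a window of size 1 around each token (hypothetical
# feature layout; the real NER training code is not shown in this thread).
def window_features(tokens, i, size=1):
    feats = {"word": tokens[i].lower()}
    for offset in range(1, size + 1):
        feats[f"prev_{offset}"] = tokens[i - offset].lower() if i - offset >= 0 else "<pad>"
        feats[f"next_{offset}"] = tokens[i + offset].lower() if i + offset < len(tokens) else "<pad>"
    return feats

tokens = "le 12 novembre 2017".split()
print(window_features(tokens, 1))
# {'word': '12', 'prev_1': 'le', 'next_1': 'novembre'}
```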

Samuel-Campbell commented 7 years ago

Example output (copy-pasted):

['moment', 'dépôt', 'demande', 'date'] au moment du dépôt de sa demande le 17 juin 2015

['allègue', 'retard', 'paiement', 'somme', 'argent'] allègue un retard de paiement pour une somme de 2 301 $

Samuel-Campbell commented 7 years ago

I'm missing corpus annotation concerning the 'Time' named entity. I will work on that by the end of the week/weekend. Shouldn't take more than 1-2 hours.

arekmano commented 7 years ago

I'm using solution 2 from here: https://stackoverflow.com/questions/29760935/how-to-get-vector-for-a-sentence-from-the-word2vec-of-tokens-in-sentence

Samuel-Campbell commented 6 years ago

Changes

My model is working now. It requires more annotation for all the entities to be properly recognized; that should be achievable with 1-2 more hours of work.

Arek mentioned considering: 12 novembre 2017 et 21 decembre --> date date, as opposed to 12 novembre 2017 et 21 decembre --> date date --> date

So I will include both so we can play with it. In my opinion, a sentence such as 'J'etais au magasin le 1er septembre et le 2 novembre' should map to <j'etais au magasin date>. I do not believe mapping it to <j'etais au magasin date date> brings any additional value to the topics found in the sentence; furthermore, that method would add unwanted weight to the sentence vector.

Code

The code:

1. Separates facts from decisions
2. Maps named entities
3. Stores both the original sentence and the ner_sentence
4. Will allow for configurability in how NER tokens are grouped together, as mentioned above (see the sketch below)
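A small sketch of the configurable grouping mentioned in point 4: always collapse consecutive identical NER tokens, and optionally collapse repeats across the whole sentence; the function and flag names are hypothetical.

```python
# Hypothetical sketch of configurable NER-token grouping: always collapse
# runs like "date date date" -> "date", and optionally drop repeated NER
# tokens anywhere in the sentence ("... date et date" -> "... date et").
from itertools import groupby

NER_TOKENS = {"date", "temps", "temps récurrent", "argent"}

def collapse_ner_tokens(tokens, collapse_across_sentence=False):
    collapsed = [key for key, _ in groupby(tokens)]  # adjacent duplicates only
    if not collapse_across_sentence:
        return collapsed
    seen, result = set(), []
    for tok in collapsed:
        if tok in NER_TOKENS and tok in seen:
            continue
        seen.add(tok)
        result.append(tok)
    return result

sentence = "j'etais au magasin date et date".split()
print(collapse_ner_tokens(sentence))                                 # keeps both "date" tokens
print(collapse_ner_tokens(sentence, collapse_across_sentence=True))  # keeps only the first
```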

Samuel-Campbell commented 6 years ago

Looking at either:

I think affinity propagation is already similar to k-means, so I will go with HDBSCAN since it seems a little better than DBSCAN:

http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
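A minimal HDBSCAN sketch using the library linked above; min_cluster_size and the vector file are placeholders, not tuned settings.

```python
# Sketch: cluster the sentence vectors with HDBSCAN and report cluster/noise counts.
import numpy as np
import hdbscan

X = np.load("sentence_vectors.npy")  # placeholder file of averaged word2vec vectors

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(X)

print("Number_of_clusters:", labels.max() + 1)   # labels run 0..k-1; -1 is noise
print("Noise:", int(np.sum(labels == -1)))
```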

arekmano commented 6 years ago

[Attached image: 20171103_165231]

Samuel-Campbell commented 6 years ago

HDBSCAN METRICS

| Number of files used | Number of clusters | Noise |
| --- | --- | --- |
| 10 | 6 | 7 |
| 100 | 26 | 33 |
| 500 | 84 | 132 |
| 1000 | 144 | 268 |
| 2000 | 269 | 434 |
| 4000 | 465 | 800 |
| 8000 | 794 | 1381 |
| 16000 | 1407 | 2485 |
| 32000 | 2470 | 4322 |

CONCLUSION



| Number of files used | Number of clusters | Noise |
| --- | --- | --- |
| 10 | 0 | 49 |
| 100 | 9 | 140 |
| 500 | 35 | 380 |
| 1000 | 54 | 684 |
| 2000 | 82 | 1170 |
| 4000 | 122 | 1948 |
| 8000 | 197 | 3215 |
| 16000 | 325 | 5518 |


Using all facts

| Number of files used | Number of clusters | Noise |
| --- | --- | --- |
| 10 | 31 | 54 |
| 100 | 279 | 665 |
| 500 | 1048 | 3353 |
| 1000 | 1977 | 6190 |
| 2000 | 3826 | 12453 |
| 4000 | 6869 | 22380 |

Conclusion


Added Comma Separator

| Number of files used | Number of clusters | Noise |
| --- | --- | --- |
| 10 | 8 | 12 |
| 100 | 35 | 44 |
| 500 | 111 | 140 |
| 1000 | 184 | 268 |
| 2000 | 306 | 449 |
| 4000 | 530 | 781 |
| 8000 | 863 | 1313 |
| 16000 | 1459 | 2308 |
| 32000 | 2494 | 3889 |

Conclusion

Samuel-Campbell commented 6 years ago

Affinity Propagation

arekmano commented 6 years ago

What

Evaluated the appropriateness of different word2vec pre-trained models

Word2Vec model evaluation

| Number of missing words | Vector model name |
| --- | --- |
| 22049 | frWac_no_postag_no_phrase_500_cbow_cut100 (stemmed words) |
| 16923 | frWac_no_postag_no_phrase_500_cbow_cut100 (non-stemmed words) |
| 22049 | frWac_no_postag_no_phrase_500_skip_cut100 (stemmed words) |
| 16923 | frWac_no_postag_no_phrase_500_skip_cut100 (non-stemmed words) |
| 19542 | frWac_no_postag_no_phrase_700_skip_cut50 (stemmed words) |
| 14933 | frWac_no_postag_no_phrase_700_skip_cut50 (non-stemmed words) |
| 28455 | frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100 (stemmed words) |
| 15129 | frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100 (non-stemmed words) |
| 28455 | frWiki_no_lem_no_postag_no_phrase_1000_skip_cut100 (stemmed words) |
| 15129 | frWiki_no_lem_no_postag_no_phrase_1000_skip_cut100 (non-stemmed words) |
| 30010 | frWiki_no_phrase_no_postag_1000_skip_cut100 (stemmed words) |
| 23630 | frWiki_no_phrase_no_postag_1000_skip_cut100 (non-stemmed words) |
| 30010 | frWiki_no_phrase_no_postag_700_cbow_cut100 (stemmed words) |
| 23630 | frWiki_no_phrase_no_postag_700_cbow_cut100 (non-stemmed words) |
| 21708 | frWac_non_lem_no_postag_no_phrase_500_skip_cut100 (stemmed words) |
| 9066 | frWac_non_lem_no_postag_no_phrase_500_skip_cut100 (non-stemmed words) |

Conclusion

The most appropriate vector model for our dataset is frWac_non_lem_no_postag_no_phrase_500_skip_cut100, found here: http://embeddings.org/frWac_non_lem_no_postag_no_phrase_500_skip_cut100.bin
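A sketch of how the missing-word counts above could be reproduced with gensim, assuming the model is loaded from its binary file; the corpus token list is a placeholder.

```python
# Sketch: count corpus words that are absent from a pre-trained model's vocabulary.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "frWac_non_lem_no_postag_no_phrase_500_skip_cut100.bin", binary=True)

corpus_tokens = ["locataire", "bail", "résiliation", "tribunal"]  # placeholder vocabulary

missing = [w for w in set(corpus_tokens) if w not in model]
print(f"Number of missing words: {len(missing)}")
```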

Samuel-Campbell commented 6 years ago

Manifold Learning (dimension reduction, noise filtering)

Objective

Raw data

[Attached plot: figure_1]

Outcomes

[Attached plots: figure_1, per40, l200, f20, n]

TaimoorRana commented 6 years ago

Problem Approach

The good: the majority of the clusters are good. Obviously, there's noise as well.

Improvement Ideas

Samuel-Campbell commented 6 years ago

Manifold Metrics

All runs use 1000 files.

| Learning rate | Perplexity | Number of clusters | Noise |
| --- | --- | --- | --- |
| 200 | 30 | 288 | 437 |
| 200 | 28 | 268 | 448 |
| 200 | 26 | 271 | 464 |
| 200 | 24 | 265 | 449 |
| 200 | 22 | 254 | 406 |
| 200 | 20 | 255 | 342 |
| 200 | 30 | 283 | 484 |
| 300 | 30 | 272 | 540 |
| 400 | 30 | 262 | 519 |
| 500 | 30 | 270 | 486 |
| 600 | 30 | 271 | 443 |
| 700 | 30 | 265 | 418 |
| 200 | 20 | 272 | 342 |
| 300 | 20 | 259 | 376 |
| 400 | 20 | 254 | 352 |
| 500 | 20 | 261 | 362 |
| 600 | 20 | 268 | 384 |
| 700 | 20 | 314 | 401 |
| 700 | 30 | 257 | 410 |
| 700 | 28 | 263 | 392 |
| 700 | 26 | 258 | 405 |
| 700 | 24 | 274 | 369 |
| 700 | 22 | 270 | 341 |
| 700 | 20 | 284 | 528 |
| 700 | 30 | 265 | 458 |
| 600 | 28 | 264 | 472 |
| 500 | 26 | 275 | 448 |
| 400 | 24 | 272 | 443 |
| 300 | 22 | 266 | 372 |
| 200 | 20 | 275 | 391 |
| 200 | 30 | 276 | 467 |
| 300 | 28 | 272 | 477 |
| 400 | 26 | 267 | 427 |
| 500 | 24 | 273 | 391 |
| 600 | 22 | 247 | 403 |
| 700 | 20 | 291 | 430 |
| 200 | 30 | 284 | 482 |
| 200 | 32 | 272 | 455 |
| 200 | 34 | 285 | 421 |
| 200 | 36 | 283 | 495 |
| 200 | 38 | 293 | 420 |
| 200 | 40 | 291 | 465 |
| 200 | 5 | 209 | 763 |
| 200 | 10 | 198 | 217 |
| 200 | 12 | 201 | 250 |
| 200 | 14 | 215 | 271 |
| 200 | 16 | 238 | 241 |
| 200 | 18 | 255 | 345 |

Conclusion

Reducing perplexity reduces noise. Learning rate barely has any impact.
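The manifold method is not named explicitly in the thread, but the perplexity and learning-rate parameters match scikit-learn's t-SNE, so that is assumed in this sketch of reducing the sentence vectors before clustering.

```python
# Sketch (assumption: the manifold step is t-SNE): reduce the sentence
# vectors to 2 dimensions, then cluster the embedding with DBSCAN.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

X = np.load("sentence_vectors.npy")  # placeholder file of averaged word2vec vectors

X_2d = TSNE(n_components=2, perplexity=20, learning_rate=200).fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X_2d)

print("Number_of_clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise:", int(np.sum(labels == -1)))
```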

arekmano commented 6 years ago

DBSCAN Tests

10000 precedents

- 182 categories
- 5534 seconds clustering time
- 66 seconds vectorization time
- 29779/59317 noise

10000 precedents (with synonyms)

- 182 categories
- 5994 seconds clustering time
- 89 seconds vectorization time
- 39220/72805 noise

arekmano commented 6 years ago

DBSCAN Parameter testing

All runs on 1k precedents.

| min_samples | eps | Split | Clusters | Noise |
| --- | --- | --- | --- | --- |
| 3 | 0.5 | none | 105 | 5700 / 6560 |
| 3 | 0.3 | sentence | 125 | 3700 / 6560 |
| 4 | 0.5 | none | 48 | 4000 / 5059 |
| 4 | 0.3 | sentence | 79 | 4500 / 6560 |
| 4 | 0.2 | sentence | 86 | 4700 / 6560 |
| 2 | 0.5 | none | 167 | 3400 / 5059 |
| 2 | 0.5 | sentence + comma | 418 (invalid clusters: dates only, weird time artifacts, etc.) | 6646 / 10294 |
| 2 | 0.5 | sentence | 262 | 5400 / 6560 |
| 2 | 0.6 | sentence | 184 | 5800 / 6560 |
| 2 | 0.7 | sentence | 106 | 6100 / 6560 |
| 2 | 0.4 | sentence | 270 | 4300 / 6560 |
| 2 | 0.3 | sentence | 278 | 4150 / 6560 |
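A sketch of the kind of grid sweep behind the numbers above; the eps and min_samples ranges are illustrative, and the vector file is a placeholder.

```python
# Sketch: sweep DBSCAN parameters and report clusters and noise for each setting.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.load("sentence_vectors.npy")  # placeholder file of averaged word2vec vectors

for min_samples in (2, 3, 4):
    for eps in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        print(f"min_samples={min_samples} eps={eps}: "
              f"{n_clusters} clusters, {n_noise}/{len(labels)} noise")
```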

Samuel-Campbell commented 6 years ago

Optimizing DBSCAN for facts



Conclusion

Samuel-Campbell commented 6 years ago

Optimizing DBSCAN Decisions

Samuel-Campbell commented 6 years ago

Running Out of Memory

arekmano commented 6 years ago

Received verbal confirmation about closing this story during the meeting. Pinging @naregeff to leave a comment.

naregeff commented 6 years ago

Signing off, it works!