ajitrajasekharan / bert_vector_clustering

Clustering learned BERT vectors for downstream tasks like unsupervised NER, unsupervised sentence embeddings etc.
MIT License

Label Cluster for new NER-Type in other Language #3

Open JanFreise opened 2 years ago

JanFreise commented 2 years ago

I don't quite get the directions in the readme for creating a new bootstrapped labeling for unsupervised NER in a different language and with different labels/terms.

Step 1: I emptied the files "labels.txt" and "bootstrap_entities.txt". Then I tried two variants for the new bootstrapped labeling:

a) just run with an empty seedword list
b) created a new bootstrap_entities.txt with new seed words (all part of my vocab.txt)
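For reference, the check behind "all part of my vocab.txt" in variant (b) can be sketched as below. This is only a minimal sketch, assuming vocab.txt holds one token per line as in standard BERT vocabularies. The function names and the sample words are hypothetical and not part of the repository, and the layout of bootstrap_entities.txt is deliberately not assumed, so the seed words are passed in as a plain list.

```python
def read_tokens(path):
    """Read one token per line (assumed vocab.txt layout), skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def missing_from_vocab(seed_words, vocab_tokens):
    """Return the seed words that do not appear in the vocabulary, in input order."""
    vocab = set(vocab_tokens)  # set lookup keeps the check fast for large vocabularies
    return [word for word in seed_words if word not in vocab]


if __name__ == "__main__":
    # Hypothetical sample data; with real files one would call read_tokens("vocab.txt").
    sample_vocab = ["Berlin", "Haus", "Siemens"]
    sample_seeds = ["Berlin", "Siemens", "Hamburg"]
    print(missing_from_vocab(sample_seeds, sample_vocab))
```

An empty result would mean every seed word is present in the vocabulary. Note that the comparison is case-sensitive, which matters for a cased vocabulary.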

Then I called run.sh with Option=1 and Threshold=0 to generate the vectors and label them according to my seed words.

Upon finishing, a lot of files are written or updated, e.g. adaptive_debug_pivots.txt, inferred.txt, labels.txt, pivots.json, pivots.txt.

In the Readme it says: "Cluster (run.sh with option 1 followed by 0) and then examine cluster pivots to label them. Then rerun clustering and select candidates from inferred.txt. "

So it's not clear to me which file is meant by "examine cluster pivots".

At first I assumed I had to look at adaptive_debug_pivots.txt, so I started correcting labels in that file.

When I restart clustering (with the same options as above: run.sh with option 1 followed by 0), the same outputs as in Step 1 are simply regenerated identically, so all my edits were overwritten. inferred.txt basically never contains any entries. So I must be doing something wrong.
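To confirm which files a rerun actually overwrites, the outputs can be fingerprinted before and after the run. This is a minimal sketch, not something taken from the repository: the helper names are hypothetical, and the paths passed in are assumptions about where the files listed above end up relative to the working directory.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each existing file path to the SHA-256 digest of its contents."""
    digests = {}
    for path in map(Path, paths):
        if path.is_file():  # files that do not exist yet are simply left out
            digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before, after):
    """Return sorted paths whose digest differs or that exist in only one snapshot."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))


if __name__ == "__main__":
    # Assumed locations; adjust to wherever run.sh writes these files.
    outputs = ["adaptive_debug_pivots.txt", "inferred.txt", "labels.txt",
               "pivots.json", "pivots.txt"]
    before = snapshot(outputs)
    input("Rerun run.sh now, then press Enter... ")
    print(changed_files(before, snapshot(outputs)))
```

Taking the first snapshot right after the manual edits and the second one after the rerun shows which edited files were replaced. A file that does not show up as changed either was not touched or was regenerated byte for byte.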

Then I checked run.sh, which contains

python dist_v2.py pwd 0 vocab.txt bert_vectors.txt 0 results/labels.txt results/stats_dict.txt preserve_1_2_grams.txt glue_words.txt bootstrap_entities.txt

and figured that bootstrap_entities.txt basically contains the pivot clusters. So I'm pretty much lost now.

Could you please specify more precisely how I can iteratively improve the labeling for the generated clusters?