Franck-Dernoncourt / NeuroNER

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.
http://neuroner.com
MIT License

Questions about ANN #11

Closed Gregory-Howard closed 7 years ago

Gregory-Howard commented 7 years ago

Hi! Is it possible for NeuroNER to learn from a vocabulary rather than from expressions? Let me explain my idea. In your video (https://www.youtube.com/watch?v=BmRYkxumDvU) we can see token/word embeddings. Can NeuroNER learn more from them? Example: I have 30,000 different entities to learn (for example cities or universities). I can't provide 30,000 different expressions; even covering 10% of the entities means 3,000 expressions, which is a lot if they are written by a human. Let's say I can generate 1,000-2,000 different expressions that will make up my valid set. I will miss a lot of the information carried by the word vectors. But if I complete the training with part of the vocabulary (say 40% of the 30,000), I should have a large enough sample to find the other entities. Is this possible?

Franck-Dernoncourt commented 7 years ago

I'm not sure I understand your idea correctly. Do you have in mind to train with just a list of words (a vocabulary) that have entity labels, and then use the model on sentences containing some of these words? If so, you can definitely try to use each word in the vocabulary, along with its label, as a one-word training sentence, but I don't think it will perform very well on a test set that looks quite different from the train set. For example, one issue I can think of is the absence of negative examples in the training set (unless your list of words also contains negative examples): NeuroNER might learn to predict each word as always having some entity label and never O. Another issue might be that NeuroNER won't be able to learn the context in which the entities typically appear.
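To make the one-word-sentence idea concrete, here is a minimal sketch of how such a training file could be generated. It assumes a CoNLL-style layout (one token per line, label in the last space-separated column, blank line between sentences) and BIO tags; the function names, the `Department` label and the negative words are hypothetical and chosen only for illustration.

```python
def to_conll(phrase, label=None):
    """Render one phrase as a CoNLL-style sentence (one 'token tag' row per token)."""
    rows = []
    for i, token in enumerate(phrase.split()):
        if label is None:
            tag = "O"  # negative example: every token is outside any entity
        else:
            tag = ("B-" if i == 0 else "I-") + label
        rows.append(f"{token} {tag}")
    return "\n".join(rows)


def build_training_text(entities, negatives, label):
    """Concatenate one-phrase sentences, separated by blank lines."""
    sentences = [to_conll(entity, label) for entity in entities]
    sentences += [to_conll(word) for word in negatives]
    return "\n\n".join(sentences) + "\n"


# Hypothetical vocabulary: two entities and one negative word.
training_text = build_training_text(["Ain", "Nord"], ["rivière"], "Department")
print(training_text)
```

Including the `negatives` list is what addresses the first concern above: without it, every training sentence carries an entity label and the model never sees O.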

However, we have not tried this, and it might be interesting to try. The quality of the results depends on your data, so it may work well enough in your case. If you want to explore this direction, we recommend disabling the CRF layer by setting use_crf=False.
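As a rough sketch of the `use_crf` change, the snippet below flips the setting in an INI-style parameters file using Python's standard `configparser`. The section name `ann` and the two-line file excerpt are assumptions made for illustration; check them against your own parameters file. Disabling the CRF presumably makes sense here because one-word sentences contain no label transitions for it to learn.

```python
import configparser
import io

# Hypothetical excerpt of a parameters file; the section name is an assumption.
EXAMPLE_INI = """\
[ann]
use_crf = True
"""


def disable_crf(ini_text, section="ann"):
    """Return the INI text with use_crf set to False in the given section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    parser.set(section, "use_crf", "False")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = disable_crf(EXAMPLE_INI)
print(updated)
```

Note that `configparser` drops comments when it rewrites a file, so editing the single line by hand may be preferable for a heavily commented parameters file.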

Gregory-Howard commented 7 years ago

Yeah, that's the point. I hadn't thought about training on a list of words. As a real example, I want NeuroNER to learn French departments (https://fr.wikipedia.org/wiki/Liste_des_départements_français). But Nord, Indre and Eure have several meanings, such as rivers or cardinal directions. So I make a file with one sentence per department, plus some exceptions, like:

# In the train set:
# The list of words
Ardèche|Department
Ain|Department
Indre|Department
Nord|Department
...
# Expressions similar to the test set, focused on tricky cases
Au Nord|O de la ville se situe la rivière
L'Indre|O prend sa source dans les montagnes
Le département du Nord|Department est situé ...
L'Ain|Department est un département ...

# In the test set
Ain|Department, Indre|Department # Some lists
La ville de ... en Ardèche|Department # Some other expressions that were not seen in training
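The `token|Label` notation used in this sample is shorthand, not a format a tagger reads directly. A minimal sketch of how it could be expanded into per-token labels follows, assuming naive whitespace tokenization and that any token without a `|` suffix is labeled O. The function name is hypothetical.

```python
def parse_annotated(line, default="O"):
    """Split a 'token|Label' annotated line into (token, label) pairs."""
    pairs = []
    for chunk in line.split():
        token, sep, label = chunk.rpartition("|")
        if sep:
            pairs.append((token, label))    # explicitly annotated token
        else:
            pairs.append((chunk, default))  # unannotated token defaults to O
    return pairs


for token, label in parse_annotated("L'Ain|Department est un département"):
    print(token, label)
```

This sketch does not strip trailing punctuation or `#` comments, so list-style lines with commas would need cleaning first.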

I don't know whether this will improve the results, but I think it will. I will post some results.

Gregory-Howard commented 7 years ago

It did not work :( It seems the words before and after matter more than the word itself.

Franck-Dernoncourt commented 7 years ago

Too bad :( Thanks for the follow up!
