freme-project / freme-ner


Importance of Entity Counts and Initializing Them for Disambiguation in Solr Index #128

Closed munnellg closed 8 years ago

munnellg commented 8 years ago

Based on my understanding of how Freme works, disambiguation candidates are selected by comparing surface forms to entity labels. The most frequent sense of an entity, which is to say the entity sense that has been spotted most often in the collection, is selected as the final disambiguation. So our disambiguation is entirely based on the most popular entities in our network? In other words, if we have two entries for Brad Pitt - the actor and the boxer - Freme will always pick the actor irrespective of the context of the article because there will be more stuff written about him in the literature?

That means the count parameter in Solr is really important. Doesn't that get a value of 1 for every label loaded from DBpedia though? So every candidate entity disambiguation is equally probable for a given surface form? Are we going to have to find a way to bias that?
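To make the concern concrete, here is a minimal sketch of most-frequent-sense selection. It is not FREME's actual code: the `Candidate` class, the `ex:` URIs, and the counts are all hypothetical, and label matching is simplified to a case-insensitive comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    uri: str    # hypothetical entity identifier
    label: str  # entity label compared against the surface form
    count: int  # how often this sense was spotted in the collection


def matching(surface_form, candidates):
    """Return the candidates whose label equals the surface form (case-insensitive)."""
    wanted = surface_form.lower()
    return [c for c in candidates if c.label.lower() == wanted]


def most_frequent_sense(surface_form, candidates):
    """Pick the matching candidate with the highest count, or None if nothing matches."""
    matches = matching(surface_form, candidates)
    if not matches:
        return None
    # max() returns the first maximal element, so ties fall back to input order.
    return max(matches, key=lambda c: c.count)


def is_tied(surface_form, candidates):
    """True when several candidates match and all of them share the same count."""
    counts = [c.count for c in matching(surface_form, candidates)]
    return len(counts) > 1 and len(set(counts)) == 1
```

If every label is loaded with a count of 1, `is_tied` is true for every ambiguous surface form, and the result of `most_frequent_sense` depends only on the order in which the candidates happen to arrive.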

sandroacoelho commented 8 years ago

Hi Gary,

> Based on my understanding of how Freme works, disambiguation candidates are selected by comparing surface forms to entity labels. The most frequent sense of an entity, which is to say the entity sense that has been spotted most often in the collection, is selected as the final disambiguation. So our disambiguation is entirely based on the most popular entities in our network? In other words, if we have two entries for Brad Pitt - the actor and the boxer - Freme will always pick the actor irrespective of the context of the article because there will be more stuff written about him in the literature?

Yes. FREME-NER uses a two-step method to extract and link entities. Our NER models extract candidates (a.k.a. "surface forms"), and then we use Solr (TF-IDF) to link them based on the occurrences stored there.

> That means the count parameter in Solr is really important. Doesn't that get a value of 1 for every label loaded from DBpedia though? So every candidate entity disambiguation is equally probable for a given surface form? Are we going to have to find a way to bias that?

If we had access to short text passages related to the links, we could improve our disambiguation module considerably. We could use those passages to select the most suitable link. This task is easy to do if we have that data.
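As a rough illustration of that idea, the sketch below scores each candidate link by the word overlap (Jaccard similarity) between the article text and a short passage describing the link. This is only one possible approach and not something FREME-NER implements; the function names, the `ex:` URIs, and the passages are invented for the example.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_by_context(article_text, passages_by_uri):
    """Return the URI whose passage overlaps most with the article text.

    passages_by_uri maps a candidate URI to a short text about that entity.
    Returns None when there are no candidates.
    """
    if not passages_by_uri:
        return None
    article = tokens(article_text)
    return max(
        passages_by_uri,
        key=lambda uri: jaccard(article, tokens(passages_by_uri[uri])),
    )


# Illustrative data only; the passages are made up for this example.
passages = {
    "ex:Brad_Pitt_actor": "actor and film producer known for Hollywood movies",
    "ex:Brad_Pitt_boxer": "boxer who competed in the heavyweight boxing division",
}
article = "Pitt won the boxing match by knockout in the third round"
best = pick_by_context(article, passages)
```

A plain token overlap is easy to fool with stop words, so a real version would probably weight terms (for example with TF-IDF) before comparing.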

munnellg commented 8 years ago

Cool. I think it's important to be aware of what Freme is doing out of the box and what modifications we can make to enhance the quality of our results.

Arguably what we have on our server at the moment could be considered a minimum effort installation for what we're trying to achieve. Making Freme a bit smarter will require a few more inputs.

Thanks so much for the clarification. It's really interesting learning about this!

jnehring commented 8 years ago

If you want to implement another disambiguation method, you can use FREME NER's numLinks parameter. It returns multiple links for each entity, so you can easily try out another disambiguation method on the client side without hacking the FREME NER code. Once your disambiguation method works well, it would be great to have it implemented in FREME NER, though.
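A client-side experiment could look roughly like the sketch below. It assumes the response has already been parsed into a dictionary mapping each mention to its list of candidate links, with the service's preferred link first; that structure, the function names, and the `ex:` URIs are assumptions for illustration, not the actual FREME NER output format.

```python
def rerank(candidates_by_mention, choose):
    """Apply a custom chooser to every mention's candidate links.

    choose(mention, candidates) should return one of the candidates. If it
    returns anything else (for example None), the first candidate is kept.
    """
    result = {}
    for mention, candidates in candidates_by_mention.items():
        if not candidates:
            continue  # nothing to link for this mention
        picked = choose(mention, candidates)
        result[mention] = picked if picked in candidates else candidates[0]
    return result


def changed_mentions(candidates_by_mention, reranked):
    """List the mentions where the custom choice differs from the first candidate."""
    return sorted(
        mention
        for mention, uri in reranked.items()
        if uri != candidates_by_mention[mention][0]
    )
```

Comparing the reranked links with the first-ranked ones shows at a glance which mentions the new method would change, which is a convenient starting point for a manual review.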

jnehring commented 8 years ago

The questions are answered, so I am closing this issue.