Improve Nerd standoff - Githubissues

Aazhar commented 8 years ago

Currently, we select the first paragraphs, but it would be more fruitful to compute the most significant concepts rather than pick them arbitrarily.

kermitt2 commented 8 years ago

Each tool has its dedicated usage and should not be used for another purpose:

the keyterm extractor extracts the most significant/discriminant key terms, key concepts and Wikipedia categories from an article as compared to the background collection. E.g. for building facets of the most interesting concepts, this or the Wikipedia categories should be used.
the NERD is dedicated to the exhaustive annotation of the concepts in a document to enable semantic search, so it has to be used for search in the same way as the usual terms (the stems). The fact that only the abstract and the first paragraph were used before was simply to save time given the deadline of the senate demo in February 2015 ;) The idea is to run it on the whole textual content in order to combine structural search, term search and semantic search.

Aazhar commented 8 years ago

Sure, but so far we're taking the first paragraphs (not necessarily the title and the abstract). What I meant is that, knowing that improvements still have to be made on NERD, we can set a threshold (for instance the average per article) on the nerd_score and conf_score to avoid badly disambiguated concepts.
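To make the suggestion concrete, here is a minimal sketch of a per-article average threshold. Only the names nerd_score and conf_score come from the discussion; the dictionary layout, the "concept" field and the function name are assumptions for illustration, not the actual NERD output format.

```python
from statistics import mean


def filter_by_article_average(annotations):
    """Keep annotations whose nerd_score and conf_score both reach the article mean.

    `annotations` is a hypothetical list of dicts, one per NERD annotation
    of a single article.
    """
    if not annotations:
        return []
    nerd_avg = mean(a["nerd_score"] for a in annotations)
    conf_avg = mean(a["conf_score"] for a in annotations)
    return [
        a for a in annotations
        if a["nerd_score"] >= nerd_avg and a["conf_score"] >= conf_avg
    ]


# Made-up annotations for one article.
article = [
    {"concept": "Senate", "nerd_score": 0.9, "conf_score": 0.8},
    {"concept": "Paris", "nerd_score": 0.2, "conf_score": 0.3},
    {"concept": "Semantic search", "nerd_score": 0.7, "conf_score": 0.7},
]
kept = filter_by_article_average(article)
```

With these made-up scores both averages are about 0.6, so the low-scoring "Paris" annotation is dropped and the two others are kept. Note that an average-based cut always discards part of the annotations, even in a well-disambiguated article.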

kermitt2 commented 8 years ago

We were taking the first paragraphs just because of time constraints for the demo last year! We should take the whole document for the NERD… I thought I had changed it at some point to take the whole document.

NERD is not weighting the concepts in terms of significance; it's grobid-keyterm which is doing that, using various distributional information. NERD disambiguates locally and tries to disambiguate all mentions. We can set a different threshold while indexing NERD annotations, for instance if we want to improve precision, but there will always be some noise at this level. The point is that for semantic search it's the accumulation of the matches that sets the scores (tf/idf or BM25), so it should be robust to noise from a ranking perspective.
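As a rough illustration of the accumulation argument, here is a toy BM25 scorer over concept identifiers. It is only a sketch: the concept IDs, the documents and the function name are invented, and the formula is the common textbook BM25 variant rather than whatever the search engine behind the index actually computes.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each document (a list of concept IDs) against the query concepts."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each concept appears.
    df = Counter(c for d in docs for c in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for c in query:
            if tf[c] == 0:
                continue
            idf = math.log(1 + (n - df[c] + 0.5) / (df[c] + 0.5))
            norm = tf[c] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[c] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    # On-topic document: repeated correct annotations plus one wrong one.
    ["Q_search", "Q_search", "Q_search", "Q_index", "Q_noise"],
    # Off-topic document with a single spurious "Q_search" annotation.
    ["Q_search", "Q_bird", "Q_tree", "Q_river", "Q_lake"],
    # Off-topic document without any match.
    ["Q_bird", "Q_tree", "Q_river", "Q_lake", "Q_hill"],
]
scores = bm25_scores(docs, ["Q_search", "Q_index"])
```

The on-topic document accumulates several matches and ranks first, the document with one spurious annotation gets a small score, and the last one scores zero. A single wrong annotation therefore moves the ranking very little.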

It is maybe a bit different for query disambiguation: there is less context and it is more sensitive to noise. Currently the pruning thresholds are the same, but they could be refined based on experiments depending on the mode of usage…

For the facets, concepts and categories from the keyterm annotator make more sense than NERD annotations because they are already a selection of the key aspects of a document.

anHALytics / anhalytics-core

Improve Nerd standoff #37