anHALytics / anhalytics-core

Analytic platform for the HAL research archive (in development)
Apache License 2.0

Improve Nerd standoff #37

Open Aazhar opened 8 years ago

Aazhar commented 8 years ago

Currently, we select the first paragraphs, but it would be more fruitful to calculate the most significant concepts rather than pick them randomly.

kermitt2 commented 8 years ago

Each tool has its dedicated usage and should not be used for another purpose:

Aazhar commented 8 years ago

Sure, but so far we're taking the first paragraphs (not necessarily the title and the abstract). What I meant is that, knowing that improvements still have to be made to NERD, we can set a threshold (for instance the average per article) on the nerd_score and conf_score to avoid badly disambiguated context.
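
A minimal sketch of what such a per-article threshold could look like. The annotation layout (a list of dicts carrying `mention`, `nerd_score` and `conf_score`) and the function name are assumptions made for illustration, not the actual NERD output format or anHALytics code:

```python
from statistics import mean


def filter_by_article_average(annotations):
    """Keep annotations whose nerd_score and conf_score both reach the article average.

    `annotations` is a hypothetical list of dicts for a single article.
    """
    if not annotations:
        return []
    # Thresholds are computed per article, as suggested in the comment above.
    nerd_avg = mean(a["nerd_score"] for a in annotations)
    conf_avg = mean(a["conf_score"] for a in annotations)
    return [
        a for a in annotations
        if a["nerd_score"] >= nerd_avg and a["conf_score"] >= conf_avg
    ]


# Made-up example values, only to show the filtering behaviour.
annotations = [
    {"mention": "HAL", "nerd_score": 0.9, "conf_score": 0.8},
    {"mention": "archive", "nerd_score": 0.7, "conf_score": 0.6},
    {"mention": "demo", "nerd_score": 0.2, "conf_score": 0.1},
]
kept = filter_by_article_average(annotations)
print([a["mention"] for a in kept])
```

With an average-based cut, the low-scoring annotation is dropped while the two above-average ones are kept; a fixed global threshold would be a simpler alternative.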

kermitt2 commented 8 years ago

We were taking the first paragraphs only because of time constraints for the demo last year! We should take the whole document for NERD… I thought I had changed it at some point to take the whole document.

NERD does not weight the concepts in terms of significance; it's grobid-keyterm that does that, using various distributional information. NERD disambiguates locally and tries to disambiguate all mentions. We can set a different threshold while indexing NERD annotations, for instance if we want to improve precision, but there will always be some noise at this level. The point is that for semantic search it's the accumulation of the matches that sets the scores (tf/idf or BM25), so it should be robust to noise from a ranking perspective.
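
To illustrate the accumulation argument, here is a small self-contained sketch of BM25 scoring over concept identifiers. It is not the scoring used by the search engine behind anHALytics; the concept ids, the toy documents and the idf variant (a common non-negative form) are assumptions chosen for the example:

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of concept ids) against a list of query concepts."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each concept appears.
    df = Counter(c for d in docs for c in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for c in query:
            if tf[c] == 0:
                continue
            idf = math.log(1 + (n - df[c] + 0.5) / (df[c] + 0.5))
            norm = tf[c] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[c] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical concept annotations for three documents.
docs = [
    ["c1", "c1", "c1", "c2", "c3"],  # repeatedly annotated with c1
    ["c1", "c4", "c5", "c6", "c7"],  # a single, possibly spurious, c1 annotation
    ["c8", "c9", "c4", "c5", "c6"],  # no c1 annotation
]
scores = bm25_scores(["c1"], docs)
print(scores)
```

The document with several matching annotations ranks above the one with a single match, so an isolated wrong disambiguation lowers precision without dominating the ranking.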

It is maybe a bit different for query disambiguation: less context and more sensitivity to noise. Currently the pruning thresholds are the same, but they could be refined based on experiments, depending on the mode of usage…

For the facets, concepts and categories from the keyterm annotator make more sense than NERD annotations because they are already a selection of the key aspects of a document.