EEXCESS / c4

C4 - Cultural and sCientific Content in Context - is the EEXCESS context detection framework written in JavaScript. It provides supporting functionality to enable easy user mining and querying for all EEXCESS clients. It supports, for example, Named Entity Recognition using the DOSeR service, paragraph detection, and citation building.
http://eexcess.eu/

[Paragraph Detection] Should detected paragraphs be allowed to be nested? #2

Open chseifert opened 9 years ago

chseifert commented 9 years ago

Nested paragraphs are detected sometimes. See https://github.com/EEXCESS/jarvis/issues/6 for an example.
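To make the nesting question concrete, here is a minimal sketch of how nested detections could be flagged. It assumes each detected paragraph is represented as a `{start, end}` character range; that representation and the `findNested` helper are hypothetical and not part of the c4 API.

```javascript
// Hypothetical representation: each detected paragraph is a {start, end}
// character range in the page text. This is an assumption, not the c4 API.
function findNested(paragraphs) {
  const nested = [];
  paragraphs.forEach((inner, i) => {
    paragraphs.forEach((outer, j) => {
      // "outer" contains "inner" if it spans the whole inner range.
      const contains = outer.start <= inner.start && inner.end <= outer.end;
      // Identical ranges are duplicates rather than nesting, so skip them.
      const identical = outer.start === inner.start && outer.end === inner.end;
      if (i !== j && contains && !identical) {
        nested.push({ inner: i, outer: j });
      }
    });
  });
  return nested;
}

// Usage: the second range lies entirely inside the first one.
const detected = [
  { start: 0, end: 100 },
  { start: 10, end: 40 },
  { start: 120, end: 200 },
];
console.log(findNested(detected));
```

The pairwise comparison is quadratic in the number of paragraphs, which should be acceptable for the handful of paragraphs found on a typical page.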

mgrani commented 9 years ago

The paragraphs also differ widely in granularity: some are short, some are larger. From my POV this is OK, but it influences query accuracy. Maybe we need to decompose paragraphs further, using a window of 1-3 sentences, to obtain good query terms.
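A rough sketch of the sentence-window idea, assuming a naive regex-based sentence splitter. Both `splitSentences` and `sentenceWindows` are hypothetical names, and a real implementation would need a more robust splitter (abbreviations, decimal numbers, etc.).

```javascript
// Hypothetical helper: naive sentence splitting on ., ! and ?.
// Abbreviations such as "e.g." would be split incorrectly.
function splitSentences(text) {
  const matches = text.match(/[^.!?]+[.!?]*/g) || [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Hypothetical helper: overlapping windows of `size` consecutive sentences,
// advancing one sentence at a time.
function sentenceWindows(text, size = 3) {
  const sentences = splitSentences(text);
  if (sentences.length === 0) return [];
  // A paragraph no longer than the window is returned as a single window.
  if (sentences.length <= size) return [sentences.join(' ')];
  const windows = [];
  for (let i = 0; i + size <= sentences.length; i += 1) {
    windows.push(sentences.slice(i, i + size).join(' '));
  }
  return windows;
}

// Usage: a four-sentence paragraph yields three windows of two sentences.
console.log(sentenceWindows('A one. B two. C three. D four.', 2));
```

Each window could then be passed to query-term extraction on its own, instead of the full paragraph.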

schloett commented 9 years ago

Jarvis uses an adaptation of an earlier version of the paragraph detection. In the current version, this bug should not be present. But I agree that it would make sense to subdivide long paragraphs. However, I am not sure whether a fixed-length window would be appropriate or whether other features could be exploited. For the task of finding subparagraphs, the markup might provide indicators; for the task of query generation, filtering the keywords (removing outliers) might be another solution.
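One possible reading of the keyword-filtering idea, as a sketch only: treat a candidate term as an outlier when it occurs fewer than `minCount` times in the paragraph. That definition of "outlier", the `filterKeywords` name, and the `minCount` parameter are all assumptions for illustration; other criteria (e.g. semantic relatedness between terms) may fit better.

```javascript
// Assumption: an "outlier" is a candidate term that occurs fewer than
// minCount times (case-insensitively) among the paragraph's candidates.
function filterKeywords(candidates, minCount = 2) {
  const counts = new Map();
  for (const term of candidates) {
    const key = term.toLowerCase();
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, n]) => n >= minCount) // drop rare terms
    .sort((a, b) => b[1] - a[1])      // most frequent first
    .map(([term]) => term);
}

// Usage: terms seen only once ("opera", "train") are dropped.
console.log(
  filterKeywords(['Vienna', 'vienna', 'Mozart', 'opera', 'Mozart', 'Vienna', 'train'])
);
```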