Open chseifert opened 9 years ago
The paragraphs also differ a lot in granularity: some are short, some are larger. From my POV this is OK, but it influences the query accuracy. Maybe we need to decompose paragraphs further into windows of 1-3 sentences to obtain good query terms.
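To make the window idea concrete, here is a minimal sketch of splitting a paragraph into non-overlapping windows of a few sentences. It is not Jarvis code: the function names `split_sentences` and `sentence_windows` are hypothetical, and the regex-based sentence splitter is a naive assumption that will mis-split abbreviations such as "e.g.".

```python
import re


def split_sentences(paragraph):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def sentence_windows(paragraph, size=3):
    """Group the sentences of a paragraph into windows of `size` sentences.

    The last window may hold fewer sentences than `size`.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    sentences = split_sentences(paragraph)
    return [
        " ".join(sentences[i:i + size])
        for i in range(0, len(sentences), size)
    ]
```

Each returned window could then be passed to the query term extraction in place of the full paragraph, so that long paragraphs yield several focused queries instead of one diluted query.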
Jarvis uses an adaptation of an earlier version of the paragraph detection. In the current version, this bug should not be present. But I agree that it would make sense to subdivide long paragraphs. However, I am not sure whether a fixed-length window would be appropriate or whether other features could be exploited. For finding subparagraphs, the markup might provide indicators; for query generation, filtering the keywords (removing outliers) might be another solution.
Nested paragraphs are detected sometimes. See https://github.com/EEXCESS/jarvis/issues/6 for an example.