Fix O(n!) in tag depth issue

tfmorris commented 8 years ago

This fixes the three issues mentioned above:

27 - Allowing deeply nested document to be processed as well as speeding up processing in general. Rather than continually backtracking to all element ancestors and keeping the entire path, causing problems with O(n!) in both space & time, it now only keeps the most recent tag and directly tracks whether it's in a header or not.
As a related item, <a>, <span>, <em>, and <strong> were added to the list of "inline text" tags, so that the tags output in minimal HTML mode are more useful (and match more closely with the CleanEval gold standard).
29 - Handle non-UTF-8 encoded HTML files in the standalone boilerplate program. This is likely to significantly improve the CleanEval scores, so they should probably be re-run on the whole CleanEval corpus.
30 - Decode HTML entities - This is another improvement which makes the text more useful and which makes it more closely match the gold standard. I adjusted the © entity reference, but if there are other sections of code which assumes HTML entity-encoded text, they may need to be cleaned up too.

tfmorris commented 8 years ago

I've revised this PR to complete the fix for #27 and also fix #29 & #30.

tfmorris commented 8 years ago

I've added an example of the output from the new version for people to look at:

https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt

Subjectively (and with a sample size of 1), the new version seems substantially better, although that wasn't my primary goal initially. The word count went from 3168 to 5641 as compared to 5804 for the gold standard (and 3560 for the Python version).

One issue that I think still needs to be fixed is mid-word tag boundaries, because every text segment gets added with a space before it, splitting these words. This exists in both the current and new versions. I'm actually not convinced either way is conclusively better than the other, so I'm inclined to leave this unchanged.

tfmorris commented 8 years ago

I added a fix for #36 and fixed some other issues, but this needs to be rebased against the current master and is missing a couple of later commits that significantly improved performance, but I'll hold off on doing any more work on this as a separate task unless there's interest in reviewing it.

I decided that I, personally, wanted something closer to the Python JusText implementation because it's simpler, easier to understand, and performs better. If you guys want to stick closer to the existing Java, I can help fix some of the most egregious problems with it.

If you want to go the route of aligning with the Python implementation, a bunch the intermediary stuff that I did can be squashed/eliminated because it's not relevant.

dkpro / dkpro-c4corpus

Fix O(n!) in tag depth issue #28

29 - Handle non-UTF-8 encoded HTML files in the standalone boilerplate program. This is likely to significantly improve the CleanEval scores, so they should probably be re-run on the whole CleanEval corpus.

30 - Decode HTML entities - This is another improvement which makes the text more useful and which makes it more closely match the gold standard. I adjusted the `©` entity reference, but if there are other sections of code which assumes HTML entity-encoded text, they may need to be cleaned up too.