dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
https://dkpro.github.io/dkpro-c4corpus
Apache License 2.0

Fix O(n!) in tag depth issue #28

Open tfmorris opened 8 years ago

tfmorris commented 8 years ago

This fixes the three issues mentioned above:

tfmorris commented 8 years ago

I've revised this PR to complete the fix for #27 and also fix #29 & #30.

tfmorris commented 8 years ago

I've added an example of the output from the new version for people to look at:

https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt

Subjectively (and with a sample size of 1), the new version seems substantially better, although that wasn't my primary goal initially. The word count went from 3168 to 5641 as compared to 5804 for the gold standard (and 3560 for the Python version).
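For a rough sense of scale, the counts quoted above can be expressed as a share of the gold-standard length. The sketch below is only arithmetic on those four numbers; the class and method names are made up for illustration, and a word-count ratio says nothing about whether the right words were kept.

```java
import java.util.Locale;

public class WordCountCoverage {
    // Word count of an extraction as a percentage of the gold-standard word count.
    static double percentOfGold(int words, int goldWords) {
        return 100.0 * words / goldWords;
    }

    public static void main(String[] args) {
        int gold = 5804;          // gold standard, as quoted above
        int oldJava = 3168;       // version before this PR
        int newJava = 5641;       // version from this PR
        int python = 3560;        // Python version

        System.out.printf(Locale.ROOT, "old Java: %.1f%%%n", percentOfGold(oldJava, gold));
        System.out.printf(Locale.ROOT, "new Java: %.1f%%%n", percentOfGold(newJava, gold));
        System.out.printf(Locale.ROOT, "Python:   %.1f%%%n", percentOfGold(python, gold));
    }
}
```

On this one document that works out to roughly 55% of the gold-standard length before the change, 97% after it, and 61% for the Python version.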

One issue that I think still needs to be fixed is mid-word tag boundaries: every text segment gets added with a space before it, so a word that spans a tag boundary is split in two. This exists in both the current and new versions. I'm actually not convinced either way (adding the space or not) is conclusively better than the other, so I'm inclined to leave this unchanged.
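The trade-off can be shown with a minimal sketch. This is not the project's code: the class, the method names, and the sample markup are hypothetical, and the text segments are written out by hand rather than produced by an HTML parser.

```java
import java.util.Arrays;
import java.util.List;

public class SegmentJoinSketch {
    // Mimics the behaviour described above: each text segment is appended
    // with a space in front of it.
    static String joinWithLeadingSpace(List<String> segments) {
        StringBuilder sb = new StringBuilder();
        for (String segment : segments) {
            sb.append(' ').append(segment);
        }
        return sb.toString().trim();
    }

    // The alternative: concatenate segments exactly as they appear in the markup.
    static String joinVerbatim(List<String> segments) {
        return String.join("", segments);
    }

    public static void main(String[] args) {
        // Hypothetical markup: <p>un<b>believ</b>able story</p>
        List<String> midWord = Arrays.asList("un", "believ", "able story");
        System.out.println(joinWithLeadingSpace(midWord)); // un believ able story
        System.out.println(joinVerbatim(midWord));         // unbelievable story

        // Hypothetical markup: <p>First line<br>second line</p>
        List<String> lineBreak = Arrays.asList("First line", "second line");
        System.out.println(joinWithLeadingSpace(lineBreak)); // First line second line
        System.out.println(joinVerbatim(lineBreak));         // First linesecond line
    }
}
```

Adding the space splits words at inline tags, while omitting it glues words together wherever a tag was the only separator, which is why neither choice is clearly better without looking at what kind of tag sits at the boundary.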

tfmorris commented 8 years ago

I added a fix for #36 and fixed some other issues. This still needs to be rebased against the current master, and it is missing a couple of later commits that significantly improved performance. I'll hold off on doing any more work on this as a separate task unless there's interest in reviewing it.

I decided that I, personally, wanted something closer to the Python JusText implementation because it's simpler, easier to understand, and performs better. If you guys want to stick closer to the existing Java, I can help fix some of the most egregious problems with it.

If you want to go the route of aligning with the Python implementation, a bunch of the intermediary stuff that I did can be squashed/eliminated because it's not relevant.