dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
https://dkpro.github.io/dkpro-c4corpus
Apache License 2.0
50 stars 8 forks source link

Make Java JusText implementation match Python and/or document differences #37

Open tfmorris opened 8 years ago

tfmorris commented 8 years ago

The differences between the Java and Python implementations were explained as largely an artifact of different XML parsers in a reply to #23, but I think there's more to it than that. I think the differences in the output of the two implementations should be explainable, and preferably should be improvements.

Some differences that I know of (some have already been reported as bugs) include:

Bottom line - what's implemented is not the JusText algorithm as documented.

tfmorris commented 8 years ago

Another significant difference is the paragraph detection/segmentation.

The Python implementation uses a very simple algorithm. Any "block level" element starts a new paragraph on both its start tag and its end tag. All text seen until another new paragraph is started gets added. Empty paragraphs are dropped.

The Java implementation uses a complicated backtracking algorithm that attempts to find a common ancestor between two nodes, then inspects all the elements along the paths to the common ancestor. This is O(n!) in tree depth which normally isn't a huge deal (although still wasteful), but in pathological deeply nested cases it can cause mappers to never complete.

I'm not sure what problem, real or imagined, the Java implementation was designed to solve, but the simple Python version seems to perform just fine (and in linear time & space).

habernal commented 8 years ago

Thanks for these very helpful insights! I agree that the implementation might be far from being optimal. Deviations are there, indeed, but since the results were comparable to the Python implementation, we didn't spend that much time on fine tuning (also given constrained project resources). I'm leaving that issue open to be considered for the next release.

reckart commented 8 years ago

We're happy about any contributions :)

tfmorris commented 8 years ago

I'm not sure the current results are comparable to the Python version. In my cursory spot checks, I saw some significant differences.

Did you look, for example, at https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt ? It currently preserves line breaks, which makes the diff difficult, but comparing Gold, Python, old Java, & new Java by eye shows some pretty significant differences.

BTW, in reference to my previous comment about ancestor processing for paragraph detection, I think I might have found the motivation for the fancy paragraph processing in the Java version. It looks suspiciously like this description of paragraphs from the HTML5 spec:

https://www.w3.org/TR/html/dom.html#paragraph