101companies / 101dev

Tools and documentation for 101companies
http://101companies.org
GNU General Public License v3.0

HTML fact extractor does not support unmarked single tags such as <br> #30

Open todeslord opened 10 years ago

todeslord commented 10 years ago

The HTML fact extractor does not support unmarked single tags such as <br>. The parser in use cannot distinguish start tags from single tags: the SAX parser does not handle single tags correctly, so a <br> leads to a wrong fragment file, whereas <br/> is handled correctly.

Possible solutions:
- Find another parser.
- Write a parser that finds single tags.
- Use a preprocessor that converts single tags to the <br/> style.
...
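
As a rough illustration of the "parser that finds single tags" option, here is a minimal sketch in Python. It assumes that "single tags" means the void elements of the HTML standard, and it uses the lenient html.parser module from the standard library. The class name SingleTagFinder and the helper find_single_tags are hypothetical and do not come from the 101companies code base.

```python
from html.parser import HTMLParser

# Assumption: "single tags" are the void elements of the HTML standard,
# i.e. elements that never take a closing tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class SingleTagFinder(HTMLParser):
    """Hypothetical helper: collects every void element in a document."""

    def __init__(self):
        super().__init__()
        # Entries are (tag name, line, column) as reported by getpos().
        self.single_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports <br> as a start tag; the default handler for
        # <br/> also ends up here, so both spellings are recorded.
        if tag in VOID_ELEMENTS:
            self.single_tags.append((tag, *self.getpos()))


def find_single_tags(html):
    """Return the names of all single tags in document order."""
    finder = SingleTagFinder()
    finder.feed(html)
    finder.close()
    return [entry[0] for entry in finder.single_tags]
```

A lenient parser like this does not fail on an unclosed <br>, so it could be used either to replace the SAX parser or only to locate the tags that need special treatment.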

Issue from the Fact Extraction paper of June 22.

todeslord commented 10 years ago

I have implemented a workaround: a preprocessing function that substitutes <br> with <br />. It is not really elegant, but it works. We have to keep in mind that other single tags may cause the same trouble in the future.

https://github.com/101companies/101repo/commit/19dd1d83fa8bb8127a50892012356849bf0b0e0e
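
The sketch below shows one way the substitution could be generalized to other single tags. It is not the code from the linked commit. It assumes Python, treats the void elements of the HTML standard as the set of single tags, and uses xml.sax only to check the result; close_void_tags and is_well_formed are hypothetical names. The regular expression is approximate: it does not handle attribute values that contain ">", nor tags inside comments or scripts.

```python
import re
import xml.sax

# Assumption: the troublesome single tags are the HTML void elements.
VOID_ELEMENTS = ("area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr")

# Matches a void tag with optional attributes, whether or not it is
# already written in the self-closing style.
_VOID_TAG = re.compile(
    r"<(" + "|".join(VOID_ELEMENTS) + r")\b([^<>]*?)\s*/?>",
    re.IGNORECASE,
)


def close_void_tags(html):
    """Rewrite <br>, <br/> and similar tags to the '<br />' style."""
    return _VOID_TAG.sub(
        lambda m: "<{}{} />".format(m.group(1), m.group(2)), html)


def is_well_formed(xml_text):
    """Return True if a SAX parser accepts the text without an error."""
    try:
        xml.sax.parseString(xml_text.encode("utf-8"),
                            xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


if __name__ == "__main__":
    print(close_void_tags("<p>a<br>b</p>"))  # <p>a<br />b</p>
```

Keeping the tag names in one tuple means a newly discovered single tag only needs to be added to VOID_ELEMENTS.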