[Paragraph Detection] fails on twitter (when logged in)

chseifert commented 8 years ago

The following funny keywords are extracted on a Twitter page (https://twitter.com/eexcess) IF the user is LOGGED IN.

The HTML code of the page (IF logged in) is attached: twitter_html.txt

schloett commented 8 years ago

What's so funny about these keywords? ;)

Checking for visibility unfortunately is a real performance killer. I've integrated a version into c4 which seems acceptable in terms of computing time (up to ~200ms, only tested with a few pages), but is not perfectly accurate (it works fine for the provided twitter example). Checking for visibility can be switched off by passing the corresponding options parameter to the paragraphDetection method.

Some numbers to get an intuition:

~20ms without check for visibility
~200ms with current implementation
~2000ms with in-depth check (even this last one does not capture opacity or similar)
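To illustrate the trade-off between the three variants, here is a minimal sketch. It is not the code in c4: the `StyledNode` type, the `VisibilityCheck` modes and the `filterParagraphs` function are hypothetical names made up for this example, and plain objects stand in for DOM elements so the snippet runs without a browser. The shallow mode only looks at the node itself, while the deep mode also walks up the ancestors, which is roughly why an in-depth check gets so much more expensive.

```typescript
// Hypothetical stand-in for a DOM element: only the fields this sketch reads.
interface StyledNode {
  display?: string;     // the node's own "display" value
  visibility?: string;  // the node's computed "visibility" value
  parent?: StyledNode;  // undefined at the root
}

// Hypothetical option values; the real options parameter in c4 may look different.
type VisibilityCheck = "none" | "shallow" | "deep";

function isVisible(node: StyledNode, mode: VisibilityCheck): boolean {
  // Check switched off: every node counts as visible.
  if (mode === "none") return true;

  // Shallow check: only the node's own style is inspected.
  if (node.display === "none" || node.visibility === "hidden") return false;
  if (mode === "shallow") return true;

  // Deep check: an ancestor with display "none" hides the whole subtree.
  for (let n = node.parent; n !== undefined; n = n.parent) {
    if (n.display === "none") return false;
  }
  return true;
}

// Keep only the candidate paragraphs that pass the chosen check.
function filterParagraphs(nodes: StyledNode[], mode: VisibilityCheck): StyledNode[] {
  return nodes.filter((n) => isVisible(n, mode));
}

// Usage: a paragraph inside a hidden container is only caught by the deep check.
const hiddenContainer: StyledNode = { display: "none" };
const insideHidden: StyledNode = { display: "block", parent: hiddenContainer };
const plainParagraph: StyledNode = { display: "block" };
const candidates = [insideHidden, plainParagraph];

const withoutCheck = filterParagraphs(candidates, "none");
const shallowCheck = filterParagraphs(candidates, "shallow");
const deepCheck = filterParagraphs(candidates, "deep");
```

Like the in-depth check mentioned above, this sketch ignores opacity, off-screen positioning and similar cases.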

schloett commented 8 years ago

@chseifert could you check whether this issue is fixed and, if so, close it?

chseifert commented 8 years ago

Reactivated my twitter account and checked. Works :)

EEXCESS / c4

[Paragraph Detection] fails on twitter (when logged in) #21