EEXCESS / c4

C4 - Cultural and sCientific Content in Context - is the EEXCESS context detection framework written in JavaScript. It provides supporting functionality to enable easy user mining and querying for all EEXCESS clients. It supports, for example, Named Entity Recognition using the DOSeR service, paragraph detection, and citation building.
http://eexcess.eu/

[Paragraph Detection] fails on twitter (when logged in) #21

Closed chseifert closed 8 years ago

chseifert commented 8 years ago

The following funny keywords are extracted on a Twitter page (https://twitter.com/eexcess) IF the user is LOGGED IN

image001

The HTML code of the page (IF logged in) is attached: twitter_html.txt

schloett commented 8 years ago

What's so funny about these keywords? ;)

twitter

Checking for visibility is unfortunately a real performance killer. I've integrated a version into c4 that seems acceptable in terms of computing time (up to ~200 ms, only tested with a few pages) but is not perfectly accurate (it works fine for the provided Twitter example). Checking for visibility can be switched off by passing the corresponding options parameter to the paragraphDetection method.

Some numbers to get an intuition:

~20 ms without the check for visibility
~200 ms with the current implementation
~2000 ms with an in-depth check (even this last one does not capture opacity or similar)
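To illustrate the trade-off described above, here is a minimal sketch of a cheap and a more thorough visibility check. It is not the c4 implementation: the function names (`hasLayoutBox`, `isVisibleDeep`, `filterVisible`) and the `check` / `getStyle` options are hypothetical and chosen only for this example.

```javascript
// Illustrative helpers only; these names are hypothetical and not part of c4.

// Cheap check: an element without a layout box (for example because it or an
// ancestor has display: none) reports 0 for both offsetWidth and offsetHeight.
function hasLayoutBox(el) {
  return el.offsetWidth > 0 || el.offsetHeight > 0;
}

// In-depth check: inspect computed styles. `getStyle` can be injected for
// testing; in a browser it falls back to window.getComputedStyle.
function isVisibleDeep(el, getStyle) {
  const styleOf = getStyle || ((node) => window.getComputedStyle(node));
  // visibility is inherited, so the element's own computed value is enough.
  if (styleOf(el).visibility === 'hidden') return false;
  // display is not inherited, so every ancestor element has to be checked.
  for (let node = el; node && node.nodeType === 1; node = node.parentNode) {
    if (styleOf(node).display === 'none') return false;
  }
  return true;
}

// Keep only the elements that pass the selected check
// ('none', 'cheap' or 'deep'; 'cheap' is the default).
function filterVisible(elements, options) {
  const opts = Object.assign({ check: 'cheap', getStyle: null }, options);
  if (opts.check === 'none') return Array.from(elements);
  const test = opts.check === 'deep'
    ? (el) => isVisibleDeep(el, opts.getStyle)
    : hasLayoutBox;
  return Array.from(elements).filter(test);
}
```

In a browser this could be called as `filterVisible(document.querySelectorAll('p'), { check: 'deep' })`. The cheap variant misses elements hidden with `visibility: hidden`, since those still occupy a layout box, and neither variant looks at opacity, which matches the limitation mentioned in the comment.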

schloett commented 8 years ago

@chseifert could you check if this issue is fixed and if yes close it?

chseifert commented 8 years ago

Reactivated my twitter account and checked. Works :)