Sotera / DatawakeDepot

Loopback web application for administration of Datawake networks
Apache License 2.0

HTML/Text Extractor #152

Open bwhiteman opened 8 years ago

bwhiteman commented 8 years ago

We should have a simple extractor that pulls the HTML and extracted body text of a document.

bmcdougald commented 8 years ago

@michaelsframe Don't we already have this via the StanNER extractor? I'm also getting body content persisted and viewable in the trail URL section.

michaelsframe commented 8 years ago

I'm not sure what this is asking. The current trailing process pulls and sends the element of the page to the extractor.

What do you mean by document?

bwhiteman commented 8 years ago

Does it persist the HTML and the plain text "body"?

For future analytics, it would be good to have an extractor that pulls the main body text with something like unfluff and returns it. I had the stanbol extractor doing it but it shouldn't be there.

Once this is done, we can run text analytics on the corpus of pages that has been scraped.
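
As a rough illustration of what such an extractor could return, here is a minimal sketch. The function names and the result shape (`url`, `html`, `text`) are assumptions for illustration, not the Datawake extractor interface, and the tag-stripping function is only a naive stand-in for a real body-text extractor such as unfluff: it removes markup but does not try to find the main article body.

```javascript
// Hypothetical stand-in for a real body-text extractor such as unfluff.
// It only strips markup; it does not detect the "main" article body.
function extractBodyText(html) {
  return html
    // Drop script and style blocks, including their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop all remaining tags.
    .replace(/<[^>]+>/g, ' ')
    // Decode a few common entities (not an exhaustive list).
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&')
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical extractor result: the raw HTML plus the extracted plain text.
function htmlTextExtractor(url, html) {
  return { url: url, html: html, text: extractBodyText(html) };
}
```

In a real extractor the call to `extractBodyText` would presumably be replaced by the chosen library, with the rest of the shape left unchanged.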

michaelsframe commented 8 years ago

If we have the HTML, it seems redundant to store the plain text body: it doubles storage requirements and will slow down URL insertion. Any analytic can retrieve the HTML and unfluff it as necessary, right?

bwhiteman commented 8 years ago

Yes, this is a question of which is more costly: writing the text once and reading it many times, or extracting the text every time.
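
To make the trade-off concrete, here is a small back-of-the-envelope sketch. All parameters are hypothetical per-page costs in the same arbitrary unit (for example milliseconds), not measurements from Datawake, and the model ignores the extra storage and the cost of reading the HTML back.

```javascript
// Compare two strategies for one page, given assumed (hypothetical) costs.
function compareStrategies(costs) {
  // Store once: one extraction and one write at insert time, then cheap reads.
  const storeText = costs.extractCost + costs.writeCost + costs.reads * costs.readCost;
  // Extract on demand: every read re-runs the extractor on the stored HTML.
  const extractEachTime = costs.reads * costs.extractCost;
  return {
    storeText: storeText,
    extractEachTime: extractEachTime,
    cheaper: storeText < extractEachTime ? 'storeText' : 'extractEachTime'
  };
}

// Example with made-up numbers: a page that is analyzed three times.
const example = compareStrategies({ extractCost: 10, writeCost: 5, readCost: 1, reads: 3 });
console.log(example.cheaper);
```

Under these assumptions, storing the text wins once a page is read more than a couple of times, while pages that are rarely analyzed are cheaper to extract on demand.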