ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0

Frontend support #65

Open johipsum opened 7 years ago

johipsum commented 7 years ago

solves #62

Because I needed it quickly, I transformed the .txt files to JSONs and introduced a stopwords require function. This can be bundled with webpack, browserify, etc., and you can use it in browsers or, like me, in an AWS Lambda function (bundled with webpack). Works for me 🙂 but let me know if you have a better idea or a cleaner way to do it.
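For readers following along, a minimal sketch of the .txt-to-JSON conversion described above, assuming the stopwords files list one word per line (the function name is hypothetical, not from the actual PR):

```javascript
// Hypothetical sketch: convert the contents of a stopwords .txt file
// (one word per line) into a JSON array string that can be require()'d
// and picked up by bundlers like webpack or browserify.
function stopwordsTxtToJson(txt) {
  const words = txt
    .split(/\r?\n/)                      // one stopword per line
    .map((line) => line.trim())
    .filter((line) => line.length > 0);  // drop blank lines
  return JSON.stringify(words);
}

console.log(stopwordsTxtToJson("a\nan\nthe\n")); // → ["a","an","the"]
```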

knod commented 7 years ago

Did you try bundling this with browserify? I don't think it'll work - I tried the same thing. I haven't gotten plain browserify to work with variables, only with explicit strings, so something like the stopwords loader I see in the commits won't work.

johipsum commented 7 years ago

@knod unfortunately I only tested it with webpack. It works just fine because of webpack's context feature. I just created a repository with a working example: johipsum/unfluff-browser-test.

I didn't know that browserify can't handle dynamic requires... even though adding tons of require statements, one for every single stopwords JSON, sounds bad, it is probably the "best"/easiest way to support browserify... or does someone have a better idea?
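For context, browserify resolves require() calls statically at bundle time, so a dynamic call like require('./stopwords-' + lang) is invisible to it. A minimal sketch of the explicit-require-map workaround being discussed here, with inline arrays standing in for the real JSON requires (the file names in the comments are illustrative, not the repo's actual layout):

```javascript
// Browserify can only follow require() calls with literal string
// arguments, so each stopwords JSON must be required explicitly.
// Inline arrays stand in for the real requires so this sketch runs:
const stopwordsByLang = {
  en: ["a", "an", "the"],     // would be: require('./stopwords-en.json')
  de: ["der", "die", "das"],  // would be: require('./stopwords-de.json')
};

function getStopwords(lang) {
  const words = stopwordsByLang[lang];
  if (!words) throw new Error("No stopwords bundled for language: " + lang);
  return words;
}

console.log(getStopwords("de")); // → [ 'der', 'die', 'das' ]
```

The downside, as noted above, is one require line per language, but the bundler can then see every dependency.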

knod commented 7 years ago

@johipsum: From what I understand, the only way to support browserify in this kind of situation is by using additional modules. I think there's no ideal solution here, unfortunately, but it'd be great to hear any additional ideas, or even a confirmation that this is the case.

I'd also like thoughts on whether converting the stopwords files to JSON in order to include them in the loop is very different from just adding them, as an array, to one object. Do stopwords libraries often offer their data as JSON? Or are they usually .txt files that would need to be converted?

Another possible option (I haven't really used makefiles much) may be to convert and combine the .txt files into a single JSON object file during make, since make has to be run every time the code changes anyway. Is that feasible?

johipsum commented 7 years ago

I updated the stopwords-loader to support browserify: https://github.com/ageitgey/node-unfluff/pull/65/commits/10c9ac95ef13f425b9e84ff2f8eb0017f74e3e6d ... An additional make task to create the JSONs would be great! We could also generate the stopwords-loader via make ...

knod commented 7 years ago

I also made a couple pull requests with different options. One was very similar to your implementation. Great minds...

mikhailbot commented 7 years ago

@johipsum care to share how you got your unfluff fork to run in Lambda? I keep getting timeouts when installing your fork via NPM.

johipsum commented 7 years ago

@mikhaildelport the default timeout of a Lambda function is 3 seconds. Maybe your unfluff call needs more than that. Have you tried increasing the allowed execution time for your Lambda?

mikhailbot commented 7 years ago

@johipsum yeah, I bumped it up to 5 seconds with no luck, and if it's that slow it's also mostly useless, sadly! Running locally it finishes in under a second. Here's the quick-and-dirty code I used to test it.

https://gist.github.com/mikhaildelport/28060909bbe276d537b328e36142f23b

Edit: So I bumped the timeout to 30 seconds just to see, and it finally completed in 5.5 seconds. Not sure why it's so slow. Is it getting the HTML (in which case I'll move that to the client) or is it the unfluff process?

Edit 2: Logs show it's the unfluff process, sadly.

2017-03-20T12:39:59.329Z    54440e07-0d6a-11e7-aec5-0be0bb66f6c7    unfluffing...
2017-03-20T12:39:59.743Z    54440e07-0d6a-11e7-aec5-0be0bb66f6c7    Got HTML
2017-03-20T12:40:04.902Z    54440e07-0d6a-11e7-aec5-0be0bb66f6c7    Got unfluffed

johipsum commented 7 years ago

@mikhaildelport maybe you can try the lazy extractors ... my Lambda looks more or less like yours, except that I use the lazy functions, and it's almost as fast as on my local machine.
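For anyone landing here: unfluff's lazy mode defers each field's extraction until its accessor is actually called, then caches the result, so a Lambda that only needs one or two fields skips the rest of the work. A self-contained sketch of that compute-on-first-call pattern (lazyField and the fake extractor are illustrative, not unfluff's actual internals):

```javascript
// Memoization pattern behind lazy extractors: wrap an expensive
// computation so it runs at most once, on first access.
function lazyField(compute) {
  let cached;
  let done = false;
  return function () {
    if (!done) {
      cached = compute(); // expensive work happens here, once
      done = true;
    }
    return cached;
  };
}

// Fake "extractor" that counts how often the expensive work runs.
let calls = 0;
const text = lazyField(() => { calls += 1; return "body text"; });

console.log(text());  // → body text (extractor runs)
console.log(text());  // → body text (cached, extractor not re-run)
console.log(calls);   // → 1
```

Fields that are never accessed are never computed at all, which is why the lazy variant can be so much faster for a single-field use case.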

mikhailbot commented 7 years ago

@johipsum I'll keep that in mind! I found a web parser API that works for what I want so I'm going with it for now, thanks for your help!