chimbori / crux

Crux offers a flexible plugin-based API & implementation to extract interesting information from Web pages.
Apache License 2.0
239 stars 43 forks

Preserve <br>? #4

Open 8enmann opened 5 years ago

8enmann commented 5 years ago

I tried extracting https://us13.campaign-archive.com/?u=67bd06787e84d73db24fb0aa5&id=c3e998f811&e=7bc177b38a

and then rendering it. Looks broken since the <br> tags are removed during extraction. Can I send a PR to add them back? What's the best way to do this?
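To make the symptom concrete, here is a minimal sketch of why dropping <br> breaks the rendered output. It is not Crux's code: `BrDemo` and both methods are hypothetical names, and the regex-based stripping stands in for whatever the extractor really does (a real fix would go through an HTML parser).

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from Crux.
final class BrDemo {
  // Matches <br>, <br/>, <br /> in any letter case.
  private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
  // Naive "any tag" matcher; good enough for a demo, not for real HTML.
  private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

  /** Drops every tag, including line breaks, so adjacent lines run together. */
  static String stripAllTags(String html) {
    return ANY_TAG.matcher(html).replaceAll("");
  }

  /** Converts line-break tags to newlines first, then drops the remaining tags. */
  static String stripTagsKeepBreaks(String html) {
    String withNewlines = BR.matcher(html).replaceAll("\n");
    return ANY_TAG.matcher(withNewlines).replaceAll("");
  }

  public static void main(String[] args) {
    String html = "<p>Line one<br>Line two<br/>Line three</p>";
    System.out.println(stripAllTags(html));        // all three lines fused
    System.out.println(stripTagsKeepBreaks(html)); // one line per break
  }
}
```

With the first method the three lines collapse into a single run of text, which matches what the rendered newsletter looks like.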

chimbori commented 5 years ago

Sure, PRs are always welcome! See CONTRIBUTIONS for how to send one.

At a minimum, please add a new test with the content you are trying to parse, and then modify the source so that your new text is parsed correctly (while ensuring that existing tests also continue to pass.)

8enmann commented 5 years ago

Decided to use mozilla/readability instead since it's a bit more robust. Thanks!

chimbori commented 5 years ago

True, it’s been around longer & has more maintainers.

Though it’s written in JavaScript, which rules out many uses of it, especially in Android apps. Crux’s predecessor, Snacktory, started as a Java clone of Readability.

8enmann commented 5 years ago

In Android I'm injecting it into a WebView and extracting output with webview.evaluateJavascript 😅
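Roughly, the WebView approach looks like the sketch below. Assumptions: `ReadabilityBridge` and its methods are hypothetical names, the Readability.js source is already loaded into a string (e.g. from app assets), and the page has finished loading. `WebView.evaluateJavascript(String, ValueCallback<String>)` hands the result back as a JSON-encoded value, so a string result arrives quoted and escaped and needs decoding. The Android call is shown only in a comment so the helper itself stays plain Java.

```java
// Hypothetical helper; plain Java so it can be tried outside Android.
final class ReadabilityBridge {

  /** Wraps the library source in a function that returns the article HTML. */
  static String buildScript(String readabilityJs) {
    return "(function() {\n"
        + readabilityJs + "\n"
        // Readability modifies the DOM it is given, so parse a clone.
        + "var article = new Readability(document.cloneNode(true)).parse();\n"
        + "return article ? article.content : null;\n"
        + "})();";
  }

  /** Decodes the JSON-encoded string passed to the result callback. */
  static String decodeJsResult(String raw) {
    if (raw == null || raw.equals("null")) return null;
    int end = raw.length() - 1;
    if (end < 1 || raw.charAt(0) != '"' || raw.charAt(end) != '"') return raw;
    StringBuilder out = new StringBuilder();
    for (int i = 1; i < end; i++) {
      char c = raw.charAt(i);
      if (c != '\\' || i + 1 >= end) { out.append(c); continue; }
      char e = raw.charAt(++i);
      switch (e) {
        case 'n': out.append('\n'); break;
        case 't': out.append('\t'); break;
        case 'r': out.append('\r'); break;
        case 'b': out.append('\b'); break;
        case 'f': out.append('\f'); break;
        case 'u': // four hex digits follow
          if (i + 4 < end) {
            out.append((char) Integer.parseInt(raw.substring(i + 1, i + 5), 16));
            i += 4;
          } else {
            out.append(e);
          }
          break;
        default: out.append(e); // quote, backslash, forward slash
      }
    }
    return out.toString();
  }

  // On Android (not compiled here), after the page has loaded:
  //   webView.evaluateJavascript(ReadabilityBridge.buildScript(js),
  //       raw -> render(ReadabilityBridge.decodeJsResult(raw)));
  // where render(...) is whatever the app does with the extracted HTML.
}
```

On Android itself, `org.json.JSONTokener` could replace the hand-written decoder; it is written out here only to keep the sketch dependency-free.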