chimbori / crux

Crux offers a flexible plugin-based API & implementation to extract interesting information from Web pages.
Apache License 2.0
239 stars 43 forks source link

Add hint to keep specific elements #14

Closed sigpwned closed 5 years ago

sigpwned commented 5 years ago

I use crux to extract article text. It works great! It's been a huge aid to build products that process article data.

However, we've noticed that sometimes crux will discard content that we wish it would keep. For example, in this article, we would like to retain all the links to twitter statuses, (e.g., https://twitter.com/StationCDRKelly/status/591594008046084096).

In 2.1.0, the default scrape looks like this:

<p><a href="http://www.mashable.com/category/nasa">NASA</a> astronaut Scott Kelly — one of two people participating the International Space Station's <a href="https://mashable.com/2015/03/27/nasa-scott-kelly-a-year-in-space/">first yearlong mission</a> — tweeted out a picture of his room on the ISS, and it is filled with devices. </p> 
<p>Kelly’s photo shows two computers and a tablet set up on a platform across from his green sleeping bag with a restraint that helps keep him from floating around the station as he sleeps.</p> 
<p></p>
<p>See also: <a href="https://mashable.com/2015/04/04/scott-kelly-space-photo/">The view from astronaut Scott Kelly's new home 200 miles above Earth</a></p>
<p></p> 
<p>One of his computers appears to show a picture of Earth; the other is being used to check out news on CNN.</p> 
<blockquote> 
 <p>My <a href="https://twitter.com/hashtag/bedroom?src=hash">#bedroom</a> aboard <a href="https://twitter.com/hashtag/ISS?src=hash">#ISS</a>. All the comforts of <a href="https://twitter.com/hashtag/home?src=hash">#home</a>. Well, most of them. <a href="https://twitter.com/hashtag/YearInSpace?src=hash">#YearInSpace</a> <a href="http://t.co/2Ur09qccLI">pic.twitter.com/2Ur09qccLI</a></p> 
</blockquote> 
<p>The back wall of Kelly’s room is filled with little bits of Velcro, where he can store things to keep them from flying around in the weightless environment of the ISS. At the time the photo was taken, it looks like he was using the Velcro to store a mission patch and a flashlight.</p> 
<p>This isn’t Kelly’s first time living on the space station. He also lived and worked aboard the space laboratory for six months before flying up to the station for his yearlong mission with Russian cosmonaut Mikhail Kornienko.</p> 
<p>Kelly apparently had some trouble sleeping in weightlessness when he flew to space for earlier missions, but he having an easier time adjusting to his new sleeping arrangements this time around:</p> 
<blockquote> 
 <p>.<a href="https://twitter.com/Skamenca">@Skamenca</a> Sleep better on <a href="https://twitter.com/Space_Station">@space_station</a> this time than last. Disorienting when I wake &amp; it's dark. No gravity to tell which way is ↑ or ↓</p> 
</blockquote> 
<p>Kelly's twin brother — <a href="https://mashable.com/2014/03/11/nasa-astronaut-twin-study/">Mark Kelly</a> — is participating in his brother's mission from the ground. NASA mission controllers are monitoring Mark throughout Scott's mission to be able to compare how Scott's body changes during his time in orbit and just after he comes back to Earth in March 2016.</p>

This is a really good scrape. It contains all the text we want, and none of the text we don't. However, it discards the twitter links we want to track.

We could just pull all the twitter links from the whole document ahead of time, but we're only interested in the article content. We only want the twitter links that appear inside the article.

We need a way to ask crux to keep these very specific Twitter links. We could try to modify the crux internals to keep special whitelisted nodes, but crux already works really well, and other peoples' needs are different from ours, so any change we made probably wouldn't be a net positive on the whole. Instead, what we need is a way to communicate to the crux internals that these specific tags are important to us so that crux will keep them. Ideally, we would do that in such a way that every user could decide what's important for themselves, and crux would adapt accordingly. If we could add this feature so that users who don't need it can continue to use the library as if nothing had changed, then even better!

My team explored a few options -- a predicate function for identifying important nodes, a complex callback, etc. -- but we landed on a document annotation approach. It allows multiple pieces of code to collaborate to decide what nodes are important.

This patch adds a "hint" attribute, crux-important, that allows users to mark individual nodes as "important" and not to be removed. I would never expect this attribute to appear in HTML "in the wild," but it's useful when processing content. For example, to keep the twitter links, we could use code like this:

Document doc=Jsoup.parse(html);

// Find any twitter links we want to keep
List<Element> twitterLinks=doc.getElementsByTag("a").stream()
    .filter(a -> a.hasAttr("href"))
    .filter(a -> isTwitterLink(a.attr("href")))
    .collect(Collectors.toList());
for(Element twitterLink : twitterLinks)
    twitterLink.attr("crux-important", "true");

extracted = ArticleExtractor.with(url.toString(), doc)
    .extractContent()
    .extractMetadata()
    .article();

Here, we pre-parse the document, pull out any social links, label them as important, and then use crux to extract the main content from this modified document.

The patch does not change how crux chooses the "top node" for the scrape, so the same element will get chosen no matter how many nodes you label as important and no matter where in the document those important nodes are. Once the top node is chosen, the changes to PostprocessHelpers will prevent the following nodes from being removed from under the top node: (a) nodes that are themselves important; (b) nodes that have an important ancestor. That way, for users who don't use this feature, nothing has changed, but for users who do use this feature, we can preserve and extract only those important nodes under top node. This allows us to extract the twitter links from inside the article content consistently and reliably.

The patch also includes tests to make sure that the discard avoidance behavior works.

The updated library provides the following scrape:

<p><a href="http://www.mashable.com/category/nasa">NASA</a> astronaut Scott Kelly — one of two people participating the International Space Station's <a href="https://mashable.com/2015/03/27/nasa-scott-kelly-a-year-in-space/">first yearlong mission</a> — tweeted out a picture of his room on the ISS, and it is filled with devices. </p> 
<p>Kelly’s photo shows two computers and a tablet set up on a platform across from his green sleeping bag with a restraint that helps keep him from floating around the station as he sleeps.</p> 
<p></p>
<p>See also: <a href="https://mashable.com/2015/04/04/scott-kelly-space-photo/">The view from astronaut Scott Kelly's new home 200 miles above Earth</a></p>
<p></p> 
<p>One of his computers appears to show a picture of Earth; the other is being used to check out news on CNN.</p> 
 <p>My <a href="https://twitter.com/hashtag/bedroom?src=hash">#bedroom</a> aboard <a href="https://twitter.com/hashtag/ISS?src=hash">#ISS</a>. All the comforts of <a href="https://twitter.com/hashtag/home?src=hash">#home</a>. Well, most of them. <a href="https://twitter.com/hashtag/YearInSpace?src=hash">#YearInSpace</a> <a href="http://t.co/2Ur09qccLI">pic.twitter.com/2Ur09qccLI</a></p> 
 <p>— Scott Kelly (@StationCDRKelly) <a href="https://twitter.com/StationCDRKelly/status/591594008046084096">April 24, 2015</a></p> 
<p>The back wall of Kelly’s room is filled with little bits of Velcro, where he can store things to keep them from flying around in the weightless environment of the ISS. At the time the photo was taken, it looks like he was using the Velcro to store a mission patch and a flashlight.</p> 
<p>This isn’t Kelly’s first time living on the space station. He also lived and worked aboard the space laboratory for six months before flying up to the station for his yearlong mission with Russian cosmonaut Mikhail Kornienko.</p> 
<p>Kelly apparently had some trouble sleeping in weightlessness when he flew to space for earlier missions, but he having an easier time adjusting to his new sleeping arrangements this time around:</p> 
<p>.<a href="https://twitter.com/Skamenca">@Skamenca</a> Sleep better on <a href="https://twitter.com/Space_Station">@space_station</a> this time than last. Disorienting when I wake &amp; it's dark. No gravity to tell which way is ↑ or ↓</p> 
<p>— Scott Kelly (@StationCDRKelly) <a href="https://twitter.com/StationCDRKelly/status/591334283081560064">April 23, 2015</a></p> 
<p>Kelly's twin brother — <a href="https://mashable.com/2014/03/11/nasa-astronaut-twin-study/">Mark Kelly</a> — is participating in his brother's mission from the ground. NASA mission controllers are monitoring Mark throughout Scott's mission to be able to compare how Scott's body changes during his time in orbit and just after he comes back to Earth in March 2016.</p>

The twitter links are preserved. Success!

Hopefully this all makes sense. Please let me know if there are any issues with this patch! My team is currently running a forked version of crux in production. It's working well, so we'd like to share this work back to the community. :)

CLAassistant commented 5 years ago

CLA assistant check
All committers have signed the CLA.

sigpwned commented 5 years ago

OK! I've made the following changes to address your (great!) comments:

I have not made these changes yet:

  1. Move shouldKeep() check to Log.printAndRemove, because I'm waiting for you to confirm that you'd like to make that change. Note that if we go to a pre-computed approach, we'd need to change the signature of Log.printAndRemove by passing along the precompute to make it work.
  2. Make shouldKeep() more efficient, because I'm waiting for your feedback on my suggestion.

Thank you for all the feedback! I'm looking forward to hearing what you think about the rest. Really glad this change makes sense to you! :)

chimbori commented 5 years ago

Looks great to me, modulo the issues we’ve talked about (and we’re in agreement about those too).

Unless you’d like to add any more performance optimizations or correctness fixes (e.g. walking down in addition to up), this PR is good to go. Just let me know if you want to add more or if I should merge this as-is.

sigpwned commented 5 years ago

Great! I've already added the logic to look down as well as up. I'll follow up with another PR over the next couple of days to implement the performance improvement. If you're good to merge, I'm good to merge too!