Closed: sigpwned closed this 5 years ago
OK! I've made the following changes to address your (great!) comments:

- Renamed `crux-important` to `crux-keep`
- Check for the presence of `crux-keep`, rather than the value of `crux-keep`
- Renamed `isImportant()` to `shouldKeep()`
- Reformatted `shouldKeep()` to conform to the style guide

I have not made these changes yet:
- Moving the `shouldKeep()` check to `Log.printAndRemove`, because I'm waiting for you to confirm that you'd like to make that change. Note that if we go to a pre-computed approach, we'd need to change the signature of `Log.printAndRemove` by passing along the precompute to make it work.
- Making `shouldKeep()` more efficient, because I'm waiting for your feedback on my suggestion.

Thank you for all the feedback! I'm looking forward to hearing what you think about the rest. Really glad this change makes sense to you! :)
Looks great to me, modulo the issues we’ve talked about (and we’re in agreement about those too).
Unless you’d like to add any more performance optimizations or correctness fixes (e.g. walking down in addition to up), this PR is good to go. Just let me know if you want to add more or if I should merge this as-is.
Great! I've already added the logic to look down as well as up. I'll follow up with another PR over the next couple of days to implement the performance improvement. If you're good to merge, I'm good to merge too!
I use crux to extract article text. It works great! It's been a huge aid in building products that process article data.
However, we've noticed that sometimes crux will discard content that we wish it would keep. For example, in this article, we would like to retain all the links to twitter statuses (e.g., https://twitter.com/StationCDRKelly/status/591594008046084096).
In 2.1.0, the default scrape looks like this:
This is a really good scrape. It contains all the text we want, and none of the text we don't. However, it discards the twitter links we want to track.
We could just pull all the twitter links from the whole document ahead of time, but we're only interested in the article content. We only want the twitter links that appear inside the article.
We need a way to ask crux to keep these very specific Twitter links. We could try to modify the crux internals to keep special whitelisted nodes, but crux already works really well, and other people's needs are different from ours, so any change we made probably wouldn't be a net positive on the whole. Instead, what we need is a way to communicate to the crux internals that these specific tags are important to us so that crux will keep them. Ideally, we would do that in such a way that every user could decide what's important for themselves, and crux would adapt accordingly. If we could add this feature so that users who don't need it can continue to use the library as if nothing had changed, then even better!
My team explored a few options -- a predicate function for identifying important nodes, a complex callback, etc. -- but we landed on a document annotation approach. It allows multiple pieces of code to collaborate to decide what nodes are important.
This patch adds a "hint" attribute, `crux-important`, that allows users to mark individual nodes as "important" so that they are not removed. I would never expect this attribute to appear in HTML "in the wild," but it's useful when processing content. For example, to keep the twitter links, we could use code like this:

Here, we pre-parse the document, pull out any social links, label them as important, and then use crux to extract the main content from this modified document.
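The original example code didn't survive formatting here, so as a rough, hedged sketch of the pre-parse step: the `Tag` class below is a minimal stdlib stand-in for a jsoup node (the real code would parse with jsoup and then hand the annotated document to crux's extractor, whose exact API is not shown):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MarkImportantDemo {
    // Minimal stand-in for an HTML node; in real use this would be a jsoup node.
    static class Tag {
        final String name;
        final Map<String, String> attrs = new HashMap<>();
        final List<Tag> children = new ArrayList<>();

        Tag(String name) { this.name = name; }

        Tag add(Tag child) { children.add(child); return this; }
    }

    // Pre-parse step: walk the tree and label every twitter-status link as
    // important, so that postprocessing will not discard it.
    static int markTwitterLinks(Tag node) {
        int marked = 0;
        String href = node.attrs.get("href");
        if ("a".equals(node.name) && href != null && href.contains("twitter.com/")) {
            node.attrs.put("crux-important", "true");
            marked++;
        }
        for (Tag child : node.children) {
            marked += markTwitterLinks(child);
        }
        return marked;
    }

    public static void main(String[] args) {
        Tag link = new Tag("a");
        link.attrs.put("href", "https://twitter.com/StationCDRKelly/status/591594008046084096");
        Tag article = new Tag("article").add(new Tag("p").add(link));
        System.out.println(markTwitterLinks(article));        // 1
        System.out.println(link.attrs.get("crux-important")); // true
    }
}
```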
The patch does not change how crux chooses the "top node" for the scrape, so the same element will be chosen no matter how many nodes you label as important and no matter where in the document those important nodes are. Once the top node is chosen, the changes to `PostprocessHelpers` will prevent the following nodes from being removed from under the top node: (a) nodes that are themselves important; (b) nodes that have an important ancestor. That way, nothing changes for users who don't use this feature, but users who do can preserve and extract exactly the important nodes under the top node. This allows us to extract the twitter links from inside the article content consistently and reliably.

The patch also includes tests to make sure that the discard-avoidance behavior works.
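The keep check described above — a node survives if it, or any of its ancestors, carries the hint attribute — can be sketched roughly as follows. This is a self-contained stand-in, not the patch's actual `PostprocessHelpers` code; the real version operates on jsoup nodes:

```java
import java.util.HashMap;
import java.util.Map;

public class ShouldKeepDemo {
    // Minimal stand-in for a jsoup node: an attribute map plus a parent pointer.
    static class Node {
        final Map<String, String> attrs = new HashMap<>();
        final Node parent;

        Node(Node parent) { this.parent = parent; }
    }

    // A node survives postprocessing if it, or any ancestor, carries the hint
    // attribute. This sketches the "walk up" check only; the look-down logic
    // mentioned elsewhere in the thread is not shown.
    static boolean shouldKeep(Node node) {
        for (Node n = node; n != null; n = n.parent) {
            if (n.attrs.containsKey("crux-important")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node root = new Node(null);
        Node marked = new Node(root);
        marked.attrs.put("crux-important", "true");
        Node childOfMarked = new Node(marked);
        Node sibling = new Node(root);
        System.out.println(shouldKeep(marked));        // true
        System.out.println(shouldKeep(childOfMarked)); // true (important ancestor)
        System.out.println(shouldKeep(sibling));       // false
    }
}
```

Note that the check tests for the attribute's presence rather than its value, matching the review feedback above.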
The updated library provides the following scrape:
The twitter links are preserved. Success!
Hopefully this all makes sense. Please let me know if there are any issues with this patch! My team is currently running a forked version of crux in production. It's working well, so we'd like to share this work back to the community. :)