Extracting only attributes or text() of an entity?

Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

https://opensource.norconex.com/crawlers

Apache License 2.0


Extracting only attributes or text() of an entity? #258

Closed liar666 closed 8 years ago

liar666 commented 8 years ago

Hi (again!),

In one of my crawlers, I need to extract very specific information. I couldn't find a way to do that, even with the help of the DOMTagger. I'd like to have your opinion if there is a simple way to do what I need or if I need to create my own XPathTagger.

What I need (using XPath notation in code elements below):

1. Extract the @src attribute from an img html-tag, to get the url of an image. E.g. if the source document contains <img src='image.jpeg'/>, I'd like to get "image.jpeg" as a result.
2. Extract the text() part of a tag without the content of the child nodes. E.g.: with the source document being: "text1text2" I would get anything that would look like: "text1", or "text1text2", or ["text1", "text2"]. It's very easy to do with XPath selectors, but I couldn't find a way to do that with the Jsoup/CSS selectors listed at https://jsoup.org/apidocs/org/jsoup/select/Selector.html .

liar666 commented 8 years ago

Concerning both problems, I've been thinking of extracting/importing the [outerH|h]tml of the tag, then using a "postImportProcessor" that would use regexes to extract the info I want (in case 1: get 'outerHtml' + regex to extract the content of the @src attribute; in case 2: get 'html' + remove everything with tags). Do you think that is a good approach?
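To make the regex idea concrete, here is a minimal sketch in plain Java of the two extraction steps. It assumes the outerHtml/html strings are already available. The class name, method names, and sample inputs are hypothetical, and nothing here is Norconex API. Regexes are fragile on real-world HTML (nested children, void elements such as <br>, unquoted attributes), so treat this as a stopgap only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: regex fallback for the two cases.
class RegexFallback {

    // Case 1: quoted src value inside an <img ...> tag (single or double quotes).
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Case 2: a child element together with its content.
    // Assumption: children are not nested inside a child of the same tag name.
    private static final Pattern CHILD_ELEMENT = Pattern.compile(
            "<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>.*?</\\1\\s*>", Pattern.DOTALL);

    // Any remaining tag, for example a void element.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Returns every quoted src value found in <img> tags of the given markup.
    public static List<String> extractImgSrc(String outerHtml) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(outerHtml);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    // Drops child elements (with their content), then strips leftover tags.
    public static String ownText(String innerHtml) {
        String withoutChildren = CHILD_ELEMENT.matcher(innerHtml).replaceAll("");
        return ANY_TAG.matcher(withoutChildren).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(extractImgSrc("<img src='image.jpeg'/>")); // prints [image.jpeg]
        System.out.println(ownText("text1<b>child</b>text2"));        // prints text1text2
    }
}
```

The second method removes whole child elements before stripping tags, because stripping tags alone would leave the children's text in the result.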

essiembre commented 8 years ago

It turns out the DOMTagger only supports extracting values of elements (including children) and not attributes at the moment. I am marking this as a feature request to allow extracting attribute values as well as an element text without its children.

In the meantime, I think your approach may be the best, either in two steps as you describe or with regex alone. A tagger like TextPatternTagger may also do it.

essiembre commented 8 years ago

The Importer module has been updated with a solution. A new snapshot release of HTTP Collector was made with this updated Importer module.

DOMTagger (as well as DOMContentFilter) now supports many new DOM extraction options. You can now get a tag's "own" text (excluding children) with "ownText", and you can get the value of an attribute with "attr(attributeName)".
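As a rough illustration, a configuration using the two new options might look like the sketch below. Only the class name and the "ownText" and "attr(attributeName)" extract values come from this thread. The <tagger>/<dom> layout and the selector/toField attributes are recalled from the DOMTagger documentation and should be checked against it, and the selectors and field names are hypothetical.

```xml
<!-- Sketch only: selectors and target field names are made up for illustration. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <!-- Store the value of the src attribute of each matching img element -->
  <dom selector="img" toField="imageSrc" extract="attr(src)" />
  <!-- Store the text directly inside the matching element, excluding its children -->
  <dom selector="div.description" toField="descriptionOwnText" extract="ownText" />
</tagger>
```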

Give it a try and confirm.

liar666 commented 8 years ago

Just tried 'attr(src)' on an img HTML tag and it worked great, thanks :)

liar666 commented 8 years ago

Hi,

Sorry to reopen this issue since it is solved, but I just noticed that the docs were not fully updated with the new feature :)

Indeed, in page:
https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/DOMTagger.html

You added the full explanation: "Since 2.5.0, it is possible to control what gets extracted..."

But you forgot to update the "XML configuration usage" section:

...
              extract="[text|html|outerHtml]" />
...

essiembre commented 8 years ago

I updated the documentation. Thanks!