Closed: liar666 closed this issue 8 years ago
For both problems, I've been thinking of extracting/importing the [outerHtml|html] of the tag, then using a "postImportProcessor" that would use regexes to extract the info I want (in case 1: get 'outerHtml' + a regex to extract the content of the @src attribute; in case 2: get 'html' + remove everything enclosed in tags).
Do you think that's a good approach?
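For illustration, the regex post-processing step described above might look roughly like the sketch below. It uses only `java.util.regex`; the class and method names (`OuterHtmlPostProcessor`, `extractSrc`, `stripChildElements`) are hypothetical and not part of the Importer API, and the patterns are deliberately naive (they do not handle nested elements of the same name or unquoted attribute values).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of the Norconex Importer API.
class OuterHtmlPostProcessor {

    // Matches src="..." or src='...'; group 2 holds the attribute value.
    private static final Pattern SRC_ATTR =
            Pattern.compile("\\bsrc\\s*=\\s*(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Matches an element with its content, e.g. <span>child</span>.
    // Naive: breaks on nested elements sharing the same tag name.
    private static final Pattern CHILD_ELEMENT =
            Pattern.compile("<(\\w+)\\b[^>]*>.*?</\\1\\s*>", Pattern.DOTALL);

    // Matches any leftover single tag, e.g. <br/>.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Case 1: pull the src value out of an element's outer HTML (null if absent).
    static String extractSrc(String outerHtml) {
        Matcher m = SRC_ATTR.matcher(outerHtml);
        return m.find() ? m.group(2) : null;
    }

    // Case 2: drop child elements (tags and their content) from an element's inner HTML.
    static String stripChildElements(String innerHtml) {
        String withoutChildren = CHILD_ELEMENT.matcher(innerHtml).replaceAll("");
        return ANY_TAG.matcher(withoutChildren).replaceAll("").trim();
    }
}
```

Regexes over HTML are fragile, so this is only reasonable as a stopgap for simple, predictable markup.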
It turns out the DOMTagger only supports extracting values of elements (including children) and not attributes at the moment. I am marking this as a feature request to allow extracting attribute values as well as an element's text without its children.
In the meantime, I think your approach may be the best, either in two steps as you describe or by falling back on regex alone. Using a tagger like TextPatternTagger may also do it.
The Importer module has been updated with a solution. A new snapshot release of HTTP Collector was made with this updated Importer module.
DOMTagger (as well as DOMContentFilter) now supports many new DOM extraction options. You can now get a tag's "own" text (excluding children) with "ownText" and you can get the value of an attribute with "attr(attributeName)".
Give it a try and confirm.
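To make the difference between these extraction modes concrete, here is a small sketch using the JDK's built-in XPath API on a well-formed snippet. It does not use DOMTagger; the mapping of "text", "ownText" and "attr(src)" to XPath expressions is only an analogy, and the class name `ExtractModesSketch` and the sample markup are made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical illustration only; DOMTagger itself is not involved.
class ExtractModesSketch {

    // Evaluates an XPath expression against well-formed XML and returns its string value.
    static String eval(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or XML input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>Title <span>child</span><img src='image.jpeg'/></div>";

        // Text including children (roughly what "text" extracts).
        System.out.println(eval(xml, "string(/div)"));

        // First text node of the element itself (roughly "ownText" in this simple case).
        System.out.println(eval(xml, "/div/text()"));

        // Attribute value (roughly "attr(src)").
        System.out.println(eval(xml, "/div/img/@src"));
    }
}
```

Note that real-world HTML is often not well-formed XML, so this approach is for illustrating the semantics rather than for production extraction.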
Just tried 'attr(src)' on an img HTML tag and it worked great. Thanks :)
Hi,
Sorry to reopen this issue since it is solved, but I just noticed that the docs were not fully updated with the new feature :)
Indeed, on the page:
https://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/DOMTagger.html
You added the full explanation: "Since 2.5.0, it is possible to control what gets extracted..."
But you forgot to update the "XML configuration usage" section:
...
extract="[text|html|outerHtml]" />
...
I updated the documentation. Thanks!
Hi (again!),
In one of my crawlers, I need to extract very specific information. I couldn't find a way to do that, even with the help of the DOMTagger. I'd like your opinion on whether there is a simple way to do what I need or whether I need to create my own XPathTagger.
What I need (using XPath notation in code elements below):
1. The @src attribute of an img HTML tag, to get the URL of an image. E.g. if the source document contains <img src='image.jpeg'/>, I'd like to get "image.jpeg" as a result.
2. The text() part of a tag without the content of the child nodes. E.g.: with the source document being: "