benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

Multiple statements #96

Closed: agguser closed this issue 1 year ago

agguser commented 1 year ago

How do I perform multiple statements (e.g. to both remove <h1> and replace <a> with its href)?

Using multiple -e options does not work (it outputs 2 nested documents; it seems each expression is evaluated independently).

xidel https://example.com --html \
   -e 'x:replace-nodes(//h1, ())' \
   -e 'x:replace-nodes(//a, function($e){ x"{$e/@href}" })'

You have to pipe the output into a second xidel call.

xidel https://example.com --html -e 'x:replace-nodes(//h1, ())' |
   xidel - --html -e 'x:replace-nodes(//a, function($e){ x"{$e/@href}" })'

Is there another way (e.g. using semicolon)?

xidel https://example.com --html -e '
   x:replace-nodes(//h1, ());
   x:replace-nodes(//a, function($e){ x"{$e/@href}" })'

benibela commented 1 year ago

You can use ! or -> like the semicolon

! is the simple map operator, available since XPath 3.0, and only works if //a is not empty

-> is from XPath 4.0

Reino17 commented 1 year ago

@agguser So you can use:

x:replace-nodes(x:replace-nodes(//h1,())//a,string(//a/@href))
x:replace-nodes(//h1,())/x:replace-nodes(.//a,string(//a/@href))
x:replace-nodes(//h1,()) ! x:replace-nodes(.//a,string(//a/@href))
x:replace-nodes(//h1,()) -> x:replace-nodes(//a,string(//a/@href))

@benibela

$ xidel -s example.com.htm --html -e '
  x:replace-nodes(//h1,()) => x:replace-nodes(//a,string(//a/@href))
'

Unlike ->, the XPath 3.1 => operator doesn't seem to work here. Or do you have to use it differently?

And about the output...
From:

[...]
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

To:

[...]
<div>

    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p>https://www.iana.org/domains/example</p>
</div>

</body></html>

I think I've asked you about this whitespace issue before, but I really don't understand why 2 extra newlines are added after the last div and why the newline after the closing body tag is removed.

agguser commented 1 year ago

x:replace-nodes seems to return an empty document if there are no matches. Is this a bug?

$  xidel https://example.com --html -e 'x:replace-nodes(//h2, ())'
<!DOCTYPE html>

benibela commented 1 year ago

string(//a/@href)

That only works if there is only one link

$ xidel -s example.com.htm --html -e '
 x:replace-nodes(//h1,()) => x:replace-nodes(//a,string(//a/@href))
'

That is an abbreviation for

x:replace-nodes(x:replace-nodes(//h1,()), //a, string(//a/@href))

//a are nodes in the old document. They do not exist in the new document, so they cannot be replaced
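
The old-document versus new-document point can be illustrated outside Xidel. What follows is only a rough Python analogy built on the standard library's xml.etree.ElementTree, not Xidel's implementation. The helper `remove_tag` is hypothetical. It assumes, as the explanation above implies, that x:replace-nodes returns a modified copy and leaves its input document untouched.

```python
import copy
import xml.etree.ElementTree as ET


def remove_tag(root, tag):
    """Hypothetical helper: return a modified COPY of root without <tag> elements."""
    new_root = copy.deepcopy(root)
    for parent in list(new_root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return new_root


old_doc = ET.fromstring(
    "<div><h1>Example Domain</h1>"
    "<p><a href='https://www.iana.org/domains/example'>More information...</a></p>"
    "</div>"
)
new_doc = remove_tag(old_doc, "h1")

# Analogous to //a: selected from the ORIGINAL document.
old_links = old_doc.findall(".//a")
# Analogous to .//a evaluated with the new document as context.
new_links = new_doc.findall(".//a")

# The link selected from the old document is not a node of the new document,
# so an operation on the new document has nothing to match against it.
print(any(node is old_links[0] for node in new_doc.iter()))
print(new_links[0].get("href"))
```

In this analogy, selecting the links relative to the new document (`new_links`) corresponds to the `.//a` used in the `/` and `!` variants earlier in the thread.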

x:replace-nodes seems to return an empty document if there are no matches. Is this a bug?

 $  xidel https://example.com --html -e 'x:replace-nodes(//h2, ())'

It needs to know which document to work on. Write x:replace-nodes(/, //h2, ()). Without an explicit document, it uses the document containing //h2, and there is none when //h2 matches nothing.

agguser commented 1 year ago

Thanks for the clarification. The documentation of x:replace-nodes shows only two parameters; please update it.