benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

Cannot remove nodes from html #79

Closed mazznoer closed 2 years ago

mazznoer commented 2 years ago

I want to remove all script elements from an HTML document, but have not succeeded.

$ xidel --version
Xidel 0.9.8
(20180421.6162.1f357eaaf5f3)

http://www.videlibri.de/xidel.html
by Benito van der Zander <benito@benibela.de>
$ xidel file.txt -s --html -e 'x:replace-nodes(//script,())'
<!DOCTYPE html>
Error:
err:XPST0017: unknown function: x:replace-nodes #2
Did you mean: 
  In module http://www.w3.org/2005/xpath-functions:
    replace #3-4:  (string?, string, string) as string;   (string?, string, string, string) as string
$ xidel file.txt -s --html -e 'let $delete := //script return transform(/, function($e){ if ($delete[$e is .]) then () else $e})'
<!DOCTYPE html>
Error:
err:XQDY0025: Duplicate attribute: title in TXQTermConstructorComputed
Possible backtrace:
  $0000000000525D3F: perhaps TXQTermConstructor + 10639 ? but unlikely
  $00000000005233CD: TXQTermConstructor + 29
  $000000000051E405: perhaps TXQTermBinaryOp + 3205 ? but unlikely
  $00000000005233CD: TXQTermConstructor + 29
  $000000000051E405: perhaps TXQTermBinaryOp + 3205 ? but unlikely
  $00000000005233CD: TXQTermConstructor + 29
  $000000000051E405: perhaps TXQTermBinaryOp + 3205 ? but unlikely
  $00000000005233CD: TXQTermConstructor + 29
  $000000000051E405: perhaps TXQTermBinaryOp + 3205 ? but unlikely
  $00000000005233CD: TXQTermConstructor + 29
  $000000000051E405: perhaps TXQTermBinaryOp + 3205 ? but unlikely
  $00000000005233CD: TXQTermConstructor + 29
  $000000000051E405: perhaps TXQTermBinaryOp + 3205 ? but unlikely
  $00000000005233CD: TXQTermConstructor + 29
  $000000000051E405: perhaps TXQTermBinaryOp + 3205 ? but unlikely
  $00000000005233CD: TXQTermConstructor + 29

Call xidel with --trace-stack to get an actual backtrace

Thanks for this useful tool.

Reino17 commented 2 years ago

x:replace-nodes() was introduced with xidel-0.9.9.20201125.7684.

mazznoer commented 2 years ago

With version 0.9.9 it works now, but it still fails on some other HTML input.

$ xidel --version
Xidel 0.9.9
(20210818.8090.c8e45f7fe96e)

http://www.videlibri.de/xidel.html
by Benito van der Zander <benito@benibela.de>

$ xidel file.txt -s --html -e 'x:replace-nodes(//script,())' --color never
<!DOCTYPE html>
Error:
err:XQDY0025: Duplicate attribute: title
in TXQTermConstructorComputed
Possible backtrace:
  $00000000005421E2: perhaps TXQTermConstructor + 11986 ? but unlikely
  $000000000053F473: TXQTermConstructor + 355
  $000000000053A0A1: perhaps TXQTermBinaryOp + 3153 ? but unlikely
  $000000000053F473: TXQTermConstructor + 355
  $000000000053A0A1: perhaps TXQTermBinaryOp + 3153 ? but unlikely
  $000000000053F473: TXQTermConstructor + 355
  $000000000053A0A1: perhaps TXQTermBinaryOp + 3153 ? but unlikely
  $000000000053F473: TXQTermConstructor + 355
  $000000000053A0A1: perhaps TXQTermBinaryOp + 3153 ? but unlikely
  $000000000053F473: TXQTermConstructor + 355
  $000000000053A0A1: perhaps TXQTermBinaryOp + 3153 ? but unlikely
  $000000000053F473: TXQTermConstructor + 355
  $000000000053A0A1: perhaps TXQTermBinaryOp + 3153 ? but unlikely
  $000000000053F473: TXQTermConstructor + 355
  $000000000053A0A1: perhaps TXQTermBinaryOp + 3153 ? but unlikely
  $000000000053F473: TXQTermConstructor + 355

Call xidel with --trace-stack to get an actual backtrace

mazznoer commented 2 years ago

Trying to delete nodes from HTML input that does not contain any such node returns just the doctype. Is this a bug?

<!doctype html>
<html lang="en-US">
<head>
    <meta charset="utf-8">
    <meta http-equiv="x-ua-compatible" content="ie=edge">
    <title>Test</title>
    <meta name="viewport" content="width=device-width, initial-scale=1">
<style>
body {
    background: #fff;
    color: #333;
}
</style>
</head>
<body>
<div id='main'></div>
</body>
</html>
$ xidel test.html -s --html -e 'x:replace-nodes(//script,())' --color never
<!DOCTYPE html>

Reino17 commented 2 years ago

$ xidel file.txt -s --html -e 'x:replace-nodes(//script,())' --color never

What's the content of 'file.txt'?

Trying to delete nodes from HTML input that does not contain any such node returns just the doctype. Is this a bug?

Why would you want to try to remove a non-existing node? Anyway...

https://www.benibela.de/documentation/internettools/xpath-functions.html#x-replace-nodes:

Currently it is implemented trivially by calling x:transform on the document and filtering for $nodes.

I'm not sure how transform() is called (in the background) when doing x:replace-nodes(//script,()), because transform(/,function($x){if (name($x)="script") then () else $x}) has a different outcome. Whether it's a bug or not, Benito would have to answer.

mazznoer commented 2 years ago

What's the content of 'file.txt'?

Here is minimal code for testing.

<!doctype html>
<html lang="en-US">
<head>
    <meta charset="utf-8">
    <meta http-equiv="x-ua-compatible" content="ie=edge">
    <title>Test</title>
    <meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>

<img src="cat.jpg" title="cat" title="cat">
<script></script>

</body>
</html>

err:XQDY0025: Duplicate attribute: title

The cause of this error is in the error message actually.

Why would you want to try to remove a non-existing node?

Just for testing.

Reino17 commented 2 years ago

<img src="cat.jpg" title="cat" title="cat">

I don't know where this minimal code comes from, but a duplicate title attribute (as the error already mentions) is invalid. Remove one.
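
In a larger file it can be hard to spot which element repeats an attribute. The following is a minimal sketch in Python (standard library only, not part of xidel) that lists start tags with repeated attribute names. The function name find_duplicate_attributes and the sample input are made up for illustration, and it assumes that html.parser tokenizes the file closely enough to how xidel reads it.

```python
from collections import Counter
from html.parser import HTMLParser


class _DuplicateAttributeFinder(HTMLParser):
    """Collects start tags in which an attribute name occurs more than once."""

    def __init__(self):
        super().__init__()
        self.duplicates = []  # entries: (line number, tag name, [attribute names])

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; repeated names are kept as-is.
        counts = Counter(name for name, _value in attrs)
        repeated = sorted(name for name, n in counts.items() if n > 1)
        if repeated:
            line, _column = self.getpos()
            self.duplicates.append((line, tag, repeated))


def find_duplicate_attributes(html):
    """Return (line, tag, attribute names) for every tag with repeated attributes."""
    finder = _DuplicateAttributeFinder()
    finder.feed(html)
    finder.close()
    return finder.duplicates


if __name__ == "__main__":
    # Hypothetical sample, modelled on the minimal test file from this thread.
    sample = (
        "<!doctype html>\n"
        "<html><body>\n"
        '<img src="cat.jpg" title="cat" title="cat">\n'
        "<script></script>\n"
        "</body></html>\n"
    )
    for line, tag, names in find_duplicate_attributes(sample):
        print(f"line {line}: <{tag}> repeats {', '.join(names)}")
```

Every element it reports needs one of the repeated attributes removed before x:replace-nodes can rebuild the document.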

benibela commented 2 years ago

That is the unfortunate combination of a lousy HTML parser with a very spec-conformant XQuery processor.

The HTML parser should remove one of the title attributes, but it does not, so the document is invalid.

x:replace-nodes, which is written in XQuery, cannot output an invalid document. You could run x:replace-nodes(//@title, ()) to remove the title attributes.

Trying to delete nodes from HTML input that does not contain any such node returns just the doctype. Is this a bug?

There is a three-argument version of the function for this case, x:replace-nodes(/, //script, ()), which was added recently.

I'm not sure how transform() is called (in the background) when doing x:replace-nodes(//script,()), because transform(/,function($x){if (name($x)="script") then () else $x}) has a different outcome. Whether it's a bug or not, Benito would have to answer.

The two-argument version, x:replace-nodes(//script, ()), calls the three-argument version, roughly as x:replace-nodes(root((//script)[1]), //script, ()).

That way, when the nodes exist, it returns the correct document even when multiple documents are loaded. When no node matches, root(()) is the empty sequence, so there is no document to return.
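
The difference between the two call styles can be modelled outside xidel. The sketch below is a toy illustration in Python using xml.dom.minidom, not xidel's implementation: the function names remove_nodes and remove_nodes_inferred are hypothetical, it only removes nodes rather than replacing them, and unlike the XQuery function it modifies the document in place.

```python
from xml.dom import minidom


def remove_nodes(document, nodes):
    """Three-argument style: the document to return is passed explicitly."""
    for node in nodes:
        node.parentNode.removeChild(node)
    return document


def remove_nodes_inferred(nodes):
    """Two-argument style: the document is inferred from the first node."""
    if not nodes:
        # No first node means no root to look up, so there is nothing to return.
        return None
    return remove_nodes(nodes[0].ownerDocument, nodes)


if __name__ == "__main__":
    # Hypothetical input without any script element.
    doc = minidom.parseString("<html><body><div id='main'/></body></html>")
    scripts = list(doc.getElementsByTagName("script"))

    print(remove_nodes_inferred(scripts))      # no document can be inferred
    print(remove_nodes(doc, scripts).toxml())  # the document comes back unchanged
```

Passing the document explicitly, as in x:replace-nodes(/, //script, ()), avoids the empty result when nothing matches.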

benibela commented 2 years ago

I have fixed it by porting that function from XQuery to Pascal.

It takes over 200 lines of Pascal to do the same as 13 lines of XQuery, but it is also much faster now.