benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

Empty output when using full xpath for a html file #88

Closed Baltazar500 closed 2 years ago

Baltazar500 commented 2 years ago

Hi.

I am getting empty output when using the full XPath on an HTML file.

Expression:


 xidel -se '/html/body/center/table/tr/td' './test.html'

Example file (test.html):


<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN" "">
<html>
  <head>
    <meta charset="utf-8"/>
  </head>
  <body>
    <center>
      <table border="1" cellpadding="5">
        <tr>
          <td>123</td>
        </tr>
        <tr>
          <td>456</td>
        </tr>
      </table>
    </center>
  </body>
</html>

However, when I pipe the same file into xidel instead of passing the file name, everything works:


cat ./test.html|xidel -se '/html/body/center/table/tr/td'
123
456

What is the problem?

benibela commented 2 years ago

The HTML parser is turning the file into:


<html>
  <head>
    <meta charset="utf-8"/>
  </head>
  <body>
    <center>
      <table border="1" cellpadding="5">
        <tbody><tr>
          <td>123</td>
        </tr>
        <tr>
          <td>456</td>
        </tr>
      </tbody></table>
    </center>

</body></html>

So you need xidel -se '/html/body/center/table/tbody/tr/td' './test.html'

That is how it is supposed to be in HTML5, although Xidel does not have an HTML5 parser. It looks like I started implementing parts of HTML5 and then forgot about them...
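
Since the inserted tbody is what breaks the full path, one way to sidestep it is a path that does not care whether a tbody is present. The following is a minimal sketch, not something proposed in this thread: it recreates the sample file in a scratch directory and uses the descendant axis (`//`) below the table. The names `workdir` and `robust_xpath` are made up for illustration, and the xidel calls are skipped if xidel is not installed.

```shell
# Recreate the sample file from the report in a scratch directory
# (workdir is a hypothetical name used only for this sketch).
workdir=$(mktemp -d)
cat > "$workdir/test.html" <<'EOF'
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN" "">
<html>
  <head>
    <meta charset="utf-8"/>
  </head>
  <body>
    <center>
      <table border="1" cellpadding="5">
        <tr>
          <td>123</td>
        </tr>
        <tr>
          <td>456</td>
        </tr>
      </table>
    </center>
  </body>
</html>
EOF

# '//' after table matches tr/td at any depth, so the expression is the
# same whether or not the parser has wrapped the rows in a tbody.
robust_xpath='/html/body/center/table//tr/td'

if command -v xidel >/dev/null 2>&1; then
  # File name ends in .html, so the HTML parser is used.
  xidel -se "$robust_xpath" "$workdir/test.html"
  # No file name on a pipe, so the <?xml declaration selects the XML parser.
  cat "$workdir/test.html" | xidel -se "$robust_xpath"
else
  echo "xidel not found; skipping the queries" >&2
fi
```

Both calls are expected to list the same two cells. The trade-off is that `//` would also match rows of any nested table, which does not matter for this file.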

benibela commented 2 years ago

> However, when I pipe the same file into xidel instead of passing the file name, everything works:
>
> cat ./test.html|xidel -se '/html/body/center/table/tr/td'
> 123
> 456
>
> What is the problem?

The difference is the file name. If Xidel sees the file name test.html, it uses the HTML parser. In the pipe there is no file name, so when it sees <?xml, it uses the XML parser.

If you rename the file to test.xml, it will also use the XML parser.
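
To see which parser was picked for a given invocation, it may help to count the tbody elements in the parsed tree. This is a small sketch under the same assumptions as the explanation above: it rebuilds a shortened copy of the sample file, then runs the same `count(//tbody)` query once with the file name and once through a pipe. The variable names `workdir` and `probe` are hypothetical, and the xidel calls are skipped if xidel is not installed.

```shell
# Shortened copy of the sample file (workdir is a hypothetical name).
workdir=$(mktemp -d)
cat > "$workdir/test.html" <<'EOF'
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN" "">
<html>
  <head>
    <meta charset="utf-8"/>
  </head>
  <body>
    <center>
      <table border="1" cellpadding="5">
        <tr>
          <td>123</td>
        </tr>
      </table>
    </center>
  </body>
</html>
EOF

# Plain XPath: how many tbody elements does the parsed tree contain?
probe='count(//tbody)'

if command -v xidel >/dev/null 2>&1; then
  # With the .html file name the HTML parser runs and inserts a tbody.
  xidel -se "$probe" "$workdir/test.html"
  # On a pipe there is no file name; the <?xml declaration selects the XML parser.
  cat "$workdir/test.html" | xidel -se "$probe"
else
  echo "xidel not found; skipping the queries" >&2
fi
```

Going by the trees shown earlier in this thread, the first call should report one tbody and the second none.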

benibela commented 2 years ago

It is the same in Firefox:

[Two screenshots from Firefox, not preserved]

Baltazar500 commented 2 years ago

> If you rename it to test.xml, it will also use the XML parser

Yes, that works. Forcing the input format also worked:

xidel --input-format=xml -se '/html/body/center/table/tr/td' './test.html'
123
456

Baltazar500 commented 2 years ago

Are any changes expected? Should I close this issue, or will you close it?