benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

Two similar, valid XML files, different results #52

Closed by SmartLayer 4 years ago

SmartLayer commented 4 years ago

What works:

$ xidel --data=https://repo.tokenscript.org/aw.app/2020/06/unicon.tsml --extract 'count(//*:contract)'
1

What doesn't work:

$ git clone git@github.com:AlphaWallet/TokenScript-Examples.git
$ cd TokenScript-Examples/examples/edcon
$ xidel --data=unicon.xml --extract 'count(//*:contract)'
0

The two files (unicon.tsml and unicon.xml) are both valid XML files (against the same schema) with identical content, except that the .tsml file is canonicalised (and signed), while unicon.xml is not.

I haven't been able to narrow this down to the minimal difference that causes the failure, but I'm reporting it first.

benibela commented 4 years ago

It is probably the doctype.

Xidel could not handle that, but I have fixed it in Xidel 0.9.9.

It might also be better to always use --input-format xml-strict if you know the input is valid XML.
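
As a rough sketch of that suggestion applied to the query from the report: the sample file below is a hypothetical stand-in for unicon.xml (its DOCTYPE, namespace URI and element content are made up for illustration), and only the --data, --extract and --input-format options come from this thread. Whether strict parsing alone avoids the doctype problem on versions before 0.9.9 is not confirmed here.

```shell
# Hypothetical stand-in for unicon.xml: a namespaced document with a DOCTYPE.
# The namespace URI, entity and element content are assumptions for illustration.
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE token [
  <!ENTITY note "example entity">
]>
<ts:token xmlns:ts="http://example.org/ts">
  <ts:contract name="demo">&note;</ts:contract>
</ts:token>
EOF

# Count <contract> elements in any namespace, asking Xidel to parse the
# input as strict XML instead of auto-detecting the format.
# The query runs only if xidel is found on PATH.
if command -v xidel >/dev/null 2>&1; then
  xidel --input-format=xml-strict --data=sample.xml --extract 'count(//*:contract)'
else
  echo "xidel not found; skipping the query" >&2
fi
```

Running the same command with and without --input-format=xml-strict on the real file would show whether the parser mode changes the count.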

SmartLayer commented 4 years ago

I'd love to test against Xidel 0.9.9 (not released yet, so I would have to build from source), but realised that it needs a Pascal compiler with a few hundred megabytes of dependencies. I will try to compile it when I am in the office instead of using mobile internet. In the meantime, I will try a 0.9.9 release if there is one. Thanks @benibela

benibela commented 4 years ago

Here are preview builds: https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%20development/

SmartLayer commented 4 years ago

Thanks, 0.9.9 seems to be working correctly!