benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

Using xidel to extract non-ASCII text ? #109

Closed: Baltazar500 closed this issue 10 months ago

Baltazar500 commented 11 months ago

extract() with a regular expression such as (.*) returns only the ASCII text from the HTML and skips the non-ASCII text. If I run curl + iconv and pipe the result into xidel, processing works without problems. An input-encoding option (for the remote server's encoding) is needed; output-encoding/stdin-encoding do not help.

benibela commented 11 months ago

It assumes UTF-8 text

When the input comes from a server, it should automatically be converted to UTF-8 from the encoding in the HTTP headers

Baltazar500 commented 11 months ago

When the input comes from a server, it should automatically be converted to UTF-8 from the encoding in the HTTP headers

It’s the same story with a locally downloaded HTML file.

These do not work:

xidel -se 'json(extract(extract($raw, "var playerParams = \{.*\};"), "\{.*\}")).params()' ./vk.tmp

cat ./vk.tmp|xidel --stdin-encoding=cp1251 -se 'json(extract(extract($raw, "var playerParams = \{.*\};"), "\{.*\}")).params()'

This works:

iconv -f cp1251 ./vk.tmp|xidel -se 'json(extract(extract($raw, "var playerParams = \{.*\};"), "\{.*\}")).params()'

vk.tmp.zip
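
The iconv workaround above can be sketched as a small reusable helper. This is only a minimal sketch, not part of xidel: the function name to_utf8 is hypothetical, and the sample bytes stand in for the windows-1251 content of vk.tmp. It assumes an iconv that knows the cp1251 charset name.

```shell
# Hypothetical helper (not part of xidel): convert a file from a known
# legacy encoding to UTF-8 on stdout, so that a later stage sees UTF-8.
to_utf8() {
    # $1 = source charset as known to iconv (e.g. cp1251), $2 = input file
    iconv -f "$1" -t UTF-8 "$2"
}

# Self-contained demo: the six bytes below spell "Привет" in windows-1251.
sample=$(mktemp)
printf '\317\360\350\342\345\362' > "$sample"
converted=$(to_utf8 cp1251 "$sample")
printf '%s\n' "$converted"
rm -f "$sample"

# With the file from this thread, the pipeline would look like this:
#   to_utf8 cp1251 ./vk.tmp | xidel -se 'json(extract(extract($raw, "var playerParams = \{.*\};"), "\{.*\}")).params()'
```

The helper only wraps the conversion step, so the xidel query itself stays unchanged.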

Reino17 commented 11 months ago

Please update your old Xidel binary! We're at 0.9.9.8842 at the moment.

What you're looking for ultimately is probably...

xidel -s vk.tmp -e 'parse-json(//script/extract(.,"var playerParams = (.+);",1)).params()'
iconv -f cp1251 ./vk.tmp|xidel ...

If I convert 'vk.tmp' to utf-8 (including <meta content="charset=utf-8">), then...

xidel -s vk_utf8.tmp -e 'parse-json(extract($raw,"var playerParams = (.+);",1)).params()'

...using extract() with $raw works fine, but this doesn't work with the windows-1251 'vk.tmp'. Maybe Benito could tell us why that is. But as I said, it is better to use //script instead of $raw anyway.

You could however use 2 instances of Xidel if you really want to use $raw:

xidel -s vk.tmp -e . --output-format=html | \
xidel -se 'parse-json(extract($raw,"var playerParams = (.+);",1)).params()'

...which is essentially:

xidel -s vk.tmp -e . --output-format=html --output-encoding="utf-8" | \
xidel -s --stdin-encoding="utf-8" -e 'parse-json(extract($raw,"var playerParams = (.+);",1)).params()'

benibela commented 11 months ago

<meta http-equiv="content-type" content="text/html; charset=windows-1251" />

Xidel does not support charset windows-1251 on Linux

I have implemented only windows-1252 (latin1) and Unicode

But it can be compiled with uses cwstring in xidel.pas, then it calls libc/iconv to handle unknown charsets
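
For binaries built without cwstring, one possible workaround is to read the charset declared in the meta tag and let iconv do the conversion before xidel sees the data. The following is a rough sketch under assumptions: the function name declared_charset is hypothetical, the grep pattern only covers simple charset=... declarations, and the result is meant for $raw-based queries on piped input, as in the iconv pipe earlier in this thread.

```shell
# Hypothetical helper (not part of xidel): print the first charset name
# declared in an HTML file, e.g. "windows-1251".
declared_charset() {
    # -a: treat the file as text even if it contains non-UTF-8 bytes
    LC_ALL=C grep -aioE "charset=[\"']?[A-Za-z0-9_-]+" "$1" \
        | head -n 1 \
        | sed -E "s/^[^=]*=[\"']?//"
}

# Self-contained demo page: a windows-1251 meta tag plus six bytes that
# spell "Привет" in that encoding.
page=$(mktemp)
printf '<meta http-equiv="content-type" content="text/html; charset=windows-1251" />\n' > "$page"
printf '<script>var t = "\317\360\350\342\345\362";</script>\n' >> "$page"

charset=$(declared_charset "$page")
utf8_text=$(iconv -f "$charset" -t UTF-8 "$page")
printf '%s\n' "$charset"
rm -f "$page"

# With the file from this thread, the pipeline would look like this:
#   iconv -f "$(declared_charset ./vk.tmp)" -t UTF-8 ./vk.tmp | xidel -se '...'
```

If the file declares no charset, declared_charset prints nothing and iconv will fail, so a real script would need a fallback.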

Baltazar500 commented 10 months ago

But it can be compiled with uses cwstring in xidel.pas, then it calls libc/iconv to handle unknown charsets

I don't know much about compiling. Can you build a binary with support for unknown charsets?

Reino17 commented 10 months ago

Please pick your binary here. Judging by the modified/creation datetime it should include this latest commit.

Baltazar500 commented 10 months ago

@Reino17, Same story. Does not work. Need a pipe with iconv :(

benibela commented 10 months ago

apparently stdin-encoding is ignored for $raw and pipes

but the html parser looks at the encoding:

xidel vk.tmp -se 'json(extract(extract((//script[contains(., "playerParams")])[1], "var playerParams = \{.*\};"), "\{.*\}")).params()'

Baltazar500 commented 10 months ago

@benibela, OK. Thank you. This works, including on early releases.