Closed Baltazar500 closed 10 months ago
It assumes UTF-8 text
When the input comes from a server, it should automatically be converted to UTF-8 from the encoding in the HTTP headers
It’s the same story with a locally downloaded htm file.
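To make the conversion described above concrete, here is a minimal Python sketch of reading the charset from an HTTP Content-Type header and re-encoding the body to UTF-8. It is not Xidel's code; the function names and the sample body are made up for illustration.

```python
from email.message import Message


def charset_from_content_type(header, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header names no charset.
    """
    msg = Message()
    msg["Content-Type"] = header
    return msg.get_content_charset() or default


def to_utf8(body, content_type):
    """Re-encode a response body to UTF-8 using the charset from the header."""
    charset = charset_from_content_type(content_type)
    return body.decode(charset).encode("utf-8")


# Hypothetical response body, encoded the way a windows-1251 server would send it.
page = "Привет".encode("cp1251")
utf8 = to_utf8(page, "text/html; charset=windows-1251")
print(utf8.decode("utf-8"))
```

A locally saved file carries no HTTP header, so in that case the charset has to come from somewhere else, for example a command-line option or the page's own meta tag.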
This does not work:
xidel -se 'json(extract(extract($raw, "var playerParams = \{.*\};"), "\{.*\}")).params()' ./vk.tmp
cat ./vk.tmp|xidel --stdin-encoding=cp1251 -se 'json(extract(extract($raw, "var playerParams = \{.*\};"), "\{.*\}")).params()'
This works:
iconv -f cp1251 ./vk.tmp|xidel -se 'json(extract(extract($raw, "var playerParams = \{.*\};"), "\{.*\}")).params()'
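For anyone wondering why the iconv pipe helps, here is a rough Python sketch of the same idea: decode the bytes with the right charset first, then pull out the JSON. This is not what Xidel does internally; the sample data below is a hypothetical stand-in for vk.tmp, and the regex simply mirrors the one in the commands above.

```python
import json
import re

# Same pattern as in the xidel commands: capture the object literal after the assignment.
PLAYER_PARAMS = re.compile(r"var playerParams = (\{.*\});")


def extract_params(raw, encoding):
    """Decode `raw` with `encoding`, then return the "params" member of playerParams."""
    text = raw.decode(encoding)
    match = PLAYER_PARAMS.search(text)
    if match is None:
        raise ValueError("playerParams not found")
    return json.loads(match.group(1))["params"]


# Hypothetical stand-in for vk.tmp: a windows-1251 page with Cyrillic text in the JSON.
sample = 'var playerParams = {"params": [{"title": "Привет"}]};'.encode("cp1251")

# Decoding with the real charset works (this is what `iconv -f cp1251` does up front).
print(extract_params(sample, "cp1251"))

# Treating the same bytes as UTF-8 fails, because cp1251 Cyrillic bytes are not valid UTF-8.
try:
    extract_params(sample, "utf-8")
except UnicodeDecodeError as err:
    print("decoding as UTF-8 failed:", err.reason)
```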
Please update your old Xidel binary! We're at 0.9.9.8842 at the moment.
What you're looking for ultimately is probably...
xidel -s vk.tmp -e 'parse-json(//script/extract(.,"var playerParams = (.+);",1)).params()'
json() has been replaced by parse-json() and json-doc().
Use //script instead of $raw.
There is no need for 2 extract() functions. One instance can extract the JSON text without a problem.
iconv -f cp1251 ./vk.tmp|xidel ...
If I convert 'vk.tmp' to utf-8 (including <meta content="charset=utf-8">), then...
xidel -s vk_utf8.tmp -e 'parse-json(extract($raw,"var playerParams = (.+);",1)).params()'
...using extract() with $raw works fine, but this doesn't work with the windows-1251 'vk.tmp'. Maybe Benito could tell us why that is. But like I said; better to use //script instead of $raw anyway.
You could however use 2 instances of Xidel if you really want to use $raw:
xidel -s vk.tmp -e . --output-format=html | \
xidel -se 'parse-json(extract($raw,"var playerParams = (.+);",1)).params()'
...which is essentially:
xidel -s vk.tmp -e . --output-format=html --output-encoding="utf-8" | \
xidel -s --stdin-encoding="utf-8" -e 'parse-json(extract($raw,"var playerParams = (.+);",1)).params()'
<meta http-equiv="content-type" content="text/html; charset=windows-1251" />
Xidel does not support charset windows-1251 on Linux
I have implemented only windows-1252 (latin1) and Unicode
But it can be compiled with uses cwstring in xidel.pas, then it calls libc/iconv to handle unknown charsets
I don't know much about compilation. Can you build a binary with support for unknown charsets?
Please pick your binary here. Judging by the modified/creation datetime it should include this latest commit.
@Reino17, Same story. Does not work. Need a pipe with iconv :(
apparently stdin-encoding is ignored for $raw and pipes
but the html parser looks at the encoding:
xidel vk.tmp -se 'json(extract(extract((//script[contains(., "playerParams")])[1], "var playerParams = \{.*\};"), "\{.*\}")).params()'
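As a rough illustration of what "the html parser looks at the encoding" can mean, here is a small Python sketch that sniffs the charset from a meta tag and decodes the bytes accordingly. It is only a sketch of the general idea, not Xidel's parser; the regex, the 4096-byte window and the sample page are assumptions made for the example.

```python
import re

# Matches both <meta charset="..."> and <meta ... content="text/html; charset=...">.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)


def sniff_charset(raw, default="utf-8"):
    """Return the charset declared in a <meta> tag near the start of the document."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else default


def decode_html(raw):
    """Decode an HTML byte string using the charset it declares for itself."""
    return raw.decode(sniff_charset(raw), errors="replace")


# Hypothetical page resembling vk.tmp: declared as windows-1251 and encoded that way.
html = (
    '<meta http-equiv="content-type" content="text/html; charset=windows-1251" />'
    '<script>var playerParams = {"params": "Привет"};</script>'
).encode("cp1251")

print(sniff_charset(html))
print(decode_html(html))
```

Going through the parsed document (//script) benefits from this kind of detection, whereas working on the raw bytes does not, which would be consistent with the behaviour reported in this thread.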
@benibela, OK. Thank you. This works, including on early releases.
"extract" with a regular expression (.*) extracts only the ASCII text from the html and skips non-ASCII text. If I use curl + iconv and run xidel on the piped output, there is no problem with processing. An input-encoding option is required (for the remote server encoding); output-encoding/stdin-encoding does not help.