benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

err:XPTY0004 invalid conversion to type singleton - xpath limitation ? #34

Closed: zpimp closed this issue 4 years ago

zpimp commented 4 years ago

I got two separate selects working, but a concat of the two doesn't work. There are more of the first outer div (html/body/div is the node that repeats), but I left only one for simplicity. I understand this may be an XPath limitation.

What can I do? I read somewhere that xidel supports some kind of scripting (for/foreach) for iterating over nodes, but I can't find it.

Or can the children be transformed into siblings? Is that even possible? Would CSS selectors be easier, or XQuery?

Any help is greatly appreciated.

This is the HTML:

<html>
<body>

<div>
<span class="a">
    33
    <div>kg</div>
    234
    <div>m</div>
</span>

<span class="b">
    44
    <div>kg</div>
    345
    <div>m</div>
    5678
    <div>l</div>
</span>
</div>

</body>
</html>

This is the desired output:

33|kg
234|m
44|kg
345|m
5678|l

xidel t2.html -e "html/body/div/span/text()"

33
234

44
345
5678

xidel t2.html -e "html/body/div/span/div/text()"

kg
m
kg
m
l

But combining the two with concat fails:

xidel t2.html -e "html/body/div/concat(span/text(),'|',span/div/text())"

Error:
err:XPTY0004: Invalid conversion from (33
, 
234
, 
, 
44
, 
345
, 
5678
, 
) to type singleton
Q{http://www.benibela.de/2012/pxp/extensions}concat((
33
, 
234
, 
, 
44
, 
345
, 
5678
, 
), "|", (kg, m, kg, m, l))

Thank you for your help.

benibela commented 4 years ago

concat only works on single values.

For a sequence of values you need to use string-join (with just one parameter)
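
To make the distinction concrete, here is a rough Python model of the two functions. It is not xidel code: `xp_concat` and `xp_string_join` are hypothetical names, and the singleton check only imitates the XPTY0004 error shown in the report.

```python
def xp_concat(*args):
    """Model of concat(): each argument must be a single value or empty."""
    parts = []
    for arg in args:
        items = arg if isinstance(arg, (list, tuple)) else [arg]
        if len(items) > 1:
            # Imitates the reported error: a sequence is not a singleton.
            raise TypeError("XPTY0004: cannot convert sequence to singleton")
        # An empty sequence contributes an empty string.
        parts.append(str(items[0]) if items else "")
    return "".join(parts)


def xp_string_join(seq, sep=""):
    """Model of string-join(): takes a whole sequence of values."""
    return sep.join(str(item) for item in seq)


units = ["kg", "m", "kg", "m", "l"]
print(xp_string_join(units))       # one-argument form, no separator
print(xp_concat("33", "|", "kg"))  # fine: three single values
```

Calling `xp_concat(units, "|", units)` raises, which corresponds to what happened in the failing command.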

Reino17 commented 4 years ago

Benito is right of course, but from what I understand you don't want to simply string-join the span values, you want to interleave them:

xidel -s t2.htm -e "//span/text()"

        33

        234

        44

        345

        5678

xidel -s t2.htm -e "//span/text() ! normalize-space(.)"
33
234

44
345
5678

xidel -s t2.htm -e "(//span/text() ! normalize-space(.))[.]"
33
234
44
345
5678
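
The two cleanup steps above (normalizing whitespace, then dropping empty strings) can be sketched in Python. This is only an approximation: `raw_text_nodes` is an assumed reconstruction of the text nodes in the sample HTML, and Python's `split()` treats a few more characters as whitespace than XPath's `normalize-space` does.

```python
def normalize_space(text):
    """Trim the ends and collapse inner whitespace runs to one space."""
    return " ".join(text.split())


# Assumed text nodes of the two spans, whitespace-only ones included.
raw_text_nodes = [
    "\n    33\n    ", "\n    234\n    ", "\n",
    "\n    44\n    ", "\n    345\n    ", "\n    5678\n    ", "\n",
]

# Step 1: like  //span/text() ! normalize-space(.)
normalized = [normalize_space(node) for node in raw_text_nodes]

# Step 2: like the trailing [.] predicate, which drops empty strings
non_empty = [value for value in normalized if value]

print(non_empty)
```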

xidel -s t2.htm -e "//span/div"
kg
m
kg
m
l

xidel -s t2.htm --xquery "
  let $a:=//span/div
  for $x at $i in (//span/text() ! normalize-space(.))[.]
  return
  concat($x,'|',$a[$i])
"
33|kg
234|m
44|kg
345|m
5678|l

or

xidel -s t2.htm --xquery "
  for $x at $i in //span/div
  let $a:=(//span/text() ! normalize-space(.))[.]
  return
  concat($a[$i],'|',$x)
"
33|kg
234|m
44|kg
345|m
5678|l
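
Both queries pair the two sequences by position. A minimal Python sketch of that idea follows; `pair_by_position` is a hypothetical helper, and it assumes the values and units arrive in the same order and in equal numbers, as they do in the sample.

```python
def pair_by_position(values, units, sep="|"):
    """Join values[i] with units[i], like concat($x, '|', $a[$i])."""
    lines = []
    # XQuery positions are 1-based, so count from 1 as "at $i" does.
    for i, value in enumerate(values, start=1):
        # An out-of-range $a[$i] is an empty sequence, modeled here as "".
        unit = units[i - 1] if i <= len(units) else ""
        lines.append(f"{value}{sep}{unit}")
    return lines


values = ["33", "234", "44", "345", "5678"]
units = ["kg", "m", "kg", "m", "l"]
for line in pair_by_position(values, units):
    print(line)
```

If a span ever contained a value without a unit, the positions would drift apart and every later pair would be wrong, so this approach fits only regular input.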

Reino17 commented 4 years ago

Just figured that assigning //span/div to a variable isn't really necessary, because //span/div already contains a sequence:

xidel -s t2.htm --xquery "
  for $x at $i in (//span/text() ! normalize-space(.))[.]
  return
  concat($x,'|',(//span/div)[$i])
"
33|kg
234|m
44|kg
345|m
5678|l

benibela commented 4 years ago

> Benito is right of course, but from what I understand you don't want to simply string-join the span values, you want to interleave them:

Of course. The report is so long that I only skimmed it.

> Just figured that assigning //span/div to a variable isn't really necessary, because //span/div already contains a sequence:

It is better to use the variable method: let variables are very fast and // is very slow.
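
The reasoning behind that advice can be sketched in Python, under the assumption (not confirmed in this thread) that an inline `//` expression is evaluated again on every loop iteration. `Document`, `find_units`, `pair_inline` and `pair_hoisted` are hypothetical names; the counter stands in for the cost of walking the tree.

```python
class Document:
    """Toy document that counts how often the unit lookup runs."""

    def __init__(self, units):
        self._units = units
        self.scans = 0

    def find_units(self):
        # Stand-in for //span/div: pretend each call walks the whole tree.
        self.scans += 1
        return list(self._units)


def pair_inline(doc, values):
    # Like concat($x, '|', (//span/div)[$i]): one lookup per iteration.
    return [f"{v}|{doc.find_units()[i]}" for i, v in enumerate(values)]


def pair_hoisted(doc, values):
    # Like let $a := //span/div: one lookup, reused in the loop.
    units = doc.find_units()
    return [f"{v}|{units[i]}" for i, v in enumerate(values)]


values = ["33", "234", "44", "345", "5678"]
inline_doc = Document(["kg", "m", "kg", "m", "l"])
hoisted_doc = Document(["kg", "m", "kg", "m", "l"])
pair_inline(inline_doc, values)
pair_hoisted(hoisted_doc, values)
print(inline_doc.scans, hoisted_doc.scans)
```

Both functions return the same pairs; only the number of lookups differs.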

You could also interleave with tumbling window

xidel -s t2.htm --xquery '
  for tumbling window $w in //span/(text()|div)!normalize-space()[.] 
  start at $i when true() 
  end at $j when $i ne $j
  return string-join($w, "|")
'
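
For readers new to the window clause, the start and end conditions above amount to cutting the document-order sequence into windows of two items. A rough Python equivalent, where `tumbling_pairs` is a hypothetical name and the item list is assumed to be the already normalized, non-empty values:

```python
def tumbling_pairs(items):
    """Split items into consecutive windows of two.

    Mirrors "start at $i when true() end at $j when $i ne $j":
    each window opens at the next item and closes at the first
    later position, or at the last item if none is left.
    """
    windows = []
    start = 0
    while start < len(items):
        end = min(start + 1, len(items) - 1)
        windows.append(items[start:end + 1])
        start = end + 1
    return windows


# Values and units interleaved in document order.
items = ["33", "kg", "234", "m", "44", "kg", "345", "m", "5678", "l"]
for window in tumbling_pairs(items):
    print("|".join(window))
```

Unlike pairing by index, this relies on each value being directly followed by its unit in the document.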

Reino17 commented 4 years ago

> You could also interleave with tumbling window

Wait, what?! :D Where's that coming from? I can't find anything about it.

Btw, are there any differences between //span/(text()|div) and //span/(text(),div)?

benibela commented 4 years ago

> You could also interleave with tumbling window
>
> Wait, what?! :D Where's that coming from? I can't find anything about it.

It is the XQuery 3.0 window clause.

> Btw, are there any differences between //span/(text()|div)

That sorts the output in document order, i.e. all interleaved here.

> and //span/(text(),div)?

That sorts the output with all text first and then all div.
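
The difference between the two operators can be modeled in Python for a single span taken on its own. This is only a sketch: `span_a`, `union_order` and `comma_order` are hypothetical names, the whitespace-only text nodes are left out for brevity, and how the surrounding `//span/` step combines results across spans is not modeled.

```python
# Children of <span class="a"> in document order, as (kind, value) pairs.
span_a = [("text", "33"), ("div", "kg"), ("text", "234"), ("div", "m")]


def union_order(children):
    """Model of (text() | div): one merged sequence in document order."""
    return [value for kind, value in children if kind in ("text", "div")]


def comma_order(children):
    """Model of (text(), div): all text() results, then all div results."""
    texts = [value for kind, value in children if kind == "text"]
    divs = [value for kind, value in children if kind == "div"]
    return texts + divs


print(union_order(span_a))
print(comma_order(span_a))
```

Only the union form leaves each value next to its unit, which is what the tumbling window query depends on.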