benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

Can I use XQuery FLWOR with Xidel? #39

Closed zpimp closed 4 years ago

zpimp commented 4 years ago

I tried something simple, but I don't know how to read the file. This doesn't work:

for $a in doc("books.xml")//author order by $a/last, $a/first return $a/last

Reino17 commented 4 years ago

http://www.benibela.de/documentation/internettools/xpath-functions.html#fn-doc:

fn:doc($uri as xs:string?) as document-node()?
Retrieves a document using a URI supplied as an xs:string, and returns the corresponding document node. See the XPath/XQuery function reference.
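
For example (a sketch; the URL is just a placeholder), doc() can fetch a remote document directly in-query:

xidel -se 'doc("https://example.com/books.xml")//author'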

If you want to open a local file, then the first thing to do is to let Xidel open the file directly:

xidel -s "books.xml" --xquery '
  for $a in //author
  order by $a/last,$a/first
  return $a/last
'


If you really want to open the file "in-query", then parse-xml(file:read-text("books.xml")) would do.
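
A minimal sketch of that in-query variant (assuming books.xml is in the current working directory):

xidel -s --xquery '
  for $a in parse-xml(file:read-text("books.xml"))//author
  order by $a/last, $a/first
  return $a/last
'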

zpimp commented 4 years ago
xidel -s "wg" --xquery '
  for $d in //tbody
  return count($d/tr) $d
'

How can I return 2 items, and the XPath of $d?
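
As an aside, XQuery returns multiple items as a sequence, so a minimal sketch for the snippet above would be:

xidel -s "wg" --xquery '
  for $d in //tbody
  return (count($d/tr), $d)
'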

What I'm trying to do is loop through all the stuff in the HTML and create a CSV with the important data. I've done some scraping before, and I want to make a generic scraper.

edit: I found this: https://stackoverflow.com/questions/31625472/how-can-i-get-full-path-of-each-node-in-xml-file

So I tried this:

xidel -s "wg" --xquery '
  for $d in //tbody
  where count($d/tr)>30
  return node()/replace(path(), "Q[{][^}]*[}]", "")
'

but it doesn't apply the >30 condition.

This also doesn't work:

xidel -s "wg" --xquery '
  for $d in //tbody
  if (count($d/tr)>30) let $a := $d
  return $d/node()/replace(path(), "Q[{][^}]*[}]", "")
'

Is there some place where I can find more complex examples like this?

2nd edit:

xidel -s "wg" --xquery '
  for $d in //tbody
  where count($d/tr)>30
  return $d/node()/replace(path(), "Q[{][^}]*[}]", "")
'

This seems to list the full paths of all subnodes; I only want the XPaths of the tbody's which contain the tr's. There are 3 tbody's which satisfy the condition (more than 30 tr rows), but it gives me hundreds.

this is the page i tried: https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

3rd edit: I hacked this up, but I feel there has to be a more elegant solution:

xidel -s "wg" --xquery '
  for $d in //*
  where count($d/*)>30
  return $d/*[30]/node()/replace(path(), "Q[{][^}]*[}]", "")
' | sed 's:^\(.*\)/.*$:\1:' | sort | uniq

Getting the 30th subnode, then removing the last child from the XPath, gets the node I wanted in the first place.

4th edit:

xidel -s "wg" --xquery '
  for $d in //*
  return (
    if (count($d/*)>30) then $d/replace(path(), "Q[{][^}]*[}]", "")
    else ()
  )'

I think I solved my problem. I'm posting it here; maybe it will be useful to others. The problem seems to be that $d/node() means a subnode of $d.
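
A minimal illustration of that difference, using the same path()/replace() idiom:

$d/replace(path(), "Q[{][^}]*[}]", "")          (: the path of $d itself :)
$d/node()/replace(path(), "Q[{][^}]*[}]", "")   (: the paths of all child nodes of $d :)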

Reino17 commented 4 years ago

What is "wg"? Is it the HTML source of https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)?

What I'm trying to do is loop through all the stuff in the HTML and create a CSV with the important data.

That's not what your latest query is doing. I'm having a hard time trying to figure out what output you're actually looking for. Please explain.

zpimp commented 4 years ago
$ time ./xidel -s "wg" --xquery '
  for $d in //*
  return (
    if (count($d/*)>30) then $d/replace(path(), "Q[{][^}]*[}]", "")
    else ()
  )'

/html[1]/body[1]/div[1]/div[1]/main[1]/div[3]/div[1]/div[1]/div[2]/table[1]/tbody[1]/tr[2]/td[1]/table[1]/tbody[1]
/html[1]/body[1]/div[1]/div[1]/main[1]/div[3]/div[1]/div[1]/div[2]/table[1]/tbody[1]/tr[2]/td[2]/table[1]/tbody[1]
/html[1]/body[1]/div[1]/div[1]/main[1]/div[3]/div[1]/div[1]/div[2]/table[1]/tbody[1]/tr[2]/td[3]/table[1]/tbody[1]
real    0m0.529s
user    0m0.350s                                
sys     0m0.040s

Yes, that is the URL, and this is the output I need: the nodes which contain other nodes.

It's just a step in how I thought I could get a CSV from HTML.

How would you suggest converting HTML to CSV? Is there a simpler way?

Reino17 commented 4 years ago

Since it's still not clear what you're actually after, I'm going to assume you want to turn the 3 tables on that wiki-page into a CSV list, like:

1,United States,21.439.453
-,European Union,18.705.132
2,China,14.140.163
[...]

This website is not simple! There are of course multiple ways to accomplish this, but this is how I would do it.

I would start off by focusing on the <table class="wikitable sortable" [...]> node that holds the 3 tables.
Next, all the information you need is in the 3rd <tr> node and onward, so that would mean //table[@class="wikitable sortable"]//tr[position() > 2] so far.

-e '//table[@class="wikitable sortable"]//tr[position() > 2]/td[1]'
1
-
2
3
4
5
[...]
192

The index (the 1st td node) looks okay.

-e '//table[@class="wikitable sortable"]//tr[position() > 2]/td[2]//a[@title]'
United States
European Union
China
Japan
Germany
[...]
Tuvalu

The country names look okay too.
((td[2]//text())[2], or in other words the 2nd text-node of the 2nd td node, would work too)
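
Substituted into the same query, that alternative would be (a sketch):

-e '//table[@class="wikitable sortable"]//tr[position() > 2]/(td[2]//text())[2]'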

-e '//table[@class="wikitable sortable"]//tr[position() > 2]/td[3]'
21,439,453

18,705,132

14,140,163

5,154,475

3,863,344

[...]

38

The GDP values don't look okay yet. The newline needs to be removed from every value, and in addition I would replace the commas with dots to prevent confusion in a CSV list.

-e '//table[@class="wikitable sortable"]//tr[position() > 2]/replace(replace(td[3],"\r\n?|\n",""),",",".")'
21.439.453
18.705.132
14.140.163
5.154.475
3.863.344
[...]
38

(x:lines(td[3])[1] would work too to remove the new lines. x:lines() is a shorthand for tokenize(.,"\r\n?|\n"))
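
Substituted into the query above, that variant would read (a sketch):

-e '//table[@class="wikitable sortable"]//tr[position() > 2]/replace(x:lines(td[3])[1],",",".")'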

The final step is to join all values together separated by a comma:

xidel -s "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)" -e '
  //table[@class="wikitable sortable"]//tr[position() > 2]/join(
    (
      td[1],
      td[2]//a[@title],
      replace(
        replace(td[3],"\r\n?|\n",""),
        ",","."
      )
    ),
    ","
  )
'
1,United States,21.439.453
-,European Union,18.705.132
2,China,14.140.163
3,Japan,5.154.475
4,Germany,3.863.344
5,India,2.935.570
[...]
192,Tuvalu,38

zpimp commented 4 years ago

You are right, I'm trying to get those tables into CSVs, but I don't want to make a specific scraper for this page; I want to make a generic scraper, and this page was for testing.

My idea is that relevant data in a page is normally in the form of subnodes of a key node (or 3 in this case); it's either td's in a table or div's in some other node.

There was a website that I saw do this, scraper.io or something, not sure, I can't find it right now. It didn't get all the data and was limited, but that's where I got my idea from.

This is not a fit-all solution; there are other cases, and some data resides in attributes, not nodes. Right now I can get the XPaths of subnodes only, but not the XPaths of their attributes.

I hope I was able to explain in more detail what I want. I realise it's a bit more work, but I think it's doable.

2nd edit:

time ./xidel cfch.html -e '/html[1]/body[1]/div[1]/div[1]/div[2]/div[8]/*[22]/(*|*//*|@*|*//@*|text()|*//text())/replace(path(), "Q[{][^}]*[}]", "")'

I managed to put this together. It seems to get all the XPaths of subnode [22]: all the immediate subnodes and those at any depth, and likewise attributes and text.

But this means I have to process the file every time for each value, and if the file is too big it will take a while. For one single value it takes 0.7 seconds; it's a 6 MB file with 4000 nodes which have about 130 subnodes/text/attributes each. That means 4000 (nodes) × 130 (subnodes) × 0.7 (sec per run) / 3600 = 101 hours, if I'm not mistaken.

What I'm thinking now is to process the page in one go, listing every XPath and value, and from there I'll do it in bash/perl.

This is what I've come up with, but it's only listing the XPath, not the value:

time ./xidel -s "cfch.html" --xquery '
  for $d in /html[1]/body[1]/div[1]/div[1]/div[2]/div[8]/*
  return (
    $d/(*|*//*|@*|*//@*|text()|*//text())/replace(path(), "Q[{][^}]*[}]", "")
  )' | wc -l
496328

real    0m55.978s
user    0m55.579s
sys 0m0.531s

3rd edit: I need something like this. This is a smaller page, same site: it gives me what I need, but for each node there are 130 lines with XPaths and then 130 lines with values; I need them to be "xpath, value" on each line. I'm not getting this XQuery stuff :)

time ./xidel -s "cf69.html" --xquery '
  for $d in /html[1]/body[1]/div[1]/div[1]/div[2]/div[8]/*
  return (
    let $e := $d/(*|*//*|@*|*//@*|text()|*//text())
    return (
      $e/replace(path(), "Q[{][^}]*[}]", ""), $e
    )
  )' | wc -l
13074

real    0m0.121s
user    0m0.107s
sys 0m0.020s

4th edit: I think I found what I needed. This prints "xpath ;value" for each subnode in the page; from here it's just text processing.

time ./xidel -s "cf69.html" --xquery '
  for $d in /html[1]/body[1]/div[1]/div[1]/div[2]/div[8]/*
  return (
    for $e in $d/(*|*//*|@*|*//@*|text()|*//text())
    return concat(
      $e/replace(path(), "Q[{][^}]*[}]", ""), " ;", $e
    )
  )'

Thanks for your help!

Reino17 commented 4 years ago

I want to make a generic scraper.

Impossible, if you ask me. Every website is different.

real 0m55.978s

Such an approach is really inefficient!

zpimp commented 4 years ago

In the end it comes down to aligning a CSV, 'cause a missing node offsets the data. I'm not trying to make it completely automatic; there would be some input needed. For example, there is a lot of unneeded crap like classes and ids; these columns have to be selected for elimination manually.

edit: results are promising. From 4008 lines, 233 are not aligned, which is 5.8%; so if I'm correct it's 94% successful, in about 25 seconds :)

There is a div with an image which messes everything up. Now I need a command line program to align offset lines; there are some columns that have the same value on each line, but they are offset because of the missing nodes.

Can I list all XPaths in a document, but with class names, div[classname]/div[classname], instead of XPath order, div[1]/div[2]?
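
path() itself only produces the positional form, but a hand-rolled sketch (assuming the relevant elements carry a class attribute; //tbody is just an example target) could build such a class-based path:

xidel -s "wg" --xquery '
  for $d in //tbody
  return string-join(
    $d/ancestor-or-self::*/(
      if (@class) then concat(name(), "[", @class, "]") else name()
    ),
    "/"
  )
'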

2nd edit: if I remove the offending node (another 54 sec, so about 2 minutes in total), I get 47/4008 lines not aligned, which is 98% right.