benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

Need help building JSON from variables #29

Closed dleeftink closed 5 years ago

dleeftink commented 5 years ago

I am experimenting with combining different extraction methods, in this case CSS selectors, templates and XPath/XQuery (using Powershell). Following this answer on Stack Overflow, it seems variables (in my case, "$header", "$time", "$length" and "$author") can be written to JSON using the file:write-text and serialize-json functions; however, I cannot get them to work.

I know JSON formatting can be written directly in multipage templates, but I am specifically trying to combine different extraction methods through Powershell. My question, then, is: how can I build a JSON file with Xidel from the following script?

.\xidel links.html -f //a -e "header:=css('div.section-content div h1')" -e '<time datetime={$time}></time>' -e '<span class="readingTime" title={$length}></span>' -e author:='distinct-values(//a[@data-user-id])'

Reino17 commented 5 years ago

I think it would help if you showed the content of 'links.html' and what the JSON you're trying to make should look like.

dleeftink commented 5 years ago

This is probably not the intended use, but links.html contains a list of <a href=""></a> tags, so that a custom list of links can be followed and scraped, like so:

<a href="https://onezero.medium.com/this-is-silicon-valley-3c4583d6e7c2"></a>
<a href="https://towardsdatascience.com/how-to-learn-data-science-if-youre-broke-7ecc408b53c7"></a>
<a href="https://towardsdatascience.com/one-neural-network-many-uses-image-captioning-image-search-similar-image-and-words-in-one-model-1e22080ce73d"></a>
etc.

In the script in my initial post, I've determined various elements to be extracted from Medium.com articles. I've opted for combining different extraction methods, as Medium articles contain a plethora of dynamic attributes and selectors, making it difficult to go for straight Xidel templating. At least, I haven't figured out how to adapt the template system to complex pages yet (it must also be said that my actual script contains more elements to be scraped than the ones from my initial post).

dleeftink commented 5 years ago

And the intended JSON output for each link followed in the links.html file:

{ "header": $var, "time": $var, "length": $var, "author": $var }

I know Medium provides a structured excerpt of each article in the footer, but I'd like to see if I can build the JSON from Xidel variables myself, as this can be adapted to other websites as well.

dleeftink commented 5 years ago

For anyone looking for a Powershell solution, append the following to your xidel -e "variable := xquery/xpath/css selector/whatever" script:

--xquery "serialize-json({| ('variable_1', 'variable_2', 'variable_3', 'etc') ! {.: get(.)} |})" | select-string -pattern '^{' | convertfrom-json | convertto-json | out-file your_file.json -encoding oem

It pipes the complete output to Powershell, selects the lines that begin with a { curly bracket, prettifies the output by converting from/to JSON and finally writes it to a file.
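
For readers outside Powershell, the same filter-and-prettify steps can be sketched in Python. This is only an illustration under assumptions: the sample text and the function name prettify_json_lines are invented here, and it assumes each JSON object arrives on a single line that begins with {, which is what the select-string pattern relies on.

```python
import json

def prettify_json_lines(raw_output: str) -> list[str]:
    """Keep lines starting with '{' and re-serialize each as indented JSON."""
    pretty = []
    for line in raw_output.splitlines():
        if line.startswith("{"):                      # like select-string -pattern '^{'
            obj = json.loads(line)                    # like convertfrom-json
            pretty.append(json.dumps(obj, indent=2))  # like convertto-json
    return pretty

# Hypothetical mixed output: a plain extraction line plus one JSON line.
sample = 'Some header\n{"header": "Some header", "length": "4 min read"}\n'
for block in prettify_json_lines(sample):
    print(block)
```

Lines that do not start with { (the plain variable output) are simply dropped, which is why the filter step comes before parsing.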

I still do not entirely understand the formatting within serialize-json though, what do the pipe | and ! symbols within the {curly braces} indicate? (I am no programmer)

Reino17 commented 5 years ago

The following leaves your multiple-extraction-method-question still unanswered, but this is how I would create a json from medium.com articles with the information you require:

xidel -s https://medium.com/topic/editors-picks --xquery "[let $a:=json(//script/extract(.,'APOLLO_STATE__ = (.+)',1)[.]) for $x in $a()[starts-with(.,'Post:')][not(contains(.,'quotes'))] return $a($x)/{'title':title,'author':$a(creator/id)/name,'date':updatedAt div 1000 * duration('PT1S') + dateTime('1970-01-01T00:00:00'),'length':concat(ceiling(readingTime),' min read')}]"
[
  {
    "title": "Facebook Is Eroding Trust in Two-Factor Authentication",
    "author": "Eric Ravenscraft",
    "date": "2019-03-08T16:00:26.843",
    "length": "4 min read"
  },
  {
    "title": "The Case for Visiting the Outer Planets",
    "author": "Shannon Stirone",
    "date": "2019-03-07T18:49:47.306",
    "length": "4 min read"
  },
  {
    "title": "How I Became Addicted to an On-Demand Gig",
    "author": "Steve Cordrey",
    "date": "2019-03-08T13:01:00.441",
    "length": "8 min read"
  },
  {
    "title": "The Smartest Questions to Ask Your Doctor",
    "author": "Rae Nudson",
    "date": "2019-03-06T14:01:01.077",
    "length": "5 min read"
  },
  {
    "title": "Browser Tabs Are Ruining Your Brain",
    "author": "Angela Lashbrook",
    "date": "2019-03-06T14:01:01.213",
    "length": "8 min read"
  },
  {
    "title": "Are AirPods and Other Bluetooth Headphones Safe?",
    "author": "Markham Heid",
    "date": "2019-03-07T16:11:57.625",
    "length": "4 min read"
  },
  {
    "title": "I'm Happily Child-Free but I Still Support Universal Daycare",
    "author": "Meghan Daum",
    "date": "2019-03-06T21:34:00.384",
    "length": "7 min read"
  },
  {
    "title": "The Value of Inconvenient Design",
    "author": "Jesse Weaver",
    "date": "2019-03-07T23:08:51.96",
    "length": "8 min read"
  },
  {
    "title": "Why People Buy $30 Power Cords Against All Reason",
    "author": "Foster Kamer",
    "date": "2019-03-07T17:07:17.024",
    "length": "8 min read"
  },
  {
    "title": "The Best Strategies to Boost Your Willpower",
    "author": "Maggie Puniewska",
    "date": "2019-03-07T14:01:01.053",
    "length": "6 min read"
  }
]
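
The 'date' expression in the query divides updatedAt by 1000 and adds that many seconds to 1970-01-01T00:00:00, which implies updatedAt is a Unix timestamp in milliseconds. A rough Python equivalent, for illustration only (the function name and the sample value are made up here, and UTC is assumed):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def millis_to_datetime(updated_at_ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    # Mirrors: updatedAt div 1000 * duration('PT1S') + dateTime('1970-01-01T00:00:00')
    return EPOCH + timedelta(milliseconds=updated_at_ms)

# Hypothetical sample value, not read from Medium's data.
print(millis_to_datetime(1552060826843).isoformat())
```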

This works for https://medium.com/topic/editors-picks as well as https://medium.com/topic/members. The same query, indented for readability:

--xquery "
  [
    let $a:=json(
      //script/extract(
        .,
        'APOLLO_STATE__ = (.+)',
        1
      )[.]
    )
    for $x in $a()[starts-with(.,'Post:')][not(contains(.,'quotes'))]
    return $a($x)/{
      'title':title,
      'author':$a(creator/id)/name,
      'date':updatedAt div 1000 * duration('PT1S') + dateTime('1970-01-01T00:00:00'),
      'length':concat(
        ceiling(readingTime),
        ' min read'
      )
    }
  ]
"

benibela commented 5 years ago

And the intended JSON output for each link followed in the links.html file: { "header": $var, "time": $var, "length": $var, "author": $var }

The simplest way is to just put the variables there (without get) and append

 --xquery ' { "header": $header, "time": $time, "length": $length, "author": $author } '

to the initial script. (I do not know whether Powershell needs ' or " quotes here.)

The unneeded lines can be hidden with --extract-exclude header,time,length,author at the beginning.

file:write-text and serialize-json functions; however I cannot get them to work.

should be like

 --xquery  `file:write-text("outputfile", serialize-json({ "header": $header, "time": $time, "length": $length, "author": $author }))`

although in this case it needs to be file:append-text, or you only get the last line, since write-text overwrites the file on every call.
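
The difference between overwriting and appending can be seen in a small Python analogy (this is not Xidel code; the file name and records are placeholders). Opening the file in 'w' mode once per record leaves only the last record, while 'a' mode keeps all of them:

```python
import json
import os
import tempfile

records = [{"header": "first"}, {"header": "second"}]  # placeholder data

def dump_all(path: str, mode: str) -> list[str]:
    """Write each record in its own open() call, like one call per followed link."""
    if os.path.exists(path):
        os.remove(path)  # start from a clean file for a fair comparison
    for rec in records:
        with open(path, mode, encoding="utf-8") as fh:
            fh.write(json.dumps(rec) + "\n")
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "outputfile")
    print(dump_all(target, "w"))  # overwrite: only the last record remains
    print(dump_all(target, "a"))  # append: every record is kept
```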

I still do not entirely understand the formatting within serialize-json though, what do the pipe | and ! symbols within the {curly braces} indicate? (I am no programmer)

You can think of ! as repeating the expression on the right side for every value on the left side, replacing . with the current value, i.e.

 serialize-json({| ('variable_1', 'variable_2', 'variable_3', 'etc') ! {.: get(.)} |})

is a shortcut that is expanded to

  serialize-json({| { 'variable_1': get('variable_1') }, { 'variable_2': get('variable_2') }, {'variable_3': get('variable_3')}, {'etc': get('etc')}  |})

| is not a pipe in this case. Think of {| .. |} as another way to write a JSON object, one that is formed by combining all the JSON objects contained within it.
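
As an analogy in Python (not Xidel syntax), the two steps look roughly like this. The variables dictionary and the get helper are stand-ins invented for the illustration: first build one single-key object per name, then merge those objects into one.

```python
def get(name: str, variables: dict) -> object:
    """Stand-in for looking up a variable by its name."""
    return variables[name]

variables = {"variable_1": "a", "variable_2": "b", "variable_3": "c"}  # placeholder values
names = ("variable_1", "variable_2", "variable_3")

# The '!' step: evaluate the right-hand expression once per item on the left,
# giving one single-key object per name.
singles = [{name: get(name, variables)} for name in names]

# The '{| ... |}' step: merge all of those objects into a single object.
merged = {}
for obj in singles:
    merged.update(obj)

print(merged)
```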

dleeftink commented 5 years ago

Thanks @Reino17 and @benibela, both methods work great. In both cases however, Powershell keeps all variables in memory, rather than writing/appending to a file and overwriting the original variables. Is it possible to clear all variables between/after each extraction loop to keep memory usage down?

dleeftink commented 5 years ago

Bump

benibela commented 5 years ago

Which variables is Powershell keeping?

Xidel itself keeps most variables for as long as it runs: all HTML pages it has downloaded, all variables not set with let or for, and some caches. There is the x:garbage-collect() function to free a few variables, but only a few.

benibela commented 5 years ago

From the 20190521.6878.c8b00ac4ad40 dev build onward, Xidel releases the memory of unused documents.

Variables from the pattern matching can only be deleted with x:clear-log("variablename")