dleeftink closed this issue 5 years ago
I think it would help if you showed the content of 'links.html' and what the JSON you're trying to make should look like.
This is probably not the intended use, but links.html contains a list of <a href=""></a> tags, so that a custom list of links can be followed and scraped, like so:
<a href="https://onezero.medium.com/this-is-silicon-valley-3c4583d6e7c2"></a>
<a href="https://towardsdatascience.com/how-to-learn-data-science-if-youre-broke-7ecc408b53c7"></a>
<a href="https://towardsdatascience.com/one-neural-network-many-uses-image-captioning-image-search-similar-image-and-words-in-one-model-1e22080ce73d"></a>
etc
In the script in my initial post, I've determined various elements to be extracted from Medium.com articles. I've opted for combining different extraction methods, as Medium articles contain a plethora of dynamic attributes and selectors, making it difficult to go for straight Xidel templating. At least, I haven't figured out how to adapt the template system to complex pages yet (it must also be said that my actual script contains more elements to be scraped than the ones from my initial post).
And the intended JSON output for each link followed in the links.html file:
{ "header": $var, "time": $var, "length": $var, "author": $var }
I know Medium provides a structured excerpt of each article in the footer, but I'd like to see if I can build a JSON from Xidel variables myself, as this can be adapted to other websites as well.
For anyone looking for a PowerShell solution, append the following to your xidel -e "variable := xquery/xpath/css selector/whatever" script:
--xquery "serialize-json({| ('variable_1', 'variable_2', 'variable_3', 'etc') ! {.: get(.)} |})" | select-string -pattern '^{' | convertfrom-json | convertto-json | out-file your_file.json -encoding oem
It pipes the complete output to PowerShell, selects the lines that begin with a { curly bracket, prettifies the output by converting from/to JSON, and finally writes it to a file.
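For readers who do not use PowerShell, the same filter-and-prettify idea can be sketched in Python. This is only an analogy to the pipeline above, not a replacement for it: the function name prettify_json_lines and the sample input are made up for illustration, and it assumes every JSON object sits on a single line of the Xidel output.

```python
import json

def prettify_json_lines(raw_output):
    """Keep only lines that start with '{' and re-serialize each one with indentation."""
    pretty = []
    for line in raw_output.splitlines():
        if line.startswith("{"):  # same idea as select-string -pattern '^{'
            # Parse and re-serialize, like convertfrom-json | convertto-json
            pretty.append(json.dumps(json.loads(line), indent=2))
    return pretty

# Invented sample: one status line, one JSON line, one stray line
sample = 'some status line\n{"header": "A title", "time": "2019-03-08"}\nanother line'
result = prettify_json_lines(sample)
```

Each entry in result could then be written to a file, which is the part out-file takes care of in the PowerShell version.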
I still do not entirely understand the formatting within serialize-json though. What do the pipe | and ! symbols within the {curly braces} indicate? (I am no programmer)
The following leaves your multiple-extraction-method-question still unanswered, but this is how I would create a json from medium.com articles with the information you require:
xidel -s https://medium.com/topic/editors-picks --xquery "[let $a:=json(//script/extract(.,'APOLLO_STATE__ = (.+)',1)[.]) for $x in $a()[starts-with(.,'Post:')][not(contains(.,'quotes'))] return $a($x)/{'title':title,'author':$a(creator/id)/name,'date':updatedAt div 1000 * duration('PT1S') + dateTime('1970-01-01T00:00:00'),'length':concat(ceiling(readingTime),' min read')}]"
[
{
"title": "Facebook Is Eroding Trust in Two-Factor Authentication",
"author": "Eric Ravenscraft",
"date": "2019-03-08T16:00:26.843",
"length": "4 min read"
},
{
"title": "The Case for Visiting the Outer Planets",
"author": "Shannon Stirone",
"date": "2019-03-07T18:49:47.306",
"length": "4 min read"
},
{
"title": "How I Became Addicted to an On-Demand Gig",
"author": "Steve Cordrey",
"date": "2019-03-08T13:01:00.441",
"length": "8 min read"
},
{
"title": "The Smartest Questions to Ask Your Doctor",
"author": "Rae Nudson",
"date": "2019-03-06T14:01:01.077",
"length": "5 min read"
},
{
"title": "Browser Tabs Are Ruining Your Brain",
"author": "Angela Lashbrook",
"date": "2019-03-06T14:01:01.213",
"length": "8 min read"
},
{
"title": "Are AirPods and Other Bluetooth Headphones Safe?",
"author": "Markham Heid",
"date": "2019-03-07T16:11:57.625",
"length": "4 min read"
},
{
"title": "I'm Happily Child-Free but I Still Support Universal Daycare",
"author": "Meghan Daum",
"date": "2019-03-06T21:34:00.384",
"length": "7 min read"
},
{
"title": "The Value of Inconvenient Design",
"author": "Jesse Weaver",
"date": "2019-03-07T23:08:51.96",
"length": "8 min read"
},
{
"title": "Why People Buy $30 Power Cords Against All Reason",
"author": "Foster Kamer",
"date": "2019-03-07T17:07:17.024",
"length": "8 min read"
},
{
"title": "The Best Strategies to Boost Your Willpower",
"author": "Maggie Puniewska",
"date": "2019-03-07T14:01:01.053",
"length": "6 min read"
}
]
This works for https://medium.com/topic/editors-picks as well as https://medium.com/topic/members.
--xquery "
[
let $a:=json(
//script/extract(
.,
'APOLLO_STATE__ = (.+)',
1
)[.]
)
for $x in $a()[starts-with(.,'Post:')][not(contains(.,'quotes'))]
return $a($x)/{
'title':title,
'author':$a(creator/id)/name,
'date':updatedAt div 1000 * duration('PT1S') + dateTime('1970-01-01T00:00:00'),
'length':concat(
ceiling(readingTime),
' min read'
)
}
]
"
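The 'date' and 'length' entries in the query do a bit of arithmetic that may be easier to follow outside XQuery. Below is a rough Python equivalent, assuming updatedAt is a Unix timestamp in milliseconds and readingTime is a fractional number of minutes (both inferred from the query, not confirmed by Medium). The function names and the input values are invented for illustration.

```python
import math
from datetime import datetime, timedelta

def epoch_ms_to_datetime(updated_at_ms):
    """Mirror of: updatedAt div 1000 * duration('PT1S') + dateTime('1970-01-01T00:00:00')."""
    return datetime(1970, 1, 1) + timedelta(milliseconds=updated_at_ms)

def reading_length(reading_time_minutes):
    """Mirror of: concat(ceiling(readingTime), ' min read')."""
    return f"{math.ceil(reading_time_minutes)} min read"

# Illustrative inputs only, not taken from the actual page data
stamp = epoch_ms_to_datetime(1552060826843).isoformat(timespec="milliseconds")
length = reading_length(3.2)
```

Here stamp comes out as 2019-03-08T16:00:26.843 and length as "4 min read", which is the same shape as the first record in the output above.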
$a
.$a()[starts-with(.,'Post:')][not(contains(.,'quotes'))]
returns:
Post:d74b6e68452f
Post:2a8baf6017d8
Post:b12757820524
Post:ccf4151b845b
Post:214a0449e13a
Post:ec3513687f02
Post:f0ae1773dd77
Post:94fb6cbcc298
Post:5eb1b2d9af2b
Post:68a9b36f3e4b
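To make the key filtering and the author lookup a little more concrete, here is a small Python sketch of the same logic. The dictionary below is an invented stand-in for the parsed APOLLO_STATE JSON; its exact shape on Medium is an assumption based on how the query reads it (top-level keys like 'Post:...', and a creator/id that points at another top-level key).

```python
def select_posts(state):
    """Pick the 'Post:' entries (skipping keys that contain 'quotes') and resolve each author."""
    records = []
    for key, post in state.items():
        # Same test as [starts-with(.,'Post:')][not(contains(.,'quotes'))]
        if key.startswith("Post:") and "quotes" not in key:
            creator = state[post["creator"]["id"]]  # like $a(creator/id)
            records.append({"title": post["title"], "author": creator["name"]})
    return records

# Hypothetical data, only shaped like what the query expects
state = {
    "Post:aaa111": {"title": "Example article", "creator": {"id": "User:u1"}},
    "Post:aaa111.quotes": {"unrelated": True},
    "User:u1": {"name": "Jane Doe"},
}
posts = select_posts(state)
```

With this sample, posts holds a single record for "Example article" by "Jane Doe"; the key containing 'quotes' and the user entry are skipped.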
And the intended JSON output for each link followed in the links.html file: { "header": $var, "time": $var, "length": $var, "author": $var }
The simplest way is to just put the variables there (without get) and append
--xquery ' { "header": $header, "time": $time, "length": $length, "author": $author } '
to the initial script. (I do not know whether PowerShell needs ' or " quotes here.)
The unneeded lines can be hidden with --extract-exclude header,time,length,author at the beginning.
file:write-text and serialize-json functions; however I cannot get them to work.
It should look like this:
--xquery 'file:write-text("outputfile", serialize-json({ "header": $header, "time": $time, "length": $length, "author": $author }))'
although in this case it needs to be file:append-text, or you only get the last line, since write-text overwrites the file.
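The difference between overwriting and appending is easy to see in a small Python sketch. It is only an analogy for file:write-text versus file:append-text; the helper names, file names and sample records are invented, and it assumes one record is written per followed link, as in the loop over links.html.

```python
import json
import os
import tempfile

def save_records(path, records, mode):
    """Write each record as one JSON line, reopening the file for every record."""
    for record in records:
        with open(path, mode, encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")

def count_lines(path):
    """Count the lines that ended up in the file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)

records = [{"header": "First article"}, {"header": "Second article"}]  # invented sample data
with tempfile.TemporaryDirectory() as folder:
    overwritten = os.path.join(folder, "write.json")
    appended = os.path.join(folder, "append.json")
    save_records(overwritten, records, "w")  # "w" truncates on every open, like write-text
    save_records(appended, records, "a")     # "a" adds to the end, like append-text
    write_count = count_lines(overwritten)
    append_count = count_lines(appended)
```

The file opened in "w" mode ends up with one line (the last record), while the one opened in "a" mode keeps both.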
I still do not entirely understand the formatting within serialize-json though, what do the pipe | and ! symbols within the {curly braces} indicate? (I am no programmer)
You can think of ! as repeating the expression on the right side for every value on the left side, replacing . with the current value. I.e.,
serialize-json({| ('variable_1', 'variable_2', 'variable_3', 'etc') ! {.: get(.)} |})
is a shortcut that is expanded to
serialize-json({| { 'variable_1': get('variable_1') }, { 'variable_2': get('variable_2') }, {'variable_3': get('variable_3')}, {'etc': get('etc')} |})
| is not a pipe in this case. Think of {| .. |} as another way to write a JSON object, one that is formed by combining all the JSON objects contained within.
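If it helps, the two steps can be imitated in Python: first build one single-pair object per name (the ! part), then merge them all into one object (the {| .. |} part). This is a loose analogy rather than how Xidel works internally; build_object, the variables dictionary and its values are invented for the example, with variables.get standing in for Xidel's get().

```python
def build_object(names, lookup):
    """Imitate {| names ! {.: get(.)} |}: one small object per name, then merge them."""
    # The '!' step: evaluate the right-hand side once for every name on the left
    singletons = [{name: lookup(name)} for name in names]
    # The '{| ... |}' step: combine all the small objects into a single one
    merged = {}
    for obj in singletons:
        merged.update(obj)
    return merged

# Invented variable values standing in for what xidel -e would have extracted
variables = {"header": "A title", "time": "2019-03-08", "length": "4 min read"}
combined = build_object(["header", "time", "length"], variables.get)
```

The result is one object with the keys header, time and length, which is what serialize-json then turns into text.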
Thanks @Reino17 and @benibela, both methods work great. In both cases however, Powershell keeps all variables in memory, rather than writing/appending to a file and overwriting the original variables. Is it possible to clear all variables between/after each extraction loop to keep memory usage down?
Bump
Which variables is Powershell keeping?
Xidel itself keeps most data for as long as it runs: all HTML pages it has downloaded, all variables not set with let or for, and some caches. There is the x:garbage-collect() function to free a few variables, but only a few.
From the 20190521.6878.c8b00ac4ad40 dev build onward, it will release the memory of unused documents.
Variables from the pattern matching can only be deleted with x:clear-log("variablename")
I am experimenting with combining different extraction methods, in this case CSS selectors, templates and XPath/XQuery (using PowerShell). Following this answer on Stack Overflow, it seems variables (in my case, "$header", "$time", "$length" and "$author") can be written to JSON by using the file:write-text and serialize-json functions; however I cannot get them to work.
I know JSON formatting can be written directly in multipage templates, but I am specifically trying to combine different extraction methods through PowerShell. My question then is, how can I build a JSON file with Xidel from the following script:
.\xidel links.html -f //a
-e "header:=css('div.section-content div h1')"
-e '<time datetime={$time}></time>'
-e '<span class="readingTime" title={$length}></span>'
-e author:='distinct-values(//a[@data-user-id])'