Open skanga opened 8 years ago
The below works. If the nth-child is different for some reason, I would do multiple selects into a table buffer. One might be empty but the other won't. Case statements are also supported in the select clause.
select
pick ''
from download page 'https://sfbay.craigslist.org/nby/mcy/5623911440.html'
where nodes = '.mapAndAttrs .attrgroup:nth-child(3) span'
The below might get you started. I wrote this code to go through the pages and get all the details for a motorcycle.
create buffer totalCount(count int)
insert into totalCount
select
pick '.totalcount'
from download page 'https://sfbay.craigslist.org/search/nby/mcy'
create buffer pageUrls(url string)
insert into pageUrls
select 'https://sfbay.craigslist.org/search/nby/mcy'
each(var c in totalCount) {
var pagesCounts = c.count / 100
insert into pageUrls
select 'https://sfbay.craigslist.org/search/nby/mcy?s=' + value
from expand (1 to pagesCounts){$ * 100}
}
create buffer detailsUrls(url string)
insert into detailsUrls
select
'https://sfbay.craigslist.org' + pick 'a.hdrlnk' take attribute 'href'
from download page (select url from pageUrls) with (thread(5))
where nodes = '.content .rows p'
var detailDownloads = download page (select url from detailsUrls) with (thread(10)) --download all pages
create buffer motorCycle(url string, title string)
insert into motorCycle
select
url,
pick '#titletextonly'
from detailDownloads
create buffer motorCycleDetails(url string, metric string)
insert into motorCycleDetails
select
url,
pick ''
from detailDownloads
where nodes = '.mapAndAttrs .attrgroup:nth-child(3) span'
insert into motorCycleDetails
select
url,
pick ''
from detailDownloads
where nodes = '.mapAndAttrs .attrgroup:nth-child(2) span'
select m.url, title, metric
from motorCycle m
join motorCycleDetails d on d.url = m.url
Does this nth-child
way solve the
a) the order may be different and b) some fields may be missing.
problems?
Seems not so convincible.
So in this case I believe it does. Since we are selecting all the spans and if some metrics are missing they won't be in the spans. We are also selecting the full text so each row would be:
engine displacement (CC): 800 fuel: gas odometer: 17500 paint color: red title status: clean
So it is category:value pairs
Thank you for the explanation, I just found a windows machine and executed your script, It works just right, but... Can we get something like this:
url | engine | fuel | odometer | paint color |
---|---|---|---|---|
http://www.example.com/moto?id=1 | 800 | gas | 17500 | red |
http://www.example.com/moto?id=2 | 800 | gas | NULL | blue |
http://www.example.com/moto?id=3 | 800 | gas | 20500 | NULL |
You can run on the mac with mono. You have to download the command line.
What about the below at the end of the script. I don't have group by implemented but once in SQL you can do a group by and do a min or max on all the columns to flatten them out. I have regular expressions implemented but right now they only work on pick statements but that would be easy to add so it could cleanup the text inside the case statements to eliminated the value pair once that is done.
create buffer final(url string, motorCycle string, condition string, engine string, fuel string, odometer string, paint string, title string, transission string)
insert into final
select
m.url,
title,
case when metric like '%condi' then metric else '' end,
case when metric like '%engine' then metric else '' end,
case when metric like '%fuel' then metric else '' end,
case when metric like '%odomet' then metric else '' end,
case when metric like '%paint' then metric else '' end,
case when metric like '%title' then metric else '' end,
case when metric like '%trans' then metric else '' end
from motorCycle m
join motorCycleDetails d on d.url = m.url
select *
from final
@breeve1 This will work, thank you.
Consider this HTML snippet
In this case how do you extract "columns" for fuel, odometer, transmission, etc especially in the case when a) the order may be different and b) some fields may be missing.
Note: this snippet is taken from https://sfbay.craigslist.org/nby/mcy/5623911440.html