bitsummation / pickaxe

SQL Based DSL Web Scraper/Screen Scraper
Apache License 2.0
155 stars 15 forks source link

How to parse if css selector is non-unique #4

Open skanga opened 8 years ago

skanga commented 8 years ago

Consider this HTML snippet

<p class="attrgroup"><span><b>2001 KAWASAKI DRIFTER</b></span></p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br><span>engine displacement (CC): <b>800</b></span>
<br><span>fuel: <b>gas</b></span>
<br><span>odometer: <b>17500</b></span>
<br><span>paint color: <b>red</b></span>
<br><span>title status: <b>clean</b></span>
<br><span>transmission: <b>manual</b></span><br>
</p>

In this case how do you extract "columns" for fuel, odometer, transmission, etc especially in the case when a) the order may be different and b) some fields may be missing.

Note: this snippet is taken from https://sfbay.craigslist.org/nby/mcy/5623911440.html

breeve1 commented 8 years ago

The below works. If the nth-child is different for some reason, I would do multiple selects into a table buffer. One might be empty but the other won't. Case statements are also supported in the select clause.

select
    pick ''
from download page 'https://sfbay.craigslist.org/nby/mcy/5623911440.html'
where nodes = '.mapAndAttrs .attrgroup:nth-child(3) span'
breeve1 commented 8 years ago

The below might get you started. I wrote this code to go through the pages and get all the details for a motorcycle.

create buffer totalCount(count int)

insert into totalCount
select 
    pick '.totalcount'
from download page 'https://sfbay.craigslist.org/search/nby/mcy'

create buffer pageUrls(url string)

insert into pageUrls
select 'https://sfbay.craigslist.org/search/nby/mcy'

each(var c in totalCount) {

var pagesCounts = c.count / 100

insert into pageUrls
select 'https://sfbay.craigslist.org/search/nby/mcy?s=' + value
from expand (1 to pagesCounts){$ * 100}
}

create buffer detailsUrls(url string)

insert into detailsUrls
select
    'https://sfbay.craigslist.org' + pick 'a.hdrlnk' take attribute 'href'
from download page (select url from pageUrls) with (thread(5))
where nodes = '.content .rows p'

var detailDownloads = download page (select url from detailsUrls) with (thread(10)) --download all pages

create buffer motorCycle(url string, title string)
insert into motorCycle
select
    url, 
    pick '#titletextonly'
from detailDownloads

create buffer motorCycleDetails(url string, metric string) 

insert into motorCycleDetails
select
    url,
    pick ''
from detailDownloads
where nodes = '.mapAndAttrs .attrgroup:nth-child(3) span'

insert into motorCycleDetails
select
    url,
    pick ''
from detailDownloads
where nodes = '.mapAndAttrs .attrgroup:nth-child(2) span'

select m.url, title, metric
from motorCycle m
join motorCycleDetails d on d.url = m.url
VitoVan commented 8 years ago

Does this nth-child way solve the

a) the order may be different and b) some fields may be missing.

problems?

Seems not so convincible.

breeve1 commented 8 years ago

So in this case I believe it does. Since we are selecting all the spans and if some metrics are missing they won't be in the spans. We are also selecting the full text so each row would be:

engine displacement (CC): 800 fuel: gas odometer: 17500 paint color: red title status: clean

So it is category:value pairs

VitoVan commented 8 years ago

Thank you for the explanation, I just found a windows machine and executed your script, result It works just right, but... Can we get something like this:

url engine fuel odometer paint color
http://www.example.com/moto?id=1 800 gas 17500 red
http://www.example.com/moto?id=2 800 gas NULL blue
http://www.example.com/moto?id=3 800 gas 20500 NULL
breeve1 commented 8 years ago

You can run on the mac with mono. You have to download the command line.

What about the below at the end of the script. I don't have group by implemented but once in SQL you can do a group by and do a min or max on all the columns to flatten them out. I have regular expressions implemented but right now they only work on pick statements but that would be easy to add so it could cleanup the text inside the case statements to eliminated the value pair once that is done.

create buffer final(url string, motorCycle string, condition string, engine string, fuel string, odometer string, paint string, title string, transission string)

insert into final
select
    m.url,
    title,
    case when metric like '%condi' then metric else '' end,
    case when metric like '%engine' then metric else '' end,
    case when metric like '%fuel' then metric else '' end,
    case when metric like '%odomet' then metric else '' end,
    case when metric like '%paint' then metric else '' end,
    case when metric like '%title' then metric else '' end,
    case when metric like '%trans' then metric else '' end
from motorCycle m
join motorCycleDetails d on d.url = m.url

select *
from final
VitoVan commented 8 years ago

@breeve1 This will work, thank you.