dinubs / jam-api

Parse web pages using CSS query selectors
http://www.jamapi.xyz
Other

Support for more attributes? #5

Closed kasbah closed 8 years ago

kasbah commented 8 years ago

Would it be possible to select by other tag attributes? I am currently looking at parsing sites that have a lot of unique itemprop attributes, such as <td itemprop="description">IC MCU 8BIT 32KB FLASH 44TQFP</td> from this site, but hardly any ids or classes. I think these are ASP sites.

omgmog commented 8 years ago

Behind the scenes, Jam API uses Cheerio.

So you can target a specific element quite easily using attribute selectors, or even structural pseudo-class selectors.

For example:

To select any <td itemprop="description">:

td[itemprop=description]

To get the second td in each row, for every tr from the third tr onward:

tr:nth-child(n+3) td:nth-child(2)

Or to get the first td among its siblings (that is, the first td in each row):

td:first-of-type
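
For reference, :nth-child(an+b) matches every sibling whose 1-based position equals a*n + b for some integer n >= 0, so n+3 covers positions 3, 4, 5 and so on. The sketch below only illustrates that arithmetic; the helper name nth_child_matches is made up for this example and is not part of Cheerio or Jam API.

```python
def nth_child_matches(a: int, b: int, sibling_count: int) -> list[int]:
    """Return the 1-based sibling positions matched by :nth-child(an+b)."""
    matched = []
    for position in range(1, sibling_count + 1):
        # A position matches when position == a*n + b for some integer n >= 0.
        offset = position - b
        if a == 0:
            if offset == 0:
                matched.append(position)
        elif offset % a == 0 and offset // a >= 0:
            matched.append(position)
    return matched


# tr:nth-child(n+3) in a table with 6 rows
rows_selected = nth_child_matches(1, 3, 6)
# td:nth-child(2) in a row with 4 cells
cells_selected = nth_child_matches(0, 2, 4)

print(rows_selected)   # [3, 4, 5, 6]
print(cells_selected)  # [2]
```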

Also, although it is not clearly documented, I believe you can return the value of an attribute instead of the element's text by defining the JSON structure like this:

"elem": "td[itemprop]",
"value": "itemprop"

This is sort of referenced in the example:

{
  "title": "title",
  "logo": ".nav-logo img",
  "paragraphs": [{ "elem": ".home-post h1", "value": "text"}], 
  "links": [{"elem": ".home-post > a:first-of-type", "location": "href"}]
}

kasbah commented 8 years ago

Oh, yeah.. neat!

More concretely, you can run:

curl -d url=http://www.digikey.co.uk/product-detail/en/atmel/ATMEGA32U4-AU/ATMEGA32U4-AU-ND/1914602 -d json_data='{"description":"td[itemprop=description]"}' http://www.jamapi.xyz/

to get

{
    "description": "\n                                                IC MCU 8BIT 32KB FLASH 44TQFP\n                                            "
}
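
The same request can be sketched in Python using only the standard library. This assumes the service accepts a form-encoded POST with the url and json_data fields exactly as in the curl command, and that it is still reachable. The helper names build_jam_request and strip_values are made up for this example. Unlike curl -d, urlencode percent-encodes the field values, which a form-decoding server should treat the same way.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl command; whether it is still online is not verified.
JAM_API_URL = "http://www.jamapi.xyz/"


def build_jam_request(page_url: str, selectors: dict) -> urllib.request.Request:
    """Build a form-encoded POST carrying the `url` and `json_data` fields."""
    form = urllib.parse.urlencode({
        "url": page_url,
        "json_data": json.dumps(selectors),
    }).encode("utf-8")
    return urllib.request.Request(JAM_API_URL, data=form, method="POST")


def strip_values(result: dict) -> dict:
    """Trim the surrounding whitespace that the scraped text carries."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in result.items()
    }


request = build_jam_request(
    "http://www.digikey.co.uk/product-detail/en/atmel/ATMEGA32U4-AU/ATMEGA32U4-AU-ND/1914602",
    {"description": "td[itemprop=description]"},
)

# Sending is left commented out because it needs network access:
# with urllib.request.urlopen(request) as response:
#     print(strip_values(json.load(response)))
```

Applying strip_values to the response shown above would reduce the description to just the text, without the leading and trailing newlines and spaces.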

I guess the real issue, then, is that this should be better documented.