kasbah closed this 8 years ago
Behind the scenes, this Jam API is using Cheerio.
So you can target a specific element quite easily using attribute selectors, or even structural pseudo-class selectors, e.g.
To select any <td itemprop="description">:
td[itemprop=description]
To get the second td in every tr, starting from the third tr:
tr:nth-child(n+3) td:nth-child(2)
Or to get the first td among its siblings (i.e. the first cell of each row):
td:first-of-type
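The `an+b` argument of `:nth-child` is easy to misread, so here is a small sketch of the arithmetic behind it. It does not use Cheerio at all; `nth_child_matches` is a hypothetical helper written only for illustration, and positions are 1-based as in CSS.

```python
def nth_child_matches(a: int, b: int, index: int) -> bool:
    """Return True if 1-based sibling position `index` matches :nth-child(an+b)."""
    if a == 0:
        # Plain :nth-child(b) matches exactly one position.
        return index == b
    # index = a*n + b must hold for some integer n >= 0.
    n, remainder = divmod(index - b, a)
    return remainder == 0 and n >= 0


rows = range(1, 7)  # pretend the table has six rows

# tr:nth-child(n+3) corresponds to a=1, b=3
print([i for i in rows if nth_child_matches(1, 3, i)])  # [3, 4, 5, 6]

# td:nth-child(2) corresponds to a=0, b=2
print([i for i in rows if nth_child_matches(0, 2, i)])  # [2]
```

So `tr:nth-child(n+3)` includes the third row itself, which is handy for skipping two header rows.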
Also, although it is not clearly documented, I believe you can return the value of an attribute instead of the element's text by defining the JSON structure like this:
"elem": "td[itemprop]",
"value": "itemprop"
This is sort of hinted at in the example:
{
"title": "title",
"logo": ".nav-logo img",
"paragraphs": [{ "elem": ".home-post h1", "value": "text"}],
"links": [{"elem": ".home-post > a:first-of-type", "location": "href"}]
}
Oh, yeah.. neat!
More concretely, you can run:
curl -d url=http://www.digikey.co.uk/product-detail/en/atmel/ATMEGA32U4-AU/ATMEGA32U4-AU-ND/1914602 -d json_data='{"description":"td[itemprop=description]"}' http://www.jamapi.xyz/
to get
{
"description": "\n IC MCU 8BIT 32KB FLASH 44TQFP\n "
}
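For anyone scripting this rather than using curl, a rough Python equivalent might look like the sketch below. The endpoint and the `url` / `json_data` form fields are taken from the curl command above; `build_jam_request`, `clean_result` and `fetch` are hypothetical helper names, and I have not checked whether the service still responds. The cleanup step just strips the surrounding whitespace seen in the response.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

JAM_API_URL = "http://www.jamapi.xyz/"  # endpoint from the curl command


def build_jam_request(page_url: str, selectors: dict) -> Request:
    """Build the form-encoded POST that `curl -d url=... -d json_data=...` sends."""
    body = urlencode({"url": page_url, "json_data": json.dumps(selectors)})
    return Request(JAM_API_URL, data=body.encode("ascii"), method="POST")


def clean_result(result: dict) -> dict:
    """Strip leading/trailing whitespace from top-level string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in result.items()}


def fetch(page_url: str, selectors: dict) -> dict:
    """Send the request and return the cleaned JSON response (needs network)."""
    with urlopen(build_jam_request(page_url, selectors), timeout=30) as resp:
        return clean_result(json.load(resp))


# Offline check of the cleanup step, using the response shown above:
print(clean_result({"description": "\n IC MCU 8BIT 32KB FLASH 44TQFP\n "}))
# {'description': 'IC MCU 8BIT 32KB FLASH 44TQFP'}
```

Calling `fetch(product_url, {"description": "td[itemprop=description]"})` with the Digi-Key URL should then mirror the curl call, assuming the API behaves as in the response above.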
I guess the actual bug here is that this should be better documented.
Would it be possible to select by other tag attributes? I am currently looking at parsing sites that have a lot of unique itemprop attributes like:
<td itemprop="description">IC MCU 8BIT 32KB FLASH 44TQFP</td>
from this site but hardly any ids or classes. I think these are ASP sites.