dinubs / jam-api

Parse web pages using CSS query selectors
http://www.jamapi.xyz

What is the correct syntax for python? #17

Closed ekt1701 closed 8 years ago

ekt1701 commented 8 years ago

Hello, I cannot find the correct syntax. I have tried this:

import requests
r = requests.post('http://www.jamapi.xyz', 'url = "http://www.radcircle.com"', 'json_data = {"title": "title"}')
print r.text

and get the error message: { "statusCode": 400, "error": "Bad Request", "message": "Invalid request payload JSON format" }

If I change it to this:

r = requests.post('http://www.jamapi.xyz', 'url = "http://www.radcircle.com"', json_data = '{"title": "title"}')

I get this: TypeError: request() got an unexpected keyword argument 'json_data'

dinubs commented 8 years ago

I don't know python very well, but from looking at the requests documentation for a bit, I think doing something like this may work.

import requests
payload = {'url': 'http://www.radcircle.com', 'json_data': '{"title": "title"}'}

r = requests.post("http://www.jamapi.xyz", data=payload)
print(r.json())

From the looks of it, you have to make an object for the form parameters. If this works for you, feel free to close this issue, and I'll also add this to the code samples in the readme.
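A note on why the dictionary version behaves differently, offered as a sketch rather than a statement about how the service is implemented. In requests, the second positional argument of requests.post is data and the third is json, so the strings in the first attempt were sent as a raw body instead of as named form fields, and json_data is not a keyword that requests accepts. The helper names build_payload and preview_form_body below are hypothetical, and urlencode from the standard library is used only to preview roughly what data=payload would send; it assumes the service reads ordinary form fields named url and json_data, as in the snippet above.

```python
from urllib.parse import parse_qs, urlencode


def build_payload(page_url, json_data):
    """Return the two form fields used in the working example above."""
    return {"url": page_url, "json_data": json_data}


def preview_form_body(payload):
    """Approximate the form-encoded body produced by data=payload."""
    return urlencode(payload)


payload = build_payload("http://www.radcircle.com", '{"title": "title"}')
body = preview_form_body(payload)

# Encoded form body, then the same fields decoded back for inspection.
print(body)
print(parse_qs(body))
```

Passing the same dictionary as data=payload to requests.post, as in the snippet above, lets requests do this encoding itself.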

ekt1701 commented 8 years ago

Thank you so much, your code works for me.

BTW, using http://www.radcircle.com gives a database error: {u'title': u'Database Error'}

dinubs commented 8 years ago

Yeah, I think there's an issue with their site, but you can sub http://www.radcircle.com with any other url and it'll still work (so long as the page has a title element on it).

ekt1701 commented 8 years ago

Hello, using jamapi, how can I extract the following data in Python? I still don't understand how to code the json correctly. The actual page is http://earthquaketrack.com/us-ca-los-angeles/recent and I would like to get the first 5 events, not all 19 events. Thanks in advance.

<h4 class='title text-muted'>1.9 magnitude earthquake</h4>
  <p>

    <abbr class="timeago" title="2016-09-20T23:15:40Z">
      2016-09-20 23:15:40 UTC
    </abbr>
    at 23:15 <br/>September 20, 2016 UTC
  </p>
  <p>
    <strong>Location:</strong><br/>
    Epicenter at 33.915, -118.304
      <br/>
      0.2 km from
  <a href="/us-ca-gardena/recent">Gardena</a>
    (0.2 miles)

  </p>

dinubs commented 8 years ago

It looks like the url you provided has blocked jamapi.xyz from accessing it. You'll have to deploy your own version of jamapi to heroku or something similar.

Here's the Python code I ran to confirm that jamapi is being blocked. The site returns an nginx error page, and the title comes back as 403 Forbidden.

import requests

json = '{"title": "title", "body": {"elem": "body", "html": "html"}}'
payload = {'url': 'http://earthquaketrack.com/us-ca-los-angeles/recent', 'json_data': json}

r = requests.post("http://www.jamapi.xyz", data=payload)
print(r.json())

ekt1701 commented 8 years ago

Thank you for looking into it. I'll have to find another way to get that data.

ekt1701 commented 8 years ago

Sorry to bother you again, but what is the correct syntax in the json for: <div class='post-body entry-content' itemprop='description articleBody'>

I have tried:

div[itemprop=description articleBody] gets: "error": "A provided CSS selector was not found on the provided "

div[itemprop='description articleBody'] gets: SyntaxError: invalid syntax

div[itemprop=\'description articleBody\'] gets: "error": "invalid JSON"

dinubs commented 8 years ago

Maybe try with double quotes, e.g. div[itemprop=\"description articleBody\"]. There should probably be a fix for this, but I'm not sure how long it'll take.

ekt1701 commented 8 years ago

Hmmm, when I tried that, I got the html for the home page: www.jamapi.xyz

EDIT: I'm getting that result even with code that worked before.

dinubs commented 8 years ago

Try changing http://www.jamapi.xyz to https://www.jamapi.xyz. There's an issue with today's update to SSL.

ekt1701 commented 8 years ago

Switching to https fixed the issue with the homepage.

However, div[itemprop=\"description articleBody\"] gets "error": "invalid JSON"

Here is the entire json:

'json_data': '{"title": "title","paragraphs": [{ "elem": "div[itemprop=\"description articleBody\"] a:first-of-type", "text": "text"}]}'}
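One likely cause of the "invalid JSON" error is on the Python side. Inside an ordinary Python string literal, \" is just an escape for a plain double quote, so the backslashes never reach the server and the inner quotes end the JSON string early. The sketch below reproduces that with the literal from the comment above and shows one way around it: build the selectors as a Python dictionary and let json.dumps do the escaping. The name is_valid_json is a hypothetical helper, and this only addresses the JSON error; whether the selector then matches anything on the page is a separate question.

```python
import json

# The literal from the thread. Python turns each \" into a bare ", so the
# resulting text contains unescaped quotes inside a JSON string.
hand_written = '{"title": "title","paragraphs": [{ "elem": "div[itemprop=\"description articleBody\"] a:first-of-type", "text": "text"}]}'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# Same selectors as a Python structure; json.dumps escapes the inner quotes.
selectors = {
    "title": "title",
    "paragraphs": [
        {
            "elem": 'div[itemprop="description articleBody"] a:first-of-type',
            "text": "text",
        }
    ],
}
json_data = json.dumps(selectors)

print(is_valid_json(hand_written))
print(is_valid_json(json_data))
print(json_data)
```

The resulting json_data string can be placed in the payload dictionary in place of the hand-written one.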

dinubs commented 8 years ago

What's the URL you're trying to get? Maybe I can take a look at it.

ekt1701 commented 8 years ago

http://doramaworld.blogspot.com/

I can get the title of each article with h3[itemprop=name], but not the body of the article.

I appreciate you taking a look.

EDIT: Is it possible to get the title and article in a single call?

dinubs commented 8 years ago

Currently you can't get the body and title in one object, but what you can do is set your json_data to be:

{
  "post_titles": [{"elem": ".post .post-title a", "link": "href", "name": "text"}],
  "post_bodies": [".post .post-body"]
}

This will return a post_titles array containing each post's title and permalink, and a post_bodies array containing all the post content.
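Since the titles and bodies come back as two separate arrays, they can be stitched together on the Python side. The following is a minimal sketch under stated assumptions: that the response is a dictionary with post_titles and post_bodies keys, that each title entry carries the name and link fields requested above, and that both arrays list the posts in the same order. The helper pair_posts and the sample response are made up for illustration and are not real output from the service.

```python
import json

# The selectors from the comment above, serialised for the json_data field.
selectors = {
    "post_titles": [{"elem": ".post .post-title a", "link": "href", "name": "text"}],
    "post_bodies": [".post .post-body"],
}
json_data = json.dumps(selectors)


def pair_posts(response):
    """Combine titles and bodies by position.

    Assumes both arrays are in the same order; zip stops at the shorter one,
    so any unmatched trailing entries are dropped.
    """
    titles = response.get("post_titles", [])
    bodies = response.get("post_bodies", [])
    return [
        {"name": title.get("name"), "link": title.get("link"), "body": body}
        for title, body in zip(titles, bodies)
    ]


# Hypothetical response, shaped the way the comment above describes it.
sample_response = {
    "post_titles": [
        {"name": "First post", "link": "http://example.com/first"},
        {"name": "Second post", "link": "http://example.com/second"},
    ],
    "post_bodies": ["Body of the first post", "Body of the second post"],
}

for post in pair_posts(sample_response):
    print(post["name"], post["link"], post["body"])
```

If the two arrays ever differ in length, for example because a post has no body element, positional pairing will misalign, so it is worth comparing their lengths before relying on the result.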