Hi,
This is something that will have a huge impact on the server. For small projects it won't be a problem, but for servers like Crowdcrafting (more than 2 million task runs) it will be too much. The current API already gives you simple pagination and keyset pagination. Why don't those options work for you?
We have those limits because we want to stop bad users from exploiting the API (e.g. robots or scripts pulling that endpoint all the time). That's why we cap the number of API results, just like any other web service (Twitter, for example, returns a max of 150 tweets per request). Thus, I would like to understand a bit more about your use case in order to see how we can work on this issue.
@teleyinex, I agree with your concern about fetching all task runs at once impacting the server. I understand that pagination is reasonable, provided it fetches the completed task runs in pages with no cap on the total. The documentation says the max limit is 100 and the default is 20. Is limit=100 per page? That is, if I have 450 task runs overall and set limit=100, will I get the complete data in 5 pages: the first 4 pages with 100 records each and the last one with 50?
Any sample code showing how to use the offset and limit parameters to fetch all completed task runs would be of great help. Thanks in advance.
@teleyinex, I have working Python code below that loops through all completed tasks and gets their task runs. Help me with your suggestions on enhancing it to apply pagination, so it is not restricted to the default 20 records. Thanks again.
import requests
import json

res = requests.get('http://mypybossa.com/api/task?state=completed')
tasks = res.json()
for task in tasks:
    taskId = task['id']
    taskrunUrl = 'http://mypybossa.com/api/taskrun?task_id=' + str(taskId)
    res2 = requests.get(taskrunUrl)
    if res2.status_code != 200:
        print "error obtaining taskruns for task: %d, reason: %r" % (taskId, res2.reason)
        continue
    taskruns = res2.json()
    for taskrun in taskruns:
        print json.dumps(taskrun['info'], indent=3)
        print "user id: %d" % taskrun['user_id']
        print "task created on: %s" % taskrun['created']
        print "task finished on: %s" % taskrun['finish_time']
Hi,
The code looks good, but I think you can do this much faster and more simply with our enki software. Basically, to download all completed tasks and their associated answers, you do the following:
>>> import enki
# setup the server connection
>>> e = enki.Enki(api_key='your-key', endpoint='http://server',
project_short_name='your-project-short-name')
# Get all completed tasks and their associated task runs
>>> e.get_all()
With those three lines of code you get all the completed tasks and their associated answers. Then, to check whatever you want, you can do the following:
for task in e.tasks:
    # Print the user_id of every task run submitted for this task
    print "user id: %d" % e.task_runs_df[task.id]['user_id']
While this is interesting, the best part is that you can do statistical analysis really quickly. For example:
# For example, for a given task of your project:
>>> task = e.tasks[0]
# Let's analyze it (note: if the answer is a simple string like 'Yes' or 'No'):
>>> e.task_runs_df[task.id]['info'].describe()
count 1
unique 1
top Yes
freq 1
dtype: object
# Otherwise, if the answer in info is a dict: info = {'x': 32, 'y': 24}
# Enki explodes the info field, using its keys (x, y) for new data frames:
>>> e.task_runs_df[task.id]['x'].describe()
count 100.000000
mean 265.640000
std 4.358945
min 235.000000
25% 264.000000
50% 266.000000
75% 268.000000
max 278.000000
dtype: float64
Therefore, imagine you have a project with the answers Yes, No and I don't know. 30 people should participate in each task. The project has 10 tasks, but only 1 has been completed. The following code will show you the top answer given by those 30 people for each completed task:
for task in e.tasks:
    desc = e.task_runs_df[task.id]['answer'].describe()
    print "The top answer for task.id %s is %s" % (task.id, desc['top'])
Printing something like this:
The top answer for task.id 256071 is No
The top answer for task.id 256072 is Yes
The top answer for task.id 256073 is Yes
The top answer for task.id 256074 is Yes
The top answer for task.id 256075 is Yes
The best part of Enki is that you can get all this done automagically. If you prefer to do it yourself, our documentation explains how you can use limits and pagination to get all the data by hand: http://docs.pybossa.com/en/latest/api.html. I copy and paste here an excerpt from the docs:
You can paginate the results of any GET query using the last ID of the domain object that you have received and the parameter: last_id. For example, to get the next 20 items after the last project ID that you’ve received you will write the query like this: GET /api/project?last_id={{last_id}}.
Therefore, if you want to get batches of 100 items, you can use requests as follows:
tasks = []
res = requests.get('server/api/task?limit=100')
data = json.loads(res.data)
while (len(data) != 0):
    res = requests.get('server/api/task?limit=100&last_id=' + str(data[-1]['id']))
    data = json.loads(res.data)
    # Add data to tasks
I hope this helps you. My recommendation: use Enki or our pbclient library. It'll make things simpler.
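In case you go the pbclient route, a rough sketch could look like the one below. Don't take the exact signatures as gospel, check the pybossa-client docs for your installed version (older versions use app_id instead of project_id as the first argument); the endpoint, key, and project id here are placeholders:
import pbclient

# Point the client at your server (placeholder endpoint and key)
pbclient.set('endpoint', 'http://mypybossa.com')
pbclient.set('api_key', 'your-key')

# Page through the completed tasks of project 1 in batches of 100
tasks = []
offset = 0
batch = pbclient.find_tasks(1, state='completed', limit=100, offset=offset)
while batch:
    tasks.extend(batch)
    offset += len(batch)
    batch = pbclient.find_tasks(1, state='completed', limit=100, offset=offset)

# Fetch the task runs of each completed task
for task in tasks:
    task_runs = pbclient.find_taskruns(1, task_id=task.id)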
@teleyinex, thanks for the details. Enki looks promising and I'll explore it. In the meantime, since I had RESTful code handy without pagination, I applied your suggested pagination code and observed 2 issues. Are these issues because I am running an older version of PyBossa than the one where .data and last_id were introduced? If yes, is upgrading to the latest version the only option for using pagination? The code I ran and the two issues are below.
res = requests.get('server/api/task?limit=100')
data = res.json()  # json.loads(res.data)
while (len(data) != 0):
    url = 'server/api/task?limit=100&last_id=' + str(data[-1]['id'])
    res = requests.get(url)
    data = res.json()  # json.loads(res.data)
1. No res.data attribute:
(Pdb) dir(res)
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
2. last_id exception:
(Pdb) data
{u'status': u'failed', u'target': u'task', u'exception_msg': u"type object 'Task' has no attribute 'last_id'", u'status_code': 415, u'exception_cls': u'AttributeError', u'action': u'GET'}
If you're using an old version of PyBossa, you can use the old pagination method: limit plus offset. Basically, increase the offset in each call:
limit = 100
offset = 0
url = 'server/api/task?offset=%s&limit=%s' % (offset, limit)
res = requests.get(url)
data = res.json()
while (len(data) != 0):
    offset = offset + limit
    url = 'server/api/task?offset=%s&limit=%s' % (offset, limit)
    res = requests.get(url)
    data = res.json()
We introduced keyset pagination because it's much faster: with offset the database has to scan and discard all the skipped rows on every call, while with last_id it can seek straight to the right rows using the index on id. We recommend you update your server to get not only this feature but also the new importers and security fixes.
All the best,
Daniel
Thanks Daniel; offset and limit worked with the older server version. I agree with you on upgrading to the latest server in the near future.
@teleyinex I would like to export all (with no limit) completed task runs from the PyBossa server, and I am looking for an option like an export API that can be used to perform this job. The input could be a project id; if no input is provided, export the completed task runs of all projects. I am looking for client being
This is to export everything, not limited to 20 records (the default) or the max of 100 records (using the limit parameter): export all the completed task runs that exist at the moment this API call is triggered.
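Meanwhile, here is the sketch I am using to export everything by hand, combining state=completed with the offset/limit pagination from above. The server URL and output file name are placeholders, and on older servers the project filter may be app_id rather than project_id:
import requests
import json

SERVER = 'http://mypybossa.com'  # placeholder server URL
LIMIT = 100  # the API's maximum page size

def get_all(endpoint, params=''):
    # Page through a GET endpoint with offset/limit until it is exhausted
    items = []
    offset = 0
    while True:
        url = '%s/api/%s?limit=%d&offset=%d%s' % (SERVER, endpoint, LIMIT, offset, params)
        batch = requests.get(url).json()
        if not batch:
            return items
        items.extend(batch)
        offset += len(batch)

# Export every task run of every completed task. To restrict the export
# to one project, add '&project_id=...' (or '&app_id=...' on older
# servers) to the task query below.
all_runs = []
for task in get_all('task', '&state=completed'):
    all_runs.extend(get_all('taskrun', '&task_id=%d' % task['id']))

with open('completed_task_runs.json', 'w') as f:
    json.dump(all_runs, f, indent=2)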