Scifabric / pybossa

PYBOSSA is the ultimate crowdsourcing framework (aka microtasking) to analyze or enrich data that can't be processed by machines alone.
http://pybossa.com
GNU Affero General Public License v3.0

API to export all completed task runs #1260

Closed dchhabda closed 8 years ago

dchhabda commented 8 years ago

@teleyinex I would like to export all completed task runs (with no limit) from the pybossa server, and I am looking for something like an export API that can perform this job. The input could be a project id; if no input is provided, export completed task runs for all projects. I am looking for the client to be one of:

  1. simple curl command
  2. RESTful api client
  3. pbs client

This is to export everything, not limited to 20 records (the default) or the max of 100 records (using the limit parameter): export all completed task runs at the time this API call is triggered.

teleyinex commented 8 years ago

Hi,

This is something that will have a huge impact on the server. For small projects it won't be a problem, but for servers like Crowdcrafting this will be too much (more than 2 million task runs). The current API allows you simple pagination and keyset pagination. Why don't those options work for you?

If we have those limits, it is because we want to keep bad users from exploiting the API (i.e. robots or scripts pulling from that endpoint all the time). That's why we have a limitation on the API results (as does any other web service, e.g. Twitter returns a max of 150 tweets for users). Thus, I would like to understand a bit more about this in order to see how we can work on this issue.

dchhabda commented 8 years ago

@teleyinex, I agree with your concern that fetching all task runs at once would impact the server. I understand that pagination is reasonable, provided it gets the completed task runs in pages and there is no limit on the total. The documentation says the max limit is 100 and the default is 20. Is the limit of 100 per page? That is, if I have 450 task runs overall and limit=100 set, will I get the complete data in 5 pages: the first 4 pages with 100 records each and the last one with 50?

Any sample code for pagination would be of great help, showing how to use the offset and limit parameters to fetch all completed task runs. Thanks in advance.
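To make the question concrete, here is a minimal sketch of the offset/limit loop I have in mind. The server URL in the comment is a placeholder, and the page fetcher is factored out as a parameter so the loop itself can be demonstrated against in-memory data instead of a live server:

```python
def fetch_all(fetch_page, limit=100):
    """Keep requesting pages of `limit` items until an empty page comes back."""
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        if not page:
            break
        items.extend(page)
        offset += limit
    return items

# Against a real server, fetch_page would wrap requests.get, e.g.:
# def fetch_page(offset, limit):
#     url = 'http://mypybossa.com/api/taskrun?offset=%d&limit=%d' % (offset, limit)
#     return requests.get(url).json()

# Demo with 450 fake task runs: 4 full pages of 100 plus one page of 50.
fake_taskruns = [{'id': i} for i in range(450)]

def fake_page(offset, limit):
    return fake_taskruns[offset:offset + limit]

print(len(fetch_all(fake_page)))  # 450
```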

dchhabda commented 8 years ago

@teleyinex, I have working Python code below that loops through all completed tasks and gets their task runs. Help me with your suggestions on enhancing this to apply pagination and not be restricted to the default 20 records. Thanks again.

    import requests
    import json

    res = requests.get('http://mypybossa.com/api/task?state=completed')
    tasks = res.json()

    for task in tasks:
        taskId = task['id']
        taskrunUrl = 'http://mypybossa.com/api/taskrun?task_id=' + str(taskId)
        res2 = requests.get(taskrunUrl)
        if res2.status_code != 200:
            print "error obtaining taskruns for task: %d, reason: %r" % (taskId, res2.reason)
            continue

        taskruns = res2.json()
        for taskrun in taskruns:
            print json.dumps(taskrun['info'], indent=3)
            print "user id: %d" % taskrun['user_id']
            print "task created on: %s" % taskrun['created']
            print "task finished on: %s" % taskrun['finish_time']

teleyinex commented 8 years ago

Hi,

The code looks good, but I think you can do this much faster and more simply with our enki software. Basically, to download all completed tasks and their associated answers, you do the following:

    >>> import enki

    # setup the server connection
    >>> e = enki.Enki(api_key='your-key', endpoint='http://server',
                      project_short_name='your-project-short-name')
    # Get all completed tasks and its associated task runs
    >>> e.get_all()

With those three lines of code you get all the completed tasks and their associated answers. Then, to check what you want, you can do the following:

for task in e.tasks:
    # This will print the user_id for each task run of the task
    print "user id: %d" % e.task_runs_df[task.id]['user_id']

While this is interesting, the best part is that you can do statistical analysis really quickly. For example:

# For example, for a given task of your project:
    >>> task = e.tasks[0]
    # Let's analyze it (note: if the answer is a simple string like 'Yes' or 'No'):
    >>> e.task_runs_df[task.id]['info'].describe()
    count       1
    unique      1
    top       Yes
    freq        1
    dtype: object

    # Otherwise, if the answer in info is a dict: info = {'x': 32, 'y': 24}
    # Enki explodes the info field, using its keys (x, y) for new data frames:
    >>> e.task_runs_df[task.id]['x'].describe()
    count    100.000000
    mean     265.640000
    std        4.358945
    min      235.000000
    25%      264.000000
    50%      266.000000
    75%      268.000000
    max      278.000000
    dtype: float64

Therefore, imagine you have a project with the answers Yes, No and I don't know. 30 people should participate in each task. The project has 10 tasks, but only 1 has been completed. The following code will show you the top answer from those 30 people for the completed task:

for task in e.tasks:
    desc = e.task_runs_df[task.id]['answer'].describe()
    print "The top answer for task.id %s is %s" % (task.id, desc['top'])

Printing something like this:

The top answer for task.id 256071 is No
The top answer for task.id 256072 is Yes
The top answer for task.id 256073 is Yes
The top answer for task.id 256074 is Yes
The top answer for task.id 256075 is Yes
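The describe() summary is plain pandas under the hood, so the same idea can be tried on any Series of answers without enki or a server. A small self-contained illustration (the answer counts here are made up to match the Yes/No/I don't know scenario above):

```python
import pandas as pd

# 30 made-up answers for one completed task
answers = pd.Series(['Yes'] * 18 + ['No'] * 9 + ["I don't know"] * 3)

# describe() on a Series of strings reports count, unique, top and freq
desc = answers.describe()
print(desc['count'])   # 30  -> number of task runs
print(desc['unique'])  # 3   -> distinct answers
print(desc['top'])     # Yes -> most frequent answer
print(desc['freq'])    # 18  -> how often the top answer appears
```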

The best part of Enki is that you can get all this done automagically. If you prefer to do it yourself, then our documentation explains how you can use limits and pagination to get all the data by hand: http://docs.pybossa.com/en/latest/api.html. I copy and paste here an excerpt from the docs:

You can paginate the results of any GET query using the last ID of the domain object that you have received and the parameter: last_id. For example, to get the next 20 items after the last project ID that you’ve received you will write the query like this: GET /api/project?last_id={{last_id}}.

Therefore, if you want to get batches of 100 items, you can use requests as follows:

tasks = []
res = requests.get('server/api/task?limit=100')
data = json.loads(res.data)

while (len(data) != 0):
    res = requests.get('server/api/task?limit=100&last_id=' + str(data[-1]['id']))
    data = json.loads(res.data)
    # Add data to tasks

I hope this helps you. My recommendation: use Enki or our pbclient library. It'll make things simpler.

dchhabda commented 8 years ago

@teleyinex, thanks for the details. Enki looks promising and I'll explore it. On the other hand, since I had the RESTful code handy without pagination, I applied your suggested pagination code and observed 2 issues. Are these issues because I am running an older version of pybossa than the one where .data and last_id were introduced? If yes, is upgrading to the latest version the only option for using pagination? Issues:

  1. there is no member named "data" in my returned res object
  2. since there is no "data", I continued using data = res.json() and found that there is no last_id param either. Here is the modified code as you suggested:

    res = requests.get('server/api/task?limit=100')
    data = res.json()  # json.loads(res.data)

    while (len(data) != 0):
        url = 'server/api/task?limit=100&last_id=' + str(data[-1]['id'])
        res = requests.get(url)
        data = res.json()  # json.loads(res.data)

no res.data:

    (Pdb) dir(res)
    ['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']

last_id exception:

    (Pdb) data
    {u'status': u'failed', u'target': u'task', u'exception_msg': u"type object 'Task' has no attribute 'last_id'", u'status_code': 415, u'exception_cls': u'AttributeError', u'action': u'GET'}

teleyinex commented 8 years ago

If you're using an old version of PyBossa, then you can use the old pagination method: limit plus offset. Basically, increase the offset in each call:


limit = 100
offset = 0

url = 'server/api/task?offset=%s&limit=%s' % (offset, limit)
res = requests.get(url)

data = res.json()

while (len(data) != 0):
    offset = offset + limit
    url = 'server/api/task?offset=%s&limit=%s' % (offset, limit)
    res = requests.get(url)
    data = res.json()

We introduced keyset pagination because it's much faster. We recommend you update your server to get not only this feature but also the new importers and security fixes.
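For reference, once you upgrade, the keyset loop boils down to passing the last id you have seen back to the server, as in the docs excerpt earlier in this thread. A sketch with the page fetcher factored out so the loop can be demonstrated without a live server (the URL shape in the comment is the documented `limit` + `last_id` form):

```python
def fetch_all_keyset(fetch_page, limit=100):
    """Keyset pagination: request pages of items coming after the last id seen."""
    items = []
    last_id = None
    while True:
        page = fetch_page(last_id, limit)
        if not page:
            break
        items.extend(page)
        last_id = page[-1]['id']
    return items

# Against a real server, fetch_page would wrap requests.get, e.g.:
# def fetch_page(last_id, limit):
#     url = 'server/api/task?limit=%d' % limit
#     if last_id is not None:
#         url += '&last_id=%d' % last_id
#     return requests.get(url).json()

# Demo with 250 in-memory tasks whose ids are dense and 1-based,
# so the row with id i sits at index i - 1:
rows = [{'id': i} for i in range(1, 251)]

def fake_page(last_id, limit):
    start = 0 if last_id is None else last_id
    return rows[start:start + limit]

print(len(fetch_all_keyset(fake_page)))  # 250
```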

All the best,

Daniel

dchhabda commented 8 years ago

Thanks Daniel; offset and limit worked with the older server version. I agree with you about upgrading to the latest server in the near future.