Bulk export of tasks + task runs to CSV

Scifabric / pybossa

PYBOSSA is the ultimate crowdsourcing framework (aka microtasking) to analyze or enrich data that can't be processed by machines alone.

http://pybossa.com

GNU Affero General Public License v3.0

Bulk export of tasks + task runs to CSV #358

Closed rufuspollock closed 11 years ago

rufuspollock commented 11 years ago

Added to sprint 2 but as a suggestion - for discussion :-)

teleyinex commented 11 years ago

I was going to write today exactly the same proposal :-) I've an overall idea about how it will be, how to code it and how to merge it. Will we have a meeting, or could we discuss this here?

teleyinex commented 11 years ago

I'm going to work on this specific item today, and before submitting any code here is my design:

1.- All tasks will be exported to CSV using the keys of the Task object's info field as columns. If an item has nested data, the JSON will be dumped as it is in order to keep things as simple as possible (please, check this).

2.- The same will be done for TaskRuns: only the info field will be exported, and it will be treated as a flat object. Nested objects are a problem here (e.g. in ForestWatchers we save the data for tasks and taskruns as GeoJSON, so exporting it to CSV is really complicated and not useful for GIS applications, where you would expect GeoJSON, GeoRSS and/or KML).

3.- Include a third option where you can do the same, but exporting everything in the JSON format (we should add an option to import data also as JSON like for CSV).
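A minimal sketch of point 1, assuming each task is a plain dict with an `info` dict (the function name `tasks_to_csv` and the sample data are hypothetical, not PYBOSSA code). Top-level `info` keys become columns and nested values are dumped as JSON text:

```python
import csv
import io
import json


def tasks_to_csv(tasks):
    """Write one row per task, one column per top-level key of task['info']."""
    # Use the union of all info keys so tasks with different fields still fit.
    fieldnames = sorted({key for task in tasks for key in task["info"]})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for task in tasks:
        row = {}
        for key, value in task["info"].items():
            # Nested values are dumped as JSON text; flat values are kept as-is.
            if isinstance(value, (dict, list)):
                row[key] = json.dumps(value, sort_keys=True)
            else:
                row[key] = value
        writer.writerow(row)  # missing keys are written as empty cells
    return out.getvalue()


# Hypothetical sample data, not taken from a real PYBOSSA project.
sample_tasks = [
    {"id": 1, "info": {"url": "http://example.com/a.png", "question": "Tree?"}},
    {"id": 2, "info": {"url": "http://example.com/b.png", "geo": {"type": "Point"}}},
]
print(tasks_to_csv(sample_tasks))
```

Reading the file back with `csv.DictReader` and calling `json.loads` on the `geo` cell recovers the nested object, which is why dumping JSON into the cell keeps the export lossless.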

teleyinex commented 11 years ago

New comments:

It is not possible to export Tasks + TaskRuns in the same CSV file as CSV does not support multiple sheets, so there should be an option to export all Tasks, and another one for TaskRuns.

With JSON it is possible (all_data = {'tasks': app.tasks, 'task_runs': app.task_runs}), but in order to keep everything with the same structure, there should also be two options: (i) one for tasks and (ii) another one for task runs.
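To illustrate the two JSON layouts being compared, here is a small sketch using plain lists as hypothetical stand-ins for app.tasks and app.task_runs (the real objects would first need to be converted to dicts):

```python
import json

# Hypothetical in-memory stand-ins for app.tasks and app.task_runs.
tasks = [{"id": 1, "info": {"url": "http://example.com/a.png"}}]
task_runs = [{"id": 10, "task_id": 1, "info": {"answer": "yes"}}]

# Combined export: one JSON document holding both collections.
all_data = {"tasks": tasks, "task_runs": task_runs}
combined = json.dumps(all_data, indent=2)

# Separate exports: one JSON document per collection, mirroring the CSV options.
tasks_json = json.dumps(tasks)
task_runs_json = json.dumps(task_runs)
```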

rufuspollock commented 11 years ago

I think as a first instance exporting tasks to one csv and taskruns to another is OK. We can think later if we can merge.

Agree about taking task run info and expanding that (so each field becomes a column) and would also obviously add task_id and possibly pybossa_id
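As a rough sketch of that layout, assuming each task run is a dict with `id`, `task_id` and `info`, and assuming pybossa_id means the task run's own id (the function name and sample rows are hypothetical):

```python
import csv
import io


def task_runs_to_csv(task_runs):
    """One row per task run: id columns first, then one column per info key."""
    info_keys = sorted({key for run in task_runs for key in run["info"]})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["pybossa_id", "task_id"] + info_keys)
    for run in task_runs:
        # Runs that lack a given info key get an empty cell in that column.
        cells = [run["info"].get(key, "") for key in info_keys]
        writer.writerow([run["id"], run["task_id"]] + cells)
    return out.getvalue()


# Hypothetical sample data.
sample_runs = [
    {"id": 10, "task_id": 1, "info": {"answer": "yes"}},
    {"id": 11, "task_id": 1, "info": {"answer": "no", "comment": "blurry"}},
]
print(task_runs_to_csv(sample_runs))
```

With the sample rows this yields the header `pybossa_id,task_id,answer,comment` followed by one line per task run.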

Worth linking to an exemplar google docs spreadsheet with examples of what the sheets will look like ...

teleyinex commented 11 years ago

We cannot merge them, basically because the info fields could be very different, so the best approach, if we want to give users the option to download everything at once, is to create a zip file like GitHub does with repos.
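A possible sketch of the zip idea using only the standard library; the archive member names `tasks.csv` and `task_runs.csv` and the function name are assumptions made for illustration:

```python
import io
import zipfile


def build_export_zip(tasks_csv, task_runs_csv):
    """Bundle the two CSV exports into a single in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # writestr accepts str and stores it UTF-8 encoded.
        archive.writestr("tasks.csv", tasks_csv)
        archive.writestr("task_runs.csv", task_runs_csv)
    return buffer.getvalue()


# Hypothetical CSV contents, just to exercise the function.
payload = build_export_zip("url\nhttp://example.com/a.png\n", "task_id,answer\n1,yes\n")
```

The returned bytes could then be sent as a download without writing a temporary file to disk.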

I'll add the task_id and I think that you mean pybossa_id=app.id, right?

Regarding the examples: I'll do it, now exploring the problems :-D

teleyinex commented 11 years ago

I've found an example from the Flask lists. I include it here for documentation purposes:

import os

from flask import Flask, Response
from werkzeug.datastructures import Headers

app = Flask(__name__)

# File streamed to the client; the same name is used for size and disposition.
FILENAME = "test.zip"


@app.route('/')
def hello_world():
    def download():
        # Stream the file in 4 KiB chunks instead of loading it all into memory.
        with open(FILENAME, "rb") as fich:
            while True:
                data = fich.read(4096)
                if not data:
                    break
                yield data
    header = Headers()
    header.add("Content-Type", "application/x-download")
    header.add("Content-Length", str(os.path.getsize(FILENAME)))
    header.add("Content-Disposition", "attachment; filename=" + FILENAME)
    return Response(download(), headers=header, direct_passthrough=True)


if __name__ == "__main__":
    app.debug = True
    app.run()

rufuspollock commented 11 years ago

I'd strongly prefer a non-zip approach, so just having 2 separate files (just click 2 buttons). It would also be nice to put this in the API (e.g. /api/1/export/task/{app_id}?format=csv).
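A framework-free sketch of how such an endpoint could pick its output from the format parameter. The function `export_tasks`, the dict shape of a task and the chosen MIME types are assumptions; in a Flask view the format would typically come from `request.args.get("format", "csv")`:

```python
import csv
import io
import json


def export_tasks(tasks, fmt="csv"):
    """Return (body, mimetype) for the requested format; raise on unknown formats."""
    if fmt == "json":
        return json.dumps(tasks), "application/json"
    if fmt == "csv":
        # This sketch only handles flat info values.
        keys = sorted({key for task in tasks for key in task["info"]})
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(["task_id"] + keys)
        for task in tasks:
            writer.writerow([task["id"]] + [task["info"].get(k, "") for k in keys])
        return out.getvalue(), "text/csv"
    raise ValueError("unsupported format: %s" % fmt)


# Hypothetical sample data.
body, mimetype = export_tasks([{"id": 1, "info": {"url": "u"}}], "csv")
```

Keeping the format switch in one function means the two download buttons and the API route could share the same export code.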

pybossa_id = the id of the task_run in pybossa

I really think doing a mock-up of the export in a gdocs spreadsheet linked from here would be really useful to clarify what we expect ...