pybossa.js assumes stable order

jtbates commented 8 years ago

While the user is working on a task, pybossa.js loads the next task using the offset parameter. After the user completes the task and the task run is submitted, pybossa.js assumes that the task it retrieved previously at offset=1 is now at offset=0 and that newtask?offset=1 will return a different task. If Issue #1 is fixed but the scheduler still violates this assumption, there can be an error where the task presenter tells the user that they have completed all tasks while there is still one left.

Here is what could happen:

There are three candidate tasks that the user has not completed: x y z .
The task presenter retrieves task x at offset=0 and y at offset=1.
The user completes task x and the task run is submitted.
There are now two candidate tasks left: y z.
The user is presented the preloaded task y.
Another request is made to newtask?offset=1 and the response is again y.
pybossa.js recognizes that this is the same task that is now presented and requests newtask?offset=2. The response is empty as there are only two candidate tasks left.
The user completes task y. The next preloaded task is empty and the task presenter tells the user that they have completed all tasks leaving z unfinished.

therealmarv commented 8 years ago

Big thanks for the deep analysis. I will use your steps for test and validation once rework on the random scheduler is done. The random scheduler is currently not intended to use for production setups (it breaks PYBOSSA currently) and needs to be migrated to the current PYBOSSA.

georgeslabreche commented 8 years ago

For our Decode Darfur microtasking project, I implemented this random task fetching function in task_repository.py:

    def get_random_ongoing_task(self, project_id, user_id, user_ip):

        # If an authenticated user requests for a random task
        if user_id is not None:
            sql = text('''
                SELECT * FROM "task"
                LEFT JOIN "task_run"
                ON task_run.task_id = task.id
                WHERE task.project_id = :project_id
                AND task.state = 'ongoing'
                AND (task_run.task_id IS NULL OR (task_run.task_id IS NOT NULL AND (task_run.user_id != :user_id OR task_run.user_id IS NULL)))
                ORDER BY random() LIMIT 1;
            ''')

            task_row_proxy = self.db.session.execute(sql, dict(project_id=project_id, user_id=user_id)).fetchone()
            return Task(task_row_proxy)

        # If an anonymous user requests for a random task
        elif user_ip is not None:
            sql = text('''
                SELECT * FROM "task"
                LEFT JOIN "task_run"
                ON task_run.task_id = task.id
                WHERE task.project_id = :project_id
                AND task.state = 'ongoing'
                AND (task_run.task_id IS NULL OR (task_run.task_id IS NOT NULL AND (task_run.user_ip != :user_ip OR task_run.user_ip IS NULL)))
                ORDER BY random() LIMIT 1;
            ''')

            task_row_proxy = self.db.session.execute(sql, dict(project_id=project_id, user_ip=user_ip)).fetchone()
            return Task(task_row_proxy)

        # Normally, this is an unreachable return statement
        return None

The queries make sure that the retrieved random task fulfills the following three rules:

The task's status is ongoing.
The task does not have task runs, or...
... if it has task runs, then none of them are marked as authored by the requesting user's ID or IP.

Randomness is handled directly in the query rather than at the application level with random.choice().

teleyinex commented 8 years ago

GREAT. This scheduler was never meant to be used in a production environment. It is a template for developing new ones like yours @georgeslabreche

georgeslabreche commented 8 years ago

@teleyinex it certainly did help as a template to get started! Thank you!

jtbates commented 8 years ago

@georgeslabreche Is there some reason to prefer that the randomness be handled in the query rather than at the application level?

The suggested code does not solve the problem identified in this issue, which is that pybossa.js assumes that the task scheduler provides a stable ordering with respect to the offset in between task runs. I think this could be fixed by setting the random seed by user. That could be done at the query level with setseed or at the application level (as I did in my pull request). There still seems to be a bug there though, but I haven't had a chance to go back to review it.

teleyinex commented 8 years ago

The best thing is always doing it at the DB level, so that it's handled there, and the scheduler only gets a set of available tasks for the user. Then, PYBOSSA.JS will only take care of taking the current task or the next one for the user from that random set of tasks for the user.

I guess, that what you want is that you need that the randomness should be stable for a given user, so PYBOSSA.JS always uses correctly the offset, is that right?

Scifabric / random-scheduler

pybossa.js assumes stable order #2