Open jtbates opened 8 years ago
Big thanks for the deep analysis. I will use your steps for test and validation once rework on the random scheduler is done. The random scheduler is currently not intended to use for production setups (it breaks PYBOSSA currently) and needs to be migrated to the current PYBOSSA.
For our Decode Darfur microtasking project, I implemented this random task fetching function in task_repository.py:
def get_random_ongoing_task(self, project_id, user_id, user_ip):
# If an authenticated user requests for a random task
if user_id is not None:
sql = text('''
SELECT * FROM "task"
LEFT JOIN "task_run"
ON task_run.task_id = task.id
WHERE task.project_id = :project_id
AND task.state = 'ongoing'
AND (task_run.task_id IS NULL OR (task_run.task_id IS NOT NULL AND (task_run.user_id != :user_id OR task_run.user_id IS NULL)))
ORDER BY random() LIMIT 1;
''')
task_row_proxy = self.db.session.execute(sql, dict(project_id=project_id, user_id=user_id)).fetchone()
return Task(task_row_proxy)
# If an anonymous user requests for a random task
elif user_ip is not None:
sql = text('''
SELECT * FROM "task"
LEFT JOIN "task_run"
ON task_run.task_id = task.id
WHERE task.project_id = :project_id
AND task.state = 'ongoing'
AND (task_run.task_id IS NULL OR (task_run.task_id IS NOT NULL AND (task_run.user_ip != :user_ip OR task_run.user_ip IS NULL)))
ORDER BY random() LIMIT 1;
''')
task_row_proxy = self.db.session.execute(sql, dict(project_id=project_id, user_ip=user_ip)).fetchone()
return Task(task_row_proxy)
# Normally, this is an unreachable return statement
return None
The queries make sure that the retrieved random task fulfills the following three rules:
Randomness is handled directly in the query rather than at the application level with random.choice().
GREAT. This scheduler was never meant to be used in a production environment. It is a template for developing new ones like yours @georgeslabreche
@teleyinex it certainly did help as a template to get started! Thank you!
@georgeslabreche Is there some reason to prefer that the randomness be handled in the query rather than at the application level?
The suggested code does not solve the problem identified in this issue, which is that pybossa.js assumes that the task scheduler provides a stable ordering with respect to the offset in between task runs. I think this could be fixed by setting the random seed by user. That could be done at the query level with setseed or at the application level (as I did in my pull request). There still seems to be a bug there though, but I haven't had a chance to go back to review it.
The best thing is always doing it at the DB level, so that it's handled there, and the scheduler only gets a set of available tasks for the user. Then, PYBOSSA.JS will only take care of taking the current task or the next one for the user from that random set of tasks for the user.
I guess, that what you want is that you need that the randomness should be stable for a given user, so PYBOSSA.JS always uses correctly the offset, is that right?
While the user is working on a task, pybossa.js loads the next task using the offset parameter. After the user completes the task and the task run is submitted, pybossa.js assumes that the task it retrieved previously at offset=1 is now at offset=0 and that newtask?offset=1 will return a different task. If Issue #1 is fixed but the scheduler still violates this assumption, there can be an error where the task presenter tells the user that they have completed all tasks while there is still one left.
Here is what could happen:
offset=0
and y atoffset=1
.newtask?offset=1
and the response is again y.newtask?offset=2
. The response is empty as there are only two candidate tasks left.