OpenConceptLab / ocl_issues


Implement bulk import queues #253

Closed paynejd closed 4 years ago

paynejd commented 4 years ago

Sometimes bulk imports are submitted that have dependencies, meaning that the first one must finish processing before another begins. How can we prevent a specific bulk import from starting until another specific import has completed processing?

For example, could I submit a bulk import with a specific key, such as PEPFAR-MOH-Malawi-FY19? Then, if I submit a second (or third, or fourth) bulk import with the same key, it would not begin processing until the previous bulk import in the queue has completed.

We also want to be able to query the bulk import system to ask -- "Is there a bulk import already being processed or in the queue that matches this key?" (this should be scoped to the same user)

Does Celery provide any support for this, or would we need to build something custom (or a hybrid)?

paynejd commented 4 years ago

From @rkorytkowski:

There are a number of ways Celery can help us solve the problem of queuing tasks so that they wait for results from other tasks.

One is chaining/grouping (https://docs.celeryproject.org/en/stable/reference/celery.html#celery.chain), but we need to know all tasks to be executed up front, before starting the first one.
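
For illustration, a minimal sketch of chaining, assuming a hypothetical bulk_import task and broker URL:

from celery import Celery, chain

# Broker URL and the task body are assumptions for this sketch.
app = Celery('ocl', broker='redis://localhost:6379/0')

@app.task
def bulk_import(to_import, username):
    # placeholder for the real bulk import logic
    return len(to_import)

# chain() starts the second task only after the first has completed.
# .si() makes each signature immutable, so the first task's result is
# not passed into the second. Note that the whole chain must be
# declared up front, before any part of it starts.
workflow = chain(
    bulk_import.si(['concept-1', 'concept-2'], 'datim-admin'),
    bulk_import.si(['concept-3'], 'datim-admin'),
)
workflow.apply_async()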

Another is dynamically creating queues that each have just one worker. The feature is called automatic routing, together with dynamic creation of queues: https://docs.celeryproject.org/en/latest/userguide/routing.html#routing-automatic. We could route all tasks that need to be executed one after another to a new queue defined by some unique key. Having just one worker for such a queue would guarantee that they are executed one after another. The question is what our unique key should be. Do we want to have a separate queue per user, or per some unique key passed with the bulk import task?
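
A minimal sketch of that approach, reusing the hypothetical bulk_import task from above; with task_create_missing_queues enabled (Celery's default), sending to a queue name that does not exist yet creates it on the fly:

from celery import Celery

# Broker URL is an assumption for this sketch.
app = Celery('ocl', broker='redis://localhost:6379/0')

@app.task
def bulk_import(to_import, username):
    # placeholder for the real bulk import logic
    return len(to_import)

# Route each submission to a queue named after its import key. All tasks
# sharing a key land in the same queue and run strictly in order once a
# single-task worker consumes that queue.
import_key = 'PEPFAR-MOH-Malawi-FY19'
payload = [{'type': 'Concept', 'id': 'A15.0'}]  # hypothetical import rows
bulk_import.apply_async(args=(payload, 'datim-admin'),
                        queue='bulk_import_%s' % import_key)

A dedicated worker processing one task at a time from that queue would then guarantee serial execution, e.g. celery -A ocl worker -Q bulk_import_PEPFAR-MOH-Malawi-FY19 --concurrency=1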

paynejd commented 4 years ago

Based on our discussion today, the second option that Rafal identified (dynamic creation of queues, each with just one worker) would meet our requirements for queuing. It would also provide a way for us to expose an API endpoint that a client can query to check whether an import is underway or queued for a particular bulk import key.

Acceptance Criteria:

@rkorytkowski @harpatel1 Could you review this? And could you suggest API endpoints for the new requests?

rkorytkowski commented 4 years ago

All sounds good. Proposed endpoints:

paynejd commented 4 years ago

A few follow-up questions for @rkorytkowski @harpatel1:

harpatel1 commented 4 years ago

This looks good.

rkorytkowski commented 4 years ago

@paynejd, deployed to QA. You might want to play with it. There are currently 6 workers for processing bulk imports, including one that is reserved for root. Each worker processes one task at a time from a dedicated queue. If no queue is specified, a task is added to a random queue.
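
A rough sketch of exercising this from a client, assuming token auth and using the /manage/bulkimport endpoints that appear later in this thread (the token and import.json file are placeholders):

import requests

API = 'https://api.staging.openconceptlab.org'
HEADERS = {'Authorization': 'Token <api-token>'}  # placeholder token

# Queue a bulk import on a named queue (the queue name is part of the URL):
with open('import.json') as f:
    resp = requests.post('%s/manage/bulkimport/DATIM-MOH-BI-FY19/' % API,
                         data=f.read(), headers=HEADERS)
print(resp.status_code, resp.text)

# List queued and processed imports across queues:
print(requests.get('%s/manage/bulkimport/' % API, headers=HEADERS).json())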

paynejd commented 4 years ago

@rkorytkowski This is looking good. A few questions:

While we're thinking about it -- what would be involved in adding a progress indicator for imports marked as "STARTED"? Not needed now, but probably worth defining.

rkorytkowski commented 4 years ago
  1. Yes, you can do it via Flower.
  2. and 3. I'll add those fields.
  4. Progress should be exposed by OclFlexImporter, and then it's trivial to add that to the response.
paynejd commented 4 years ago

Great, let me know when the new fields are deployed and we can chat on Slack about how to sign into Flower.

Let's discuss exposing the progress via OclFlexImporter on tomorrow's dev call. Thank you!

paynejd commented 4 years ago

For import progress, OclFlexImporter has an instance variable named import_results. You can get the current progress and total lines like this:

from ocldev.oclfleximporter import OclFlexImporter

importer = OclFlexImporter()
# ... configure the importer and start processing here ...

# Current progress: lines processed so far out of total lines
print("%s of %s" % (
    importer.import_results.count,
    importer.import_results.total_lines))
rkorytkowski commented 4 years ago

A small change is needed in ocldev to report progress. See https://github.com/OpenConceptLab/ocldev/commit/717f2b46b8c1d4cacc56b05f15a551bbfa177304

Let's do that in a separate issue.

I'll go ahead and deploy this one to staging.

paynejd commented 4 years ago

Looks like there is a bug in how the queue key and the username are parsed. See below: the username should be datim-admin and the queue should be default.

GET https://api.staging.openconceptlab.org/manage/bulkimport/
[
    {
        "queue": "admin-default",
        "username": "datim",
        "state": "SUCCESS",
        "task": "aee2dcd4-8416-48f2-8e6d-54d3ed31a3e0-datim-admin-default"
    },
    {
        "queue": "admin-default",
        "username": "datim",
        "state": "SUCCESS",
        "task": "4cde7fb0-33a3-476d-b36a-b12e168e764e-datim-admin-default"
    },
    {
        "queue": "admin-default",
        "username": "datim",
        "state": "SUCCESS",
        "task": "2605f513-617b-4e53-9f38-53f4a25a3bd0-datim-admin-default"
    }
]
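
The likely culprit: the task id concatenates "<uuid>-<username>-<queue>" with hyphens, and usernames like datim-admin contain hyphens themselves, so a naive split grabs the wrong pieces. A sketch of an unambiguous parse (illustrative only, not necessarily the actual fix), assuming queue names contain no hyphens:

def parse_task_id(task_id):
    # A uuid4 string is always 36 characters, so skip it plus the
    # joining hyphen, then split the remainder from the right so a
    # hyphenated username stays intact.
    remainder = task_id[37:]
    username, queue = remainder.rsplit('-', 1)
    return username, queue

print(parse_task_id('aee2dcd4-8416-48f2-8e6d-54d3ed31a3e0-datim-admin-default'))
# ('datim-admin', 'default')
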
maurya commented 4 years ago

Currently, two concurrent import requests lead to an error. The first import succeeds:

{
  "status": "Success",
  "imap_import_status_url": "https://test.ohie.datim.org:5000/ocl-imap/BI/status/2976905a-dd0e-4f91-bff3-832e4e6b2451-datim-admin-DATIM-MOH-BI-FY19/",
  "ocl_bulk_import_task_id": "2976905a-dd0e-4f91-bff3-832e4e6b2451-datim-admin-DATIM-MOH-BI-FY19",
  "country_org": "DATIM-MOH-BI-FY19",
  "period": "FY19",
  "ocl_bulk_import_status_url": "https://api.staging.openconceptlab.org/manage/bulkimport?task=2976905a-dd0e-4f91-bff3-832e4e6b2451-datim-admin-DATIM-MOH-BI-FY19",
  "country_code": "BI",
  "imap_export_url": "https://test.ohie.datim.org:5000/ocl-imap/BI/FY19/",
  "country_name": "Burundi",
  "message": "IMAP successfully queued for bulk import into OCL. Request IMAP export after bulk import is processed or request import status."
}

but a second import request made right after it results in a 500 error:

{
    "status": "Error",
    "country_org": "DATIM-MOH-BI-FY19",
    "period": "FY19",
    "country_code": "BI",
    "country_name": "Burundi",
    "message": "500 Server Error: INTERNAL SERVER ERROR for url: https://api.staging.openconceptlab.org/manage/bulkimport/DATIM-MOH-BI-FY19/"
}

The extra information is created by Python scripts in https://github.com/OpenConceptLab/ocl_datim, but the requirement is that we receive a 409 response from OCL when a second concurrent request is sent while the same country code/period is being processed.

harpatel1 commented 4 years ago

Stack trace on server:

INFO 2020/07/31 13:55:31 [middlewares.py:22 process_request()] datim-admin 172.19.0.1 POST /manage/bulkimport/DATIM-MOH-BI-FY19/
ERROR 2020/07/31 13:55:31 [base.py:210 handle_uncaught_exception()] Internal Server Error: /manage/bulkimport/DATIM-MOH-BI-FY19/
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 113, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python2.7/site-packages/django/views/generic/base.py", line 68, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/rest_framework/views.py", line 327, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python2.7/site-packages/rest_framework/views.py", line 324, in dispatch
    response = handler(request, *args, **kwargs)
  File "/code/manage/views.py", line 132, in post
    task = queue_bulk_import(request.body, import_queue, username, update_if_exists)
  File "/code/tasks.py", line 78, in queue_bulk_import
    task = bulk_import.apply_async((to_import, username, update_if_exists), task_id=task_id, queue=queue_id)
  File "/usr/local/lib/python2.7/site-packages/celery_once/tasks.py", line 85, in apply_async
    raise e
AlreadyQueued
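
The trace shows celery_once raising AlreadyQueued out of apply_async, which bubbles up as a 500. One plausible shape for the fix (illustrative only; not necessarily what the actual commit does) is to catch it in the view and translate it to a 409; queue_bulk_import and import_queue mirror the stack trace above, everything else is assumed:

from celery_once import AlreadyQueued
from rest_framework import status
from rest_framework.response import Response

def post(self, request, import_queue=None):
    username = request.user.username
    try:
        task = queue_bulk_import(request.body, import_queue, username,
                                 update_if_exists=True)
    except AlreadyQueued:
        # celery_once raises AlreadyQueued when an identical task is
        # already pending; report a conflict instead of crashing.
        return Response(status=status.HTTP_409_CONFLICT)
    return Response({'task': task.id}, status=status.HTTP_202_ACCEPTED)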

rkorytkowski commented 4 years ago

Thanks for testing. Addressed both issues in https://github.com/OpenConceptLab/oclapi/commit/9a1e602ca0dd48dc01434582fc6730474d9f23d1.

rkorytkowski commented 4 years ago

It's been deployed to all envs now.

paynejd commented 4 years ago

Testing in the staging environment is looking great! One final request: could we add a timestamp for when the bulk import request was posted?