AtlasOfLivingAustralia / spatial-service

Spatial web services and layer administration console
https://spatial.ala.org.au/ws

Improvements to parallelism in layer loading by multiple people #107

Closed ansell closed 5 years ago

ansell commented 6 years ago

The layer loading task manager currently runs tasks in strict FIFO order.

Each load generates a large number of tasks. If many of them are added to the queue at one time, it can mean the end of layer loading for that day: the initial layer loading tasks for any new layer sit at the end of the queue until the intermediate and final tasks ahead of them are complete.

This leaves little opportunity for multiple people to load layers on the same day.

To get through the layer loading backlog, it would be very useful to have a priority queue in which FieldCreation and StandardizeLayers tasks take priority over TabulationCreate and TabulationCreateOne tasks when slaves are allocated new tasks. Layers could then be loaded and previewed more interactively while the Tabulation tasks complete asynchronously.
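As a rough illustration of the ordering proposed here, the following is a minimal sketch in plain Java, not the spatial-service code. The class name `PrioritisedTaskQueue`, the numeric priority values, and the default priority are all assumptions made for the example; only the four task type names come from this issue. Tasks with a lower number run first, and a sequence counter keeps FIFO order among tasks of equal priority.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: names and priority values are illustrative only.
class PrioritisedTaskQueue {
    // Assumed priorities: a lower number is handed out first.
    static final Map<String, Integer> PRIORITY = Map.of(
            "FieldCreation", 0,
            "StandardizeLayers", 0,
            "TabulationCreate", 10,
            "TabulationCreateOne", 10);
    static final int DEFAULT_PRIORITY = 5; // assumed value for unlisted task types

    static final class QueuedTask {
        final String type;
        final String layer;
        final int priority;
        final long seq; // submission order, used as a FIFO tie-breaker

        QueuedTask(String type, String layer, int priority, long seq) {
            this.type = type;
            this.layer = layer;
            this.priority = priority;
            this.seq = seq;
        }
    }

    private final AtomicLong nextSeq = new AtomicLong();
    private final PriorityBlockingQueue<QueuedTask> queue = new PriorityBlockingQueue<>(
            16,
            Comparator.comparingInt((QueuedTask t) -> t.priority)
                      .thenComparingLong(t -> t.seq));

    void submit(String type, String layer) {
        int priority = PRIORITY.getOrDefault(type, DEFAULT_PRIORITY);
        queue.add(new QueuedTask(type, layer, priority, nextSeq.getAndIncrement()));
    }

    // Returns the next task to allocate, or null if the queue is empty.
    QueuedTask next() {
        return queue.poll();
    }

    public static void main(String[] args) {
        PrioritisedTaskQueue q = new PrioritisedTaskQueue();
        // Tabulation tasks from an earlier load are queued first...
        q.submit("TabulationCreateOne", "layerA");
        q.submit("TabulationCreate", "layerA");
        // ...but the initial tasks of a later load are handed out before them.
        q.submit("FieldCreation", "layerB");
        q.submit("StandardizeLayers", "layerB");
        for (QueuedTask t = q.next(); t != null; t = q.next()) {
            System.out.println(t.type + " " + t.layer);
        }
    }
}
```

Note that this only changes which queued task is allocated next; it does nothing about tasks that are already running and holding a slot.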

Tasilee commented 6 years ago

Well-stated @ansell. Given the size of the (large) backlog, an investment in making it happen faster would be nice.

ansell commented 6 years ago

Looking into the code, it appears the infrastructure to support this may already be present:

https://github.com/AtlasOfLivingAustralia/spatial-service/blob/97b0becf52917fe083fecfe16e04917ed27664cb/grails-app/services/au/org/ala/spatial/service/MonitorService.groovy#L146-L152

However, the current configuration gives the tasks above the same priority:

https://github.com/AtlasOfLivingAustralia/spatial-service/blob/462c236126d8fedd554d608ecfaed4444d88e6db/src/main/resources/processes/limits.json#L15-L24

It would be great if this were just a configuration change at this point.

ansell commented 5 years ago

I loaded a new layer yesterday at about 2pm, and the TabulationCreateOne tasks it triggered as part of the loading process are showing possible performance issues (or bugs that are preventing them from progressing or completing):

[Screenshot, 2019-02-14 at 11:01:20 am]

There are 4 execution slots available for task processing in the current configuration on nectar-spatial-staging. Three of these are being used by tasks that started at 2:14pm AEDT yesterday, and one is being used by a task that started at 8:15pm AEDT yesterday. The single layer load has blocked all 4 execution slots, delaying further layer loads, and adding or tweaking prioritisation for newly run tasks as suggested above may not improve this.

There are 152 queued tasks, which may also include further expensive tasks:

[Screenshot, 2019-02-14 at 11:03:25 am]

In the past, some TabulationCreateOne tasks have taken weeks to successfully complete on nectar-spatial-staging.
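One way to keep a single load from occupying every execution slot, as happened with the 4 slots described above, would be a per-type concurrency cap on top of the queue. The sketch below is plain Java and purely illustrative: `SlotLimitedScheduler`, its methods, and the cap of 2 for TabulationCreateOne are assumptions for the example, not the spatial-service implementation or its configuration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: caps how many slots one task type may hold at once.
class SlotLimitedScheduler {
    private final int totalSlots;
    private final Map<String, Integer> maxPerType;              // assumed per-type caps
    private final Map<String, Integer> running = new HashMap<>();
    private final List<String> pending = new ArrayList<>();     // FIFO list of task types
    private int busy = 0;

    SlotLimitedScheduler(int totalSlots, Map<String, Integer> maxPerType) {
        this.totalSlots = totalSlots;
        this.maxPerType = maxPerType;
    }

    void submit(String type) {
        pending.add(type);
    }

    // Starts the oldest pending task whose type is below its cap.
    // Returns the started type, or null if no slot is free or every pending type is capped.
    String startNext() {
        if (busy >= totalSlots) {
            return null;
        }
        Iterator<String> it = pending.iterator();
        while (it.hasNext()) {
            String type = it.next();
            int cap = maxPerType.getOrDefault(type, totalSlots);
            int active = running.getOrDefault(type, 0);
            if (active < cap) {
                it.remove();
                running.put(type, active + 1);
                busy++;
                return type;
            }
        }
        return null;
    }

    // Releases the slot held by a finished (or cancelled) task of the given type.
    void finish(String type) {
        running.merge(type, -1, Integer::sum);
        busy--;
    }

    public static void main(String[] args) {
        // 4 slots in total; at most 2 may run TabulationCreateOne (assumed cap).
        SlotLimitedScheduler s = new SlotLimitedScheduler(4, Map.of("TabulationCreateOne", 2));
        for (int i = 0; i < 5; i++) {
            s.submit("TabulationCreateOne");
        }
        s.submit("FieldCreation");
        for (int i = 0; i < 4; i++) {
            System.out.println(s.startNext());
        }
    }
}
```

With a cap like this, long-running tabulation tasks can still take weeks, but at least some slots stay free for the initial tasks of other layer loads.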

ansell commented 5 years ago

The 4 running tasks in the comment above are still in that state and the other 152 queued tasks are still queued.

ansell commented 5 years ago

The workaround for both the slowness and the bugs in Cross Tabulation is to disable it for all new layers until further notice.

After disabling Cross Tabulation in the field definition, and cancelling the queued and running tasks, the nectar-spatial-server needs to be restarted to complete the workaround.

adam-collins commented 5 years ago

Queues modified to prevent some tasks (cross tabulation) holding up other tasks.