May not fully get there with the documentation today, but thought it was worth getting eyes on this. I've added a stripped-down log, in reverse order (most recent at the top), of trying to work out what configuration should be set.
TODO:
- [x] Configure orthanc settings to be set using build args
- [x] Document configuration in READMEs
- [x] Fix imaging api tests
- [x] Fix system tests
## Summary of learnings and changes
I think it was overloading orthanc raw with pending jobs that caused it to freeze, which in turn caused the imaging api to block and time out on the rabbitmq heartbeat. We now:
- Return early if a job has failed in orthanc raw, rather than waiting for the timeout (see the sketch below)
- Use the async REST API so that we block less, and can asynchronously resend images that already exist but haven't been exported
- Use loguru for logging, mostly because it gives nice defaults, is easy to configure, and we don't have to mess around with it as much. It also allows f-string format style. We can roll this out to the other services gradually.
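A minimal sketch of what the early return looks like, assuming an aiohttp session against the orthanc raw REST API; the base URL, helper name and timeout value are illustrative, not the actual imaging api code.

```python
import asyncio

import aiohttp
from loguru import logger

ORTHANC_RAW_URL = "http://orthanc-raw:8042"  # placeholder base URL, not from this PR


async def wait_for_job(session: aiohttp.ClientSession, job_id: str, timeout: float = 120) -> bool:
    """Poll an Orthanc job, returning early as soon as it reports Failure."""
    deadline = asyncio.get_running_loop().time() + timeout
    state = "Pending"
    while asyncio.get_running_loop().time() < deadline:
        async with session.get(f"{ORTHANC_RAW_URL}/jobs/{job_id}") as response:
            response.raise_for_status()
            state = (await response.json())["State"]
        if state == "Success":
            return True
        if state == "Failure":
            # Early return instead of blocking until the transfer times out
            logger.error(f"Job {job_id} failed in orthanc raw")
            return False
        await asyncio.sleep(1)
    logger.warning(f"Job {job_id} still {state} after {timeout}s")
    return False
```

Authentication and the resend logic for already-ingested images are left out.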
## Configuration changes
Will update as I go along. When the rate drops, it's because of timeouts waiting for the VNA to complete the transfer job within 2 minutes.
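For reference, the shorthand used in the entries below maps onto Orthanc configuration options roughly as follows. This is a sketch using the values from the most recent entry; the key names are my reading of the Orthanc configuration reference rather than something verified in this PR.

```python
# Orthanc options corresponding to the log shorthand (values from the most recent
# entry below); key names are assumed from the Orthanc configuration reference.
orthanc_overrides = {
    "JobsHistorySize": 100,  # "max job history": completed jobs kept queryable via /jobs
    "ConcurrentJobs": 25,    # "concurrent jobs"
    "DicomThreadsCount": 5,  # "dicom threads"
    "HttpThreadsCount": 50,  # "http threads"
}

# "messages in flight" and "m/s from queue" are consumer-side settings
# (the RabbitMQ prefetch count and the token bucket rate), not Orthanc options.
```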
after async rest
100 max job history
25 concurrent jobs
5 dicom threads (default)
50 http threads (default)
50 messages in flight
3 m/s from queue
Didn't really change much, except that it's VNA-load based; ~25 messages were being nacked and requeued, which makes sense at least.
Tried after 7pm and not much of a change
after async rest
100 max job history
50 concurrent jobs
5 dicom threads (default)
50 http threads (default)
100 messages in flight
3 m/s from queue
Doesn't seem to help, though it was requeueing 50 messages a second so there's more capacity. Maybe try fewer concurrent jobs and messages in flight to see if that gets us back to ~2 messages per second being confirmed; otherwise it may just be the rest of the load on the VNA.
after async rest
100 max job history
50 concurrent jobs
5 dicom threads (default)
50 http threads (default)
50 messages in flight
3 m/s from queue
Didn't really give much of an increase; we rarely hit rate limiting by the token bucket, so maybe try increasing max in flight?
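To make the "messages in flight" and "m/s from queue" knobs concrete, here is a minimal consumer sketch, assuming aio-pika for RabbitMQ and a hand-rolled token bucket; the queue name, URL handling and values are placeholders rather than the imaging api's actual consumer.

```python
import asyncio
import time

import aio_pika


class TokenBucket:
    """Limits how many messages per second we start processing ("m/s from queue")."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep((1 - self.tokens) / self.rate)


async def consume(amqp_url: str, queue_name: str) -> None:
    bucket = TokenBucket(rate=3, capacity=5)  # "3 m/s from queue"
    connection = await aio_pika.connect_robust(amqp_url)
    async with connection:
        channel = await connection.channel()
        # "messages in flight": how many unacked messages RabbitMQ will deliver to us at once
        await channel.set_qos(prefetch_count=50)
        queue = await channel.declare_queue(queue_name, durable=True)
        async with queue.iterator() as messages:
            async for message in messages:
                await bucket.acquire()
                try:
                    ...  # hand the study off to orthanc raw here
                    await message.ack()
                except Exception:
                    # nacked messages are requeued, which is what shows up in these log entries
                    await message.nack(requeue=True)
```

Increasing prefetch_count lets more messages sit in flight, while the bucket caps how fast we actually start new transfers.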
after async rest
100 max job history
30 concurrent jobs
5 dicom threads (default)
50 http threads (default)
50 messages in flight
3 m/s from queue
Increasing the number of concurrent jobs may have increased the rate? Will try 50 to check.
after async rest
100 max job history
10 concurrent jobs
5 dicom threads (default)
100 http threads
100 messages in flight
3 m/s from queue
Increasing max job history stopped the errors about jobs not being found. Pending messages in orthanc are slowing down processing, so increase the number of concurrent jobs to 30 next time.
after async rest
10 max job history (default)
10 concurrent jobs
5 dicom threads (default)
100 http threads
100 messages in flight
3 m/s from queue
Found errors where the job no longer existed in orthanc, so increase max job history so that the job still exists after success.
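This is the failure mode sketched below: with the default history size a finished job can be evicted before we query it, and Orthanc answers 404 on `/jobs/{id}`. A hypothetical helper, assuming the same aiohttp session as the earlier sketch.

```python
import aiohttp


async def job_state(session: aiohttp.ClientSession, orthanc_url: str, job_id: str) -> str | None:
    """Return the job's State, or None if it has already dropped out of the job history."""
    async with session.get(f"{orthanc_url}/jobs/{job_id}") as response:
        if response.status == 404:
            # With max job history at the default of 10, completed jobs are evicted
            # quickly; this is where the "job doesn't exist anymore" errors came from.
            return None
        response.raise_for_status()
        return (await response.json())["State"]
```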
after async rest
10 max job history (default)
50 concurrent jobs
50 dicom threads
200 http threads
200 messages in flight
3 m/s from queue
Not a massive increase, will drop number of threads and concurrent jobs
before async rest
10 max job history (default)
10 concurrent jobs
10 dicom threads
100 http threads
5 messages in flight
rate 0.6 then 1 message/second
Initial rate of processing. Discovered that it was hanging a fair bit on REST calls, so made them async using aiohttp.