UKHSA-Internal / coronavirus-dashboard-api-net-sdk

Coronavirus (COVID-19) in the UK - API Service SDK for .Net
https://coronavirus.data.gov.uk/
MIT License
12 stars 4 forks source link

Is there a way to force pagination? #8

Closed jchannon closed 4 years ago

jchannon commented 4 years ago

At the moment there seems to be only 1 page of data. Is there a way to force the page size to be small so more pages are retrievable. I just want to test the pagination logic in the SDK :)

xenatisch commented 4 years ago

The short answer is no, I have not provisioned the feature in the API code.

The long answer is that CosmosDB doesn't really like pagination, much less so in Python. I had to read their SDK source code (which believe you me, was a chore) to find an undocumented feature to somehow add pagination.

Basically, the idea of OFFSET-LIMIT is out of the window when it comes to CosmosDB. We have millions of records, and whilst it's fine to use OFFSET-LIMIT for the first couple of the pages, the latency increases exponentially when you get to page 4 or 5... even though I have partitioned the data by the releaseTimestamp metric.

That aside, I did eventually get pagination working by means best kept unsaid. However, whenever a query is made against the API micro-service, the response - only if successful - gets cached in Redis (a persistent 6GB instance) for a random period of time between 1800 and 10800 seconds (30 mins to 3 hours). This serves a number of purposes:

So, to answer the original question, making alterations to the pagination process would increase variation of responses, and by extension, the amount of memory needed to cache them. I'm open to new ideas though!

jchannon commented 4 years ago

Yeah I looked into Cosmos and their docs are somewhat surprising as expressed here https://twitter.com/jchannon/status/1293457263366152192

Seems their approach to paging is not really paging and just a continuation token.

And without a shift/change in infrastructure which I assume isnt an option this might not occur. However, I do have a friend who wrote/part of the team that wrote the Azure API Management feature so I could ask for some guidance at some point if interested.

On Tue, 11 Aug 2020 at 23:13, Pouria Hadjibagheri notifications@github.com wrote:

The short answer is no, I have not provisioned the feature in the API code.

The long answer is that CosmosDB doesn't really like pagination, much less so in Python. I had to read their SDK source code (which believe you me, was a chore) to find an undocumented feature to somehow add pagination.

Basically, the idea of OFFSET-LIMIT is out of the window when it comes to CosmosDB. We have millions of records, and whilst it's fine to use OFFSET-LIMIT for the first couple of the pages, the latency increases exponentially when you get to page 4 or 5... even though I have partitioned the data by the releaseTimestamp metric.

That aside, I did eventually get pagination working by means best kept unsaid. However, whenever a query is made against the API micro-service, the response - only if successful - gets cached in Redis (a persistent 6GB instance) for a random period of time between 1800 and 10800 seconds (30 mins to 3 hours). This serves a number of purposes:

  • Reduces the cost of our micro-service infrastructure (still ~8 r/s go through).
  • Reduces the number of calls to CosmosDB, which bills based on request units: basically the number of requests, complexity of the query, and the amount of data in the database... and it ain't cheap.
  • Reduces latency for the most common requests substantially (average API latency is about 12.5ms, and the API serves ~30 million requests a day from all over the world).

So, to answer the original question, making alterations to the pagination process would increase variation of responses, and by extension, the amount of memory needed to cache them. I'm open to new ideas though!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/publichealthengland/coronavirus-dashboard-api-net-sdk/issues/8#issuecomment-672312159, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAZVJVD2HUB3C7PWO5FGCDSAG7BFANCNFSM4P3QQC3A .

xenatisch commented 4 years ago

Yup... you hit one of the nails on the head. I am indeed using the continuation token... except, for all intents and purposes, the feature is not documented anywhere in their Python SDK.

Another part of the fun has been to get the datetime field comparisons working... basically, at one point, it just stopped returning any results via the SDK. The exact same query would produce results if you ran it in the web console, but not if you ran it via the SDK... So I looked in the code, and found that a lot of the aggregation in the query is never transmitted to the database. It's processed in the SDK!

It also turns out - and don't ask me how I diagnosed this - that if a timestamp ended with .xxxxx0Z, the match (string comparison) would fail and the service would just return empty results! So now I am adding an arbitrary 5 before the Z just as a workaround just to get that working (the issue was with the zero). Still can't quite make sense of this one... it was really bizarre.

Yet another major headache is that it doesn't support bulk inserts! So yea, I have to deploy 60 million lines (130k records) of data to the database... one record at a time!

The GROUPBY syntax also doesn't work properly and when it does, it costs a hand and a leg to run... uses over 15,000 request units per query - on indexed metrics!

It's a good service with lots of potentials, but they really need to address these major issues.

Thanks for the offer re your friend. I do have a few contacts in Microsoft, including the Cosmos team, but these issues are either well documented, or mainly concern the SDK... they have over-engineered their SDK needlessly and it's caused lots of issues in recent months. That goes for CosmosDB, Functions, and the Storage service.