Closed Robinlovelace closed 1 year ago
I've been investigating this for most of the afternoon and haven't yet pinpointed the problem, though I've got closer. Resolving this issue is the top priority.
I thought earlier it was a 16MB storage limit issue, but I don't think it is, as we are actually using LONGTEXT storage (which allows up to 4GB). We are most likely hitting a processing limit somewhere. I suspect the GeoJSON unpacking phase is returning an unhandled error because it creates a high memory requirement.
I'm afraid these kinds of limit-hitting scalability scenarios are very hard to predict in advance, especially in asynchronous code. The kinds of files we have seen used so far formed the basis of testing the API functionality, and hence of it being pushed as a stable release. But files over 16MB (which are not unreasonable) are clearly triggering some limit that we have not previously encountered.
Thanks Martin. As @mem48 says, there are ways to reduce the size of the files we're uploading, so we can look at mitigating this on our end too.
The API now reports an exceeded submission size, e.g.
The submitted geometry is too large. The maximum permitted size is 64MB. The data you submitted is 66.7MB.
as now documented at: https://www.cyclestreets.net/api/v2/batchroutes.createjob/
I've also fixed a scenario where an oversized file was causing a crash that produced no response at all, due to an error logging bug.
The limit is now documented as 64MB.
However there may still be another problem causing a failure.
However, when I go from 30 to 40k lines it fails again.
I have submitted a 17MB file as a test and that has worked. The initial submission of the job succeeds, and the routing threads return results correctly. So if your data is >16MB and <64MB we need further help to reproduce the issue.
If you are still seeing a failure, can you provide us with a failure case GeoJSON file please?
Also, can you give us a clearer definition of the stage at which this is failing? Is this at the initial submission stage, the route planning phase, or the final result?
Sure. See reproducible example here: https://github.com/cyclestreets/cyclestreets-r/blob/master/data-raw/test-batch-large-upload.R
If you don't have R installed I can create and send cases that are failing. Will check the failing cases again.
Heads-up @mvl22 it seems that the previously failing example with 35k lines now works. Can you try reproducing the following?
library(cyclestreets)
library(tidyverse)
# library(cyclestreets)
# devtools::load_all()
# od = readRDS("~/nptscot/npt/inputdata/od_commute_jittered.Rds")
od_raw = pct::get_od()
lsoas = pct::get_pct(layer = "z", national = TRUE, geography = "msoa")
od = od_raw %>%
  slice(seq(110000))
# summary(od$geo_code1 %in% lsoas)
od = od::od_to_sf(x = od, z = lsoas)
od$id = seq(nrow(od))
nrow(od) # 19k
od_100 = od %>%
  slice(seq(100))
od_10k = od %>%
  slice(seq(10000))
od_15k = od %>%
  slice(seq(15000))
od_30k = od %>%
  slice(seq(30000))
od_35k = od %>%
  slice(seq(35000))
od_40k = od %>%
  slice(seq(40000))
od_50k = od %>%
  slice(seq(50000))
od_100k = od %>%
  slice(seq(100000))
sf::write_sf(od_10k, "od_10k.geojson")
sf::write_sf(od_15k, "od_15k.geojson")
sf::write_sf(od_30k, "od_30k.geojson")
fs::file_size("od_30k.geojson")
batch(desire_lines = od_100, wait = FALSE, silent = FALSE, username = "robinlovelace")
batch(desire_lines = od_10k, wait = FALSE, silent = FALSE, username = "robinlovelace")
batch(desire_lines = od_30k, wait = FALSE, silent = FALSE, username = "robinlovelace")
batch(desire_lines = od_35k, wait = FALSE, silent = FALSE, username = "robinlovelace")
batch(desire_lines = od_40k, wait = FALSE, silent = FALSE, username = "robinlovelace")
batch(desire_lines = od_50k, wait = FALSE, silent = FALSE, username = "robinlovelace")
batch(desire_lines = od_100k, wait = FALSE, silent = FALSE, username = "robinlovelace")
system.time({
  desire_lines = batch(id = 3957)
})
Next step: try with 50k, then 100k, then 150k lines.
The 100k example fails:
batch(desire_lines = od_100k, wait = FALSE, silent = FALSE, username = "robinlovelace")
POSTing the request to create and start the job
Posting to: https://api.cyclestreets.net/v2/batchroutes.createjob?key=xxx
Job id:
Error in batch(desire_lines = od_100k, wait = FALSE, silent = FALSE, username = "robinlovelace") :
Check your credentials, try again, and maybe contact CycleStreets
The file is less than 64 MB:
> sf::write_sf(od_100k, "od_100k.geojson")
> fs::file_size("od_100k.geojson")
51.6M
Next step: try with smaller example containing same desire lines as suggested by @mem48.
Please put the od_100k.geojson file somewhere so I can download it.
On the case. Doing this in a reproducible way with the following command:
system("gh release upload v0.6.0 od_100k.geojson")
Just need to figure out how to get the URL of that uploaded file now...
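One way to get it programmatically (a sketch only; it assumes the installed gh CLI version supports --json on gh release view):

```r
# List the browser download URLs of assets attached to the v0.6.0 release
system("gh release view v0.6.0 --json assets --jq '.assets[].url'")
```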
See releases here: https://github.com/cyclestreets/cyclestreets-r/releases/tag/v0.6.0
The specific file can be found here: https://github.com/cyclestreets/cyclestreets-r/releases/download/v0.6.0/od_100k.geojson
Update on this: I tried with the following file but it still failed. This makes me suspect it's due to the number of OD pairs, not the file size. Reproducible example building on the code above:
od_100k_minimal = od_100k %>%
  select(geo_code1, geo_code2)
sf::write_sf(od_100k_minimal, "od_100k_minimal.geojson")
fs::file_size("od_100k_minimal.geojson") # 22 MB
batch(desire_lines = od_100k_minimal, wait = FALSE, silent = FALSE, username = "robinlovelace")
Heads-up @mvl22: another small file has just been uploaded and should now be in the releases:
od_100k_minimal = od_100k %>%
  transmute(id = seq(n()))
sf::write_sf(od_100k_minimal, "od_100k_minimal.geojson", delete_dsn = TRUE)
fs::file_size("od_100k_minimal.geojson") # 22 MB
batch(desire_lines = od_100k_minimal, wait = FALSE, silent = FALSE, username = "robinlovelace")
system("gh release upload --clobber v0.6.0 od_100k_minimal.geojson")
Good news, and an update that should help you diagnose the bug on your side @mvl22: this confirms it is a file size issue, not a number-of-lines issue. 50k lines works fine when the file is under 9.4 MB but fails when it has additional attributes:
od_50k_minimal = od_50k %>%
  transmute(id = seq(n()))
sf::write_sf(od_50k_minimal, "od_50k_minimal.geojson", delete_dsn = TRUE)
fs::file_size("od_50k_minimal.geojson") # 9.4 MB
batch(desire_lines = od_50k_minimal, wait = FALSE, silent = FALSE, username = "robinlovelace")
system("gh release upload --clobber v0.6.0 od_50k_minimal.geojson")
Seems that the ~16 MB upper limit could still be in operation:
> sf::write_sf(od_50k, "od_50k.geojson", delete_dsn = TRUE)
> fs::file_size("od_50k.geojson")
26 MB
> batch(desire_lines = od_50k, wait = FALSE, silent = FALSE, username = "robinlovelace")
POSTing the request to create and start the job
Posting to: https://api.cyclestreets.net/v2/batchroutes.createjob?key=xxx
Job id:
Error in batch(desire_lines = od_50k, wait = FALSE, silent = FALSE, username = "robinlovelace") :
Check your credentials, try again, and maybe contact CycleStreets
https://github.com/cyclestreets/cyclestreets-r/releases/download/v0.6.0/od_100k.geojson
Just to say that posting that job (with the 100k GeoJSON containing verbose attributes) via the UI, using the URL retrieval method, does succeed, so the system is correctly processing the data after successful upload.
I am now trying an API-based submission.
Incidentally, can I suggest that the package could use some better error reporting:
Error in batch(desire_lines = od_50k, wait = FALSE, silent = FALSE, username = "robinlovelace") :
Check your credentials, try again, and maybe contact CycleStreets
basically looks like "Something went wrong, could be anything".
Good client code will differentiate between no response at all (e.g. a network failure or an error 500) and a response containing an error code, and if there is an error code, it will display that error. For instance, if the issue were with credentials, you would get a response with an error: ... string saying so, so it's probably not ideal for the error message to imply that credentials could be the problem when, if they were, there would have been an explicit message saying so.
Agreed. Will work on improving the reporting.
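As a rough sketch of the kind of differentiation described above (not the current implementation; it assumes the API returns a JSON body with an error field on failure, as noted above, and req_url / req_body are placeholders):

```r
library(httr)
library(jsonlite)

# Distinguish "no response at all" from "a response carrying an error message"
res = tryCatch(POST(req_url, body = req_body, encode = "json"),
               error = function(e) NULL)
if (is.null(res)) {
  stop("No response from the server (network failure?)")
}
txt = content(res, as = "text", encoding = "UTF-8")
parsed = tryCatch(fromJSON(txt), error = function(e) NULL)
if (!is.null(parsed$error)) {
  # Surface the API's own message rather than a generic guess
  stop("CycleStreets API error: ", parsed$error)
}
if (http_error(res)) {
  stop("HTTP error ", status_code(res), " with no parseable error message")
}
```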
OK, I've found where a problem is occurring: in /v2/batchroutes.createjob, in the GeoJSON unpacking phase, causing an out-of-memory issue. Leave it with me.
Basically the submission is coming through, but the process is crashing, and so you will almost certainly be getting an error 500 with no proper response.
I believe this is now fixed. Thanks for your forbearance on this.
I traced the problem to the way that the high memory mode was being set inconsistently: the UI was getting it but not the API context. I've moved the relevant code, and also rewritten the memory limit level to be based more consistently on the available GeoJSON setting and with a higher overhead.
For background, there are basically three limits in play here, all of which must be sufficient to work:
1) The posted HTTP size limit at the API; this was already sufficient.
2) The storage size for the GeoJSON data; this was high enough, but any value exceeding it was not being properly reported as an error, and the limit had not been stated on the API documentation page.
3) The memory required to process the GeoJSON data; this is now based on the maximum storage size, but was previously set to a hard-coded amount that had not been adjusted in line with changes to that.
As you can probably imagine, these are all quite difficult to test in terms of scaling without having real-world examples of large datasets and without seeing variance of the number of routes vs the amount of data per feature. Thanks for providing those.
The magic number though is the 64MB of data size as documented.
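For illustration, a client-side pre-flight check against that documented limit might look something like this (a sketch only; check_batch_size is a made-up helper, not part of the package or API):

```r
# Fail fast if the serialised GeoJSON exceeds the documented 64 MB limit
check_batch_size = function(geojson_path, max_mb = 64) {
  size_mb = as.numeric(fs::file_size(geojson_path)) / 1024^2
  if (size_mb > max_mb) {
    stop(sprintf("GeoJSON is %.1f MB; the documented API limit is %d MB",
                 size_mb, max_mb))
  }
  invisible(size_mb)
}

check_batch_size("od_50k.geojson")
```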
There may still be a bit of adjustment needed, and with more time I would like to do more testing, but I think it should be significantly more reliable now.
In general, if you submit GeoJSONs without unnecessary fields you will reach any of these limits far less quickly, and therefore fit far more rows per job. It is not necessary to strip out such fields, and it is obviously nice to be able to re-use the same file as in other operations in a workflow. But if the package stripped them in memory by default, the 64MB string size limit would be much less likely ever to be hit by a user. All those extra fields take up space, and all have to be expanded inefficiently when processing the GeoJSON, which is quite a garbagey operation generally. It would be interesting to know how many rows of your dataset can fit in 64MB once those extra fields (used only elsewhere) are stripped.
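As a sketch of what such in-memory stripping could look like on the R side (strip_for_batch is a hypothetical helper, not something the package currently provides):

```r
library(sf)
library(dplyr)

# Hypothetical helper: keep only an id column plus the geometry before upload,
# so the serialised GeoJSON stays well under the 64 MB limit
strip_for_batch = function(desire_lines) {
  desire_lines %>%
    transmute(id = dplyr::row_number())
}

od_50k_slim = strip_for_batch(od_50k)
sf::write_sf(od_50k_slim, "od_50k_slim.geojson", delete_dsn = TRUE)
fs::file_size("od_50k_slim.geojson")
```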
Great work!
Confirmed on the R side:
batch(desire_lines = od_100k_minimal, wait = FALSE, silent = FALSE, username = "robinlovelace")
POSTing the request to create and start the job
Posting to: https://api.cyclestreets.net/v2/batchroutes.createjob?key=xxx
Job id: 4115
Sending data, wait...
Routing not complete
Routing job sent, check back in around 56 minutes if you've just sent this
Check at www.cyclestreets.net/journey/batch/ for route id: 4115
[1] 4115
And on the web UI.
Great.
Obviously the area in this dataset isn't covered by the specific server you're using. But yes, the GeoJSON has clearly been processed fine.
Good news: we can now go above 15k lines, as per #51 and shown below, giving a good speed-up of nearly 10x compared with normal routing.
However, when I go from 30k to 40k lines it fails again. Any ideas @mvl22?