a-b-street / abstreet

Transportation planning and traffic simulation software for creating cities friendlier to walking, biking, and public transit
https://a-b-street.github.io/docs/
Apache License 2.0

Support O(100) cities #326

Open dabreegster opened 4 years ago

dabreegster commented 4 years ago

We've been adding a few other cities slowly and with considerable effort. What would it take to maintain a few hundred?

Downloading optional content

Storing the maps

Map config

Maintaining the maps

michaelkirk commented 4 years ago

Do the released versions include the bundled maps? Or is it expected that the user/application will download maps separately?

Just wondering if we need to leave archived versions of the maps around for old app versions.

dabreegster commented 4 years ago

Do the released versions include the bundled maps?

Yes, but only for the few "curated" maps, aka Seattle and maybe one or two more. To make the initial install experience even quicker, it could be worth removing a few maps from that too.

Just wondering if we need to leave archived versions of the maps around for old app versions.

I'm hesitant to keep more than a few old versions around, just for storage/price reasons

matkoniecz commented 4 years ago

What would it take to maintain a few hundred?

Also, many cities would require new features to work as expected: trams in Kraków, aerialways used as major public transport in some places, movable bridges in others, congestion pricing in London, and so on.

I would expect that adding 100 cities would require adding, say, 25 major features, each taking as much work as the tram support triggered by the Kraków map. The alternative would be to avoid cities with special features, or to have them in a half-broken state.

Map config

Is it planned to add just city centers, or multiple regions like Seattle? In either case, some tool to easily create boundaries would be nice (dragging nodes over a map, displayed on a website).

Traffic model

Every single new map will also reveal blatant problems in the traffic data model. (Currently any map will do this, but even if it gets improved to work decently on some maps, any new map will still reveal a hilarious mismatch with reality.)

dabreegster commented 4 years ago

I'll mention that I'm not planning to prioritize this work anytime soon; I just wanted to write down some of the ideas.

Alternative would be to avoid cities with special features or have them in a half-broken state.

In the short term, the goal is just to get them started in a partly broken state. Ideally more people would become interested in the project and help implement the new features.

Is it planned to add just city centers or multiple regions as Seattle?

Just city centers to start, or maybe the entire region, as defined by the bbike.org extract. But I'd like all cities to be split into multiple regions like Seattle, and I think that's best done by somebody familiar with the place. geojson.io lets you draw multiple polygons and already has a full world map, so I think it'll suffice for now. There is "internal dev tools > edit a polygon" in the game, but it requires starting with the larger region, and is more useful for fine-tuning boundaries after they're initially drawn.

Every single new map will also reveal blatant problems in the traffic data model.

In the proletariat robot travel demand, you mean? Definitely. This is an opportunity to find lots of problems with it quickly, which will hopefully shape its development better.

natrius commented 3 years ago

What about using Nextcloud instead of Dropbox for the files? There are hosted versions around, both free and paid. You can look at the options here: https://nextcloud.com/providers/

matkoniecz commented 3 years ago

There are free hosted versions around

"2GB of free storage" is not too useful, in general free file hosting is useful only when you are extremely unwilling to pay or want to text files or something similarly lightweight

natrius commented 3 years ago

Dropbox is 2GB when free, 2,000GB when paying 10 euros. With https://cloud.tab.digital it's 8GB for free and 128GB for 5 euros per month. But that was not the point; there are multiple providers, and it is possible to choose. I was just suggesting it because of the "Dropbox daemon crashes constantly when uploading lots of new files" problem. It may be worth at least trying it.

Maybe Seafile would be better as well, since just file sync is needed and nothing else from the feature set of Nextcloud. Here is a provider: https://luckycloud.de/en/preise-cloud-speicher-und-funktionen

I'm suggesting options, and it seems to me you're nitpicking by comparing a paid Dropbox product with one specific free product from a Nextcloud provider?

EDIT: Just to clarify to @dabreegster: cloud.tab is using Nextcloud. Nextcloud is a service you can host on your own server. If you need an account for a short test, I'm willing to create one on my server, but I guess a free one on cloud.tab is less hassle :D Seafile is a service of its own, also possible to host on your own server, or you can use a provider doing it for you, like luckycloud. :) Thanks for your answer.

dabreegster commented 3 years ago

Thanks for the pointer to nextcloud, cloudtab, seafile, etc! I'll take a closer look when I start working on this. Price isn't a strong factor, as long as it's reasonable. Biggest priorities are to sync files easily and be able to construct URLs without having to keep an extra mapping. The project uses Dropbox now simply because I already had an account for other reasons. :P

dabreegster commented 3 years ago

Starting to play around with the process for generating loads of maps. I want to try out https://taskfile.dev and some others for job management, but at the moment, GNU parallel is working fine to import lots of files 4 at a time, with separate log files: for x in ~/bbike_extracts/*; do echo "./import.sh --oneshot=$x --skip_ch > $(basename $x).log 2>&1"; done | parallel --bar -j4
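
Spelled out as a script, the same approach looks roughly like this. It's a sketch based on the one-liner above: the ~/bbike_extracts directory, import.sh, and its --oneshot/--skip_ch flags are taken from that command, and the job count of 4 is arbitrary.

```bash
#!/usr/bin/env bash
# Import every bbike extract, 4 at a time, with a separate log per extract.
set -euo pipefail

for x in ~/bbike_extracts/*; do
  # Each echoed line is one shell command for GNU parallel to execute.
  echo "./import.sh --oneshot=$x --skip_ch > $(basename "$x").log 2>&1"
done | parallel --bar -j4
```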

dabreegster commented 3 years ago

Some of this work is coming together (mainly motivated by having more maps for OSM Connect 2020), but I still don't know how to organize things. Thinking through this again...

What does the end state look like?

So where do files need to wind up?

Storage size concerns:

Maintaining all of the maps:

michaelkirk commented 3 years ago

Storage size concerns:

A note - if you wanted just a bit more wiggle room, the map files tend to compress to about 1/3 their original size.

-rw-rw-r--  1 mkirk  staff    43M Sep 21 14:21 ballard.bin
-rw-r--r--  1 mkirk  staff    16M Oct 27 16:05 ballard.bin.gz
-rw-rw-r--  1 mkirk  staff    23M Sep 21 14:30 downtown.bin
-rw-r--r--  1 mkirk  staff   8.0M Oct 27 16:05 downtown.bin.gz
-rw-r--r--  1 mkirk  staff   247M Sep 21 14:26 huge_seattle.bin
-rw-r--r--  1 mkirk  staff    90M Oct 27 16:05 huge_seattle.bin.gz
-rw-rw-r--  1 mkirk  staff    20M Sep 21 14:26 lakeslice.bin
-rw-r--r--  1 mkirk  staff   7.1M Oct 27 16:05 lakeslice.bin.gz
-rw-r--r--  1 mkirk  staff    57M Sep 22 14:56 los_angeles_midwest.bin
-rw-r--r--  1 mkirk  staff    21M Oct 27 16:05 los_angeles_midwest.bin.gz
-rw-rw-r--  1 mkirk  staff   3.5M Sep 21 14:27 montlake.bin
-rw-r--r--  1 mkirk  staff   1.2M Oct 27 16:05 montlake.bin.gz
-rw-rw-r--  1 mkirk  staff    53M Sep 21 14:37 south_seattle.bin
-rw-r--r--  1 mkirk  staff    19M Oct 27 16:05 south_seattle.bin.gz
-rw-rw-r--  1 mkirk  staff   9.5M Sep 21 14:27 udistrict.bin
-rw-r--r--  1 mkirk  staff   3.3M Oct 27 16:05 udistrict.bin.gz
-rw-rw-r--  1 mkirk  staff    47M Sep 21 14:28 west_seattle.bin
-rw-r--r--  1 mkirk  staff    17M Oct 27 16:05 west_seattle.bin.gz
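
For anyone reproducing that comparison, a minimal sketch, run from whatever directory holds the .bin maps (gzip -k needs GNU gzip 1.6+ to keep the originals):

```bash
# Compress each map next to the original, then compare sizes.
for f in *.bin; do
  gzip -kf -9 "$f"           # writes $f.gz, keeps $f
done
ls -lh *.bin *.bin.gz         # per-file before/after
du -ch *.bin | tail -n1       # total uncompressed
du -ch *.bin.gz | tail -n1    # total compressed, roughly 1/3 the size
```
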
dabreegster commented 3 years ago

A note - if you wanted just a bit more wiggle room, the map files tend to compress to about 1/3 their original size.

Good point! The files are currently stored compressed in Dropbox, and the updater manages the transformation. But the S3 files for web aren't stored compressed. I will try some experiments to do on-the-fly decompression when the web client loads a file.
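
One low-effort way to get that effect is to store the gzipped bytes and set the Content-Encoding header on S3, so browsers and most HTTP clients decompress transparently on download. A sketch follows; the bucket name matches the demo URL later in this thread, but the key and local path are placeholders, and this isn't necessarily how the project ended up wiring it:

```bash
# Upload a pre-gzipped map so HTTP clients decompress it on the fly.
# Bucket, key, and local path are illustrative placeholders.
gzip -k -9 data/system/maps/montlake.bin
aws s3 cp data/system/maps/montlake.bin.gz \
  s3://abstreet/dev/data/system/maps/montlake.bin \
  --content-encoding gzip \
  --content-type application/octet-stream
```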

dabreegster commented 3 years ago

http://abstreet.s3-website.us-east-2.amazonaws.com/osm_demo/ is live with 123 maps. Gzipping on S3 is a huge help; 3GB instead of 8. A few simple next steps:

Then that paves the way for the native version to allow downloading extra cities.

dabreegster commented 3 years ago

I think it's time to revisit directory structure.

1) A flat list of data/system/maps fails as soon as two regions have "downtown" or something like that
2) I'm not sure how much hierarchy the UI or filesystem should expose for sorting cities -- na/usa/wa/seattle/downtown? europe/uk/leeds/center? It'll also be a question of extracting this hierarchy from OSM or somewhere else, although maybe there can be a little bit of manual mapping that happens.
3) The updater tool has a vaguely structured mapping from files to city, to figure out where optional data belongs. Similar to the data/input/$city structure, I think it may be time to revisit the concept of data/system/$city for organizing maps, scenarios, prebaked results, etc.
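
For a concrete picture of option 2 combined with the data/system/$city idea in option 3, one possible layout, using regions mentioned in this thread (folder names are illustrative only):

```
data/system/
  us/
    seattle/
      maps/downtown.bin
      scenarios/
      prebaked_results/
  gb/
    leeds/
      maps/center.bin
```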

dabreegster commented 3 years ago

Wound up surging to >30 cities as part of the actdev work. Feels like it's time to add another layer of namespacing to map names -- two-letter country codes. Here's a quick list of stuff I need to account for...

dabreegster commented 3 years ago

Alright, the grand renaming is done. I'll make the city picker nicer tomorrowish.

dabreegster commented 3 years ago

Some recent hardware failures have spurred me to think about moving the map importing process into the cloud again. data/regen.sh on my now dead machine took at least an hour, but the process could at least be parallelized by city. What would the development workflow look like?

1) Locally, work on new changes to the map importer. Run the importer and test locally, as usual.
2) When it's time to regenerate the world, package up the importer binary with the local changes somehow -- maybe in a git branch, or a temporary docker image that directly copies in the local Linux binary.
3) Tell some cloud service to go run one job per configured city.
4) The input for that per-city job is at least the clipped .osm file, so probably it needs to run the updater first for just the city it's working on.
5) The output should go in S3, in some temporary named version that can later be renamed to the dev version.
6) That cloud service has some kind of web or CLI UI to track job progress and view STDOUT/STDERR for jobs.
7) After the jobs are all done, the developer runs another script to pull down only some of the changed files -- Seattle and the few other cities that have screenshot testing or prebaked results. Manually run those other tools, producing a bit more data.
8) Once everything's confirmed, need to merge all of the S3 directories into one nice dev version. Also need to produce a merged data/MANIFEST.txt file and commit it somehow.
9) Push the git commit. Done!
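
For steps 2 and 3, a rough sketch of what the Docker + AWS Batch wiring could look like. Every name here (the image tag, ECR repo, job queue, job definition, and the importer's command-line flags) is a placeholder for illustration, not actual project configuration:

```bash
# Step 2: bake the locally built Linux importer binary into a throwaway image.
REPO=123456789012.dkr.ecr.us-east-2.amazonaws.com/abstreet-importer  # placeholder
docker build -t "$REPO:dev" .
docker push "$REPO:dev"

# Step 3: submit one AWS Batch job per configured city.
for city in us/seattle gb/leeds gb/london; do
  aws batch submit-job \
    --job-name "import-${city//\//-}" \
    --job-queue abstreet-import-queue \
    --job-definition abstreet-importer \
    --container-overrides "{\"command\": [\"./importer\", \"--city=$city\"]}"
done
```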

As a sort of interim solution for step 8, I can download all the changed files and produce the manifest locally; my downstream is fast, but upstream is still bad. A better solution long-term is probably to split the manifest file into per-city and have a better abstraction in the code for reading/merging all of them.

From a cursory glance, AWS Batch looks like a reasonable fit for the cloud service, since it can run Docker images, does some output redirection and logging by default, has a web UI, and at least has configuration for balancing speed/cost.

dabreegster commented 2 years ago

Most recent mass reimport was painful operationally. I have a small improvement I want to try:

1) Use the existing per-city jobs in regenerate_everything, but physically run a separate process per city
2) Use https://github.com/Nukesor/pueue to run cities in parallel and get nice split logs and overall tracking. In that way, it dodges some of the issues of #262
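
A minimal sketch of that setup, assuming a hypothetical per-city entry point (the ./importer command and city names are placeholders; parallel, add, status, and log are standard pueue subcommands):

```bash
# Queue one import process per city and let pueue schedule them.
pueue parallel 4                           # at most 4 cities importing at once
for city in us/seattle gb/leeds gb/london; do
  pueue add -- ./importer --regen "$city"  # placeholder per-city command
done
pueue status                               # overall tracking by status
pueue log 0                                # per-task (per-city) log output
```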

dabreegster commented 2 years ago

pueue works well enough, but I was hoping for a nicer summary counting jobs by status. Either way, a few minutes of work means I can now melt my laptop super fast: in 8 minutes of parallelization, I can fully regenerate 70 of 88 cities. A considerable win for my workflow. There are also per-city logs, and one city failing won't break everything else.

dabreegster commented 2 years ago

Now there are two long tails!

[Screenshot from 2022-02-10 16-13-34]

Parallelizing Seattle is hard because of weird dependencies between huge_seattle and the rest, and because that map has to be kept in memory. But as a possible next step, London could be parallelized by each borough map.