arup-group / genet

Manipulate MATSim networks via a Python API.

Volume mounting behaviour has changed in the latest version of the Docker image #222

Closed mfitz closed 10 months ago

mfitz commented 10 months ago

The Problem

Our daily Matesto CI pipeline has been failing for the last three days. The failing step is always GeNet network simplification, and it has failed in the same way on every run since the most recent PR (which included some changes to the Dockerfile) was merged.

The error from the Popper-managed pipeline looks like this:

DEBUG: Container args: {'image': '************.dkr.ecr.eu-west-1.amazonaws.com/genet:latest', 'command': ['genet', 'simplify-network', '-n', 'matsim12-test-town/network.xml', '-s', 'matsim12-test-town/transitschedule.xml', '-p', 'epsg:27700', '-od', 'working-dir/genet'], 'name': 'popper_genet-testtown12-network-simplification_0b0c3d7f', 'volumes': ['/home/arup/matesto:/workspace:Z', '/home/arup/matesto/data/matsim11-test-town:/matsim11-test-town', '/home/arup/matesto/data/matsim12-test-town:/matsim12-test-town', '/home/arup/matesto/data/events:/events', '/home/arup/matesto/data/londinium:/londinium', '/home/arup/matesto/working-dir:/working-dir', '/var/run/docker.sock:/var/run/docker.sock'], 'working_dir': '/', 'environment': {'GIT_COMMIT': 'c712c8df6bbc0d73c230883c6a77924a016e115b', 'GIT_BRANCH': 'add_script_to_run_all_pipelines', 'GIT_SHA_SHORT': 'c712c8d', 'GIT_REMOTE_ORIGIN_URL': 'https://github.com/arup-group/matesto', 'GIT_TAG': ''}, 'entrypoint': None, 'detach': True, 'tty': False, 'stdin_open': False, 'privileged': True, 'mem_limit': '10g'}

[genet-testtown12-network-simplification] docker create name=popper_genet-testtown12-network-simplification_0b0c3d7f image=************.dkr.ecr.eu-west-1.amazonaws.com/genet:latest command=['genet', 'simplify-network', '-n', 'matsim12-test-town/network.xml', '-s', 'matsim12-test-town/transitschedule.xml', '-p', 'epsg:27700', '-od', 'working-dir/genet']
[genet-testtown12-network-simplification] docker start
Usage: genet simplify-network [OPTIONS]
Try 'genet simplify-network --help' for help.

Error: Invalid value for '-od' / '--output_dir': Directory 'working-dir/genet' is not writable.
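
For context, an "is not writable" message like the one above is typically produced by a pre-flight check on the output directory, evaluated for whichever user the process runs as. Below is a minimal Python sketch of such a check using only the standard library. It assumes the check amounts to os.access with os.W_OK; the function name diagnose_output_dir is hypothetical and is not part of GeNet.

```python
import os


def diagnose_output_dir(path: str) -> str:
    """Classify why a directory might be rejected as an output location.

    Hypothetical helper for illustration only; it is not part of GeNet.
    """
    if not os.path.exists(path):
        return "missing"
    if not os.path.isdir(path):
        return "not a directory"
    # os.access answers for the *current* user, so the same host directory
    # can pass for root and fail for an unprivileged container user.
    if not os.access(path, os.W_OK):
        return "not writable"
    return "ok"
```

Running diagnose_output_dir("working-dir/genet") inside each image would show which of the three conditions the container user actually hits.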

Some Investigation

Manually running network simplification via the Docker CLI

Using the previous version of GeNet's Docker image

All good.

docker run -v /home/arup/matesto/data/matsim12-test-town:/matsim12-test-town -v /home/arup/matesto/working-dir:/working-dir ************.dkr.ecr.eu-west-1.amazonaws.com/genet:29993df genet simplify-network -n matsim12-test-town/network.xml -s matsim12-test-town/transitschedule.xml -p epsg:27700 -od working-dir/genet

2024-01-11 22:16:06,800 - Reading in network at matsim12-test-town/network.xml
2024-01-11 22:16:06,802 - Reading in schedule at matsim12-test-town/transitschedule.xml
2024-01-11 22:16:06,802 - No vehicles file given with the Schedule, vehicle types will be based on the default.
2024-01-11 22:16:07,452 - Simplifying the Network.
2024-01-11 22:16:07,453 - Begin simplifying the graph
2024-01-11 22:16:07,453 - Generating paths to be simplified
2024-01-11 22:16:07,453 - Identified 8 edge endpoints
2024-01-11 22:16:07,453 - Identified 14 possible paths
2024-01-11 22:16:07,453 - Processing 14 paths
2024-01-11 22:16:07,454 - Found 6 paths to simplify.
2024-01-11 22:16:07,454 - Generated 6 link ids.
2024-01-11 22:16:07,454 - Processing links for all paths to be simplified
2024-01-11 22:16:07,459 - Adding new simplified links
2024-01-11 22:16:07,481 - Generated 0 link ids.
2024-01-11 22:16:07,487 - Added 6 links
2024-01-11 22:16:07,491 - Simplified graph: 14 to 8 nodes, 26 to 14 edges
2024-01-11 22:16:07,492 - Updating the Schedule
2024-01-11 22:16:07,492 - Mapping attributes resulted in 0 changes. Ensure your location variable: linkRefId exists as keys in the input dictionaries. Only dictionaries with location=linkRefId keys will be mapped.
2024-01-11 22:16:07,493 - Changed Stop attributes for 0 stops
2024-01-11 22:16:07,493 - Updated Stop Link Reference Ids
2024-01-11 22:16:07,497 - Changed Route attributes for 2 routes
2024-01-11 22:16:07,497 - Updated Network Routes
2024-01-11 22:16:07,497 - Finished simplifying network
2024-01-11 22:16:07,500 - This took 0.001 min.
2024-01-11 22:16:07,500 - Simplification resulted in 18 links being simplified.
2024-01-11 22:16:07,503 - Checking for disconnected subgraphs
2024-01-11 22:16:07,505 - This took 0.0 min.
2024-01-11 22:16:07,505 - Writing working-dir/genet/network.xml
2024-01-11 22:16:07,515 - Writing working-dir/genet/schedule.xml
2024-01-11 22:16:07,517 - Writing working-dir/genet/vehicles.xml
2024-01-11 22:16:07,526 - Generating scaled vehicles xml.
2024-01-11 22:16:07,527 - Writing working-dir/genet/1_perc_vehicles.xml
2024-01-11 22:16:07,529 - Created scaled vehicle file for 1% capacity & 1% pce.
2024-01-11 22:16:07,530 - Writing working-dir/genet/10_perc_vehicles.xml
2024-01-11 22:16:07,531 - Created scaled vehicle file for 10% capacity & 10% pce.
2024-01-11 22:16:07,531 - Generating validation report
2024-01-11 22:16:07,532 - Checking validity of the Network
2024-01-11 22:16:07,532 - Checking validity of the Network graph
2024-01-11 22:16:07,532 - Defaulting to checking graph connectivity for modes: ['car', 'walk', 'bike']. You can change this by passing a `modes_for_strong_connectivity` param
2024-01-11 22:16:07,532 - Checking network connectivity for mode: car
2024-01-11 22:16:07,533 - The graph for mode: car has: 1 connected components, 0 sinks/dead_ends and 0 sources/unreachable nodes.
2024-01-11 22:16:07,533 - Checking network connectivity for mode: walk
2024-01-11 22:16:07,534 - The graph for mode: walk has: 1 connected components, 0 sinks/dead_ends and 0 sources/unreachable nodes.
2024-01-11 22:16:07,534 - Checking network connectivity for mode: bike
2024-01-11 22:16:07,535 - The graph for mode: bike has: 1 connected components, 0 sinks/dead_ends and 0 sources/unreachable nodes.
2024-01-11 22:16:07,536 - Checking link values for `modes`
2024-01-11 22:16:07,538 - Checking link values for `permlanes`
2024-01-11 22:16:07,539 - Checking link values for `capacity`
2024-01-11 22:16:07,540 - Checking link values for `freespeed`
2024-01-11 22:16:07,541 - Checking link values for `length`
2024-01-11 22:16:07,542 - Checking link values for `ids`
2024-01-11 22:16:07,542 - Checking validity of the Schedule
2024-01-11 22:16:07,542 - Computing headway stats
2024-01-11 22:16:07,609 - Checking validity of PT vehicles
2024-01-11 22:16:07,620 - All vehicles are being used.
2024-01-11 22:16:07,629 - No vehicles being used for multiple trips have been found.
2024-01-11 22:16:07,630 - Computing speeds
2024-01-11 22:16:07,690 - Checking speeds for prohibitive values 0 and infinity. You should verify speed values separately
2024-01-11 22:16:07,698 - Network does not have intermodal access/egress connections
2024-01-11 22:16:07,699 - Graph validation: {'car': {'problem_nodes': {'dead_ends': [], 'unreachable_node': []}, 'number_of_connected_subgraphs': 1}, 'walk': {'problem_nodes': {'dead_ends': [], 'unreachable_node': []}, 'number_of_connected_subgraphs': 1}, 'bike': {'problem_nodes': {'dead_ends': [], 'unreachable_node': []}, 'number_of_connected_subgraphs': 1}}
2024-01-11 22:16:07,699 - Schedule level validation: True
2024-01-11 22:16:07,699 - Routing validation: True
2024-01-11 22:16:07,701 - Generating summary report
2024-01-11 22:16:07,701 - Creating a summary report
2024-01-11 22:16:07,705 - Generating geojson outputs for the entire network in working-dir/genet/standard_outputs
2024-01-11 22:16:07,808 - Saving Network to GeoJSON in working-dir/genet/standard_outputs
2024-01-11 22:16:08,323 - Saving Schedule to GeoJSON in working-dir/genet/standard_outputs
2024-01-11 22:16:08,415 - Generating geojson outputs for car/driving modal subgraph
2024-01-11 22:16:08,476 - Generating geojson outputs for different highway tags in car modal subgraph
2024-01-11 22:16:08,478 - Generating geometry-only geojson outputs for cycle modal subgraph
2024-01-11 22:16:08,500 - Generating geometry-only geojson outputs for walk modal subgraph
2024-01-11 22:16:08,521 - Generating geometry-only geojson outputs for car modal subgraph
2024-01-11 22:16:08,542 - Generating geometry-only geojson outputs for bus modal subgraph
2024-01-11 22:16:08,564 - Generating geometry-only geojson outputs for bike modal subgraph
2024-01-11 22:16:08,586 - Generating geojson standard outputs for schedule
2024-01-11 22:16:08,633 - Generating vehicles per hour for bus
2024-01-11 22:16:08,685 - Generating schedule graph for bus
2024-01-11 22:16:08,749 - Saving vehicles per hour for all PT modes
2024-01-11 22:16:08,771 - Saving vehicles per hour for all PT modes for selected hour slices
2024-01-11 22:16:08,843 - Generating stop-to-stop speed outputs with network_factor=1.3
2024-01-11 22:16:08,875 - Right now routed speeds do not account for services snapping to long network links. Be sure to account for that in your investigations and check the non-routed `pt_speeds`output as well.
2024-01-11 22:16:08,980 - Generating csv for vehicles per hour for each service
2024-01-11 22:16:09,000 - Generating csv for vehicles per hour per stop
2024-01-11 22:16:09,032 - Generating csvs for trips per day
2024-01-11 22:16:09,064 - Generating PT network routes
2024-01-11 22:16:09,126 - Creating a summary report
2024-01-11 22:16:09,129 - Finished generating standard outputs. Zipping folder.

Using the current version of GeNet's Docker image

The same command, with only the image tag changed to latest, fails, apparently because GeNet cannot read the volume mounted into the container at /matsim12-test-town.

docker run -v /home/arup/matesto/data/matsim12-test-town:/matsim12-test-town -v /home/arup/matesto/working-dir:/working-dir ************.dkr.ecr.eu-west-1.amazonaws.com/genet:latest genet simplify-network -n matsim12-test-town/network.xml -s matsim12-test-town/transitschedule.xml -p epsg:27700 -od working-dir/genet

Usage: genet simplify-network [OPTIONS]
Try 'genet simplify-network --help' for help.

Error: Invalid value for '-n' / '--network': Path 'matsim12-test-town/network.xml' does not exist.

Checking the default user inside the Docker container

docker run ************.dkr.ecr.eu-west-1.amazonaws.com/genet:latest whoami
mambauser

docker run ************.dkr.ecr.eu-west-1.amazonaws.com/genet:29993df whoami
root

It looks like changing the base image from python:3.11.4-bullseye to mambaorg/micromamba:1.5.3-bullseye-slim in this PR has changed the default user inside the container.

It seems likely that the problem is that non-root users inside the container cannot access the volumes mounted from the host machine. Modifying the permissions on the host directories may be a viable workaround. It is also possible to override the default container user via the --user parameter to the docker run command, but the Popper library we are using in Matesto does not provide a way to do the same thing programmatically.
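
For manual runs, the --user override could be scripted along these lines. This is a rough Python sketch, not anything Popper or GeNet provides: the helper build_docker_run is hypothetical, and genet:latest is a placeholder for the real image name in our private registry.

```python
import shlex
from typing import Optional, Sequence


def build_docker_run(
    image: str,
    command: Sequence[str],
    volumes: Sequence[str] = (),
    user: Optional[str] = None,
) -> list[str]:
    """Assemble a `docker run` argument list (hypothetical helper)."""
    argv = ["docker", "run"]
    for volume in volumes:
        argv += ["-v", volume]
    if user is not None:
        # --user overrides the image's default user, e.g. "root" or "1000:1000".
        argv += ["--user", user]
    argv.append(image)
    argv.extend(command)
    return argv


argv = build_docker_run(
    image="genet:latest",  # placeholder; substitute the full registry path
    command=["whoami"],
    volumes=["/home/arup/matesto/working-dir:/working-dir"],
    user="root",
)
# Print the command for inspection; pass argv to subprocess.run to execute it.
print(shlex.join(argv))
```

Building the argument list separately from executing it keeps the sketch testable on a machine without Docker installed.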

A quick fix would probably be to change the user in the Dockerfile, but we will need to have a conversation about that if it was a deliberate decision to move away from the root user in the container.

mfitz commented 10 months ago

After further investigation, I think this is a difference in the behaviour around mounted volumes between the previous and current base Docker images, but it isn't a failure to mount the volumes as I previously suspected. Instead it looks like a difference in how each image allows you to specify paths under the volume mount point.

Using the old image, I can mount the volume at /matsim12-test-town, but then use matsim12-test-town with no slash when specifying paths under the mount point:

$ docker run -v /home/arup/matesto/data/matsim12-test-town:/matsim12-test-town ************.dkr.ecr.eu-west-1.amazonaws.com/genet:29993df ls -talh matsim12-test-town
total 272K
drwxr-xr-x 1 root root 4.0K Jan 11 23:54 ..
drwxrwxrwx 4 1000 1000 4.0K Nov  7 20:20 .
-rwxrwxrwx 1 1000 1000  27K Nov  7 20:20 qsim_matsim_config_test_town_12_multimodal_config.xml
-rwxrwxrwx 1 1000 1000  27K Nov  7 20:20 hermes_matsim_config_test_town_12.xml
-rwxrwxrwx 1 1000 1000  37K Nov  7 20:19 qsim_matsim_config_test_town_12_runmatsim_config.xml
drwxrwxrwx 2 1000 1000 4.0K Oct 27 10:11 elara-config
drwxrwxrwx 2 1000 1000 4.0K Dec 19  2022 bitsim
-rwxrwxrwx 1 1000 1000 4.0K Dec 19  2022 network.xml
-rwxrwxrwx 1 1000 1000  27K Dec 19  2022 test_multimodal_config.xml
-rwxrwxrwx 1 1000 1000 2.1K Dec 19  2022 all_vehicles.xml
-rwxrwxrwx 1 1000 1000 2.5K Oct 11  2021 DATASET.md
-rwxrwxrwx 1 1000 1000 2.9K Aug 25  2021 population_v12.xml
-rwxrwxrwx 1 1000 1000 1.3K Apr 30  2021 attributes.xml
-rwxrwxrwx 1 1000 1000 2.3K Apr 30  2021 facilities.xml
-rwxrwxrwx 1 1000 1000 3.9K Apr 30  2021 population.xml
-rwxrwxrwx 1 1000 1000 3.8K Apr 30  2021 population_multimodal.xml
-rwxrwxrwx 1 1000 1000 3.4K Apr 30  2021 population_multimodal_no_network_links.xml
-rwxrwxrwx 1 1000 1000  359 Apr 30  2021 road-pricing.xml
-rwxrwxrwx 1 1000 1000  26K Apr 30  2021 test_config.xml
-rwxrwxrwx 1 1000 1000  26K Apr 30  2021 test_facilities_config.xml
-rwxrwxrwx 1 1000 1000  27K Apr 30  2021 test_multimodal_config_simplified_network.xml
-rwxrwxrwx 1 1000 1000  802 Apr 30  2021 transitVehicles.xml
-rwxrwxrwx 1 1000 1000 2.3K Apr 30  2021 transitschedule.xml

If I do the same thing with the new image, I cannot see the directory:

$ docker run -v /home/arup/matesto/data/matsim12-test-town:/matsim12-test-town ************.dkr.ecr.eu-west-1.amazonaws.com/genet:latest ls -talh matsim12-test-town
ls: cannot access 'matsim12-test-town': No such file or directory

However, this is not because the volume has failed to mount. We can see it as a file system inside the container:

$ docker run -v /home/arup/matesto/data/matsim12-test-town:/matsim12-test-town ************.dkr.ecr.eu-west-1.amazonaws.com/genet:latest df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          62G   32G   31G  51% /
tmpfs            64M     0   64M   0% /dev
tmpfs           479M     0  479M   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
/dev/root        62G   32G   31G  51% /matsim12-test-town
tmpfs           479M     0  479M   0% /proc/acpi
tmpfs           479M     0  479M   0% /proc/scsi
tmpfs           479M     0  479M   0% /sys/firmware

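Rather, we cannot see the directory because we can no longer get away with omitting the leading slash from the path. A path without a leading slash is relative, so it is resolved against the container's working directory; it only reaches the mount point when that directory is /, which appears not to be the case in the new image. If we add the slash to the path, we're in business: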

$ docker run -v /home/arup/matesto/data/matsim12-test-town:/matsim12-test-town ************.dkr.ecr.eu-west-1.amazonaws.com/genet:latest ls -talh /matsim12-test-town
total 272K
drwxr-xr-x 1 root root 4.0K Jan 11 23:54 ..
drwxrwxrwx 4 1000 1000 4.0K Nov  7 20:20 .
-rwxrwxrwx 1 1000 1000  27K Nov  7 20:20 qsim_matsim_config_test_town_12_multimodal_config.xml
-rwxrwxrwx 1 1000 1000  27K Nov  7 20:20 hermes_matsim_config_test_town_12.xml
-rwxrwxrwx 1 1000 1000  37K Nov  7 20:19 qsim_matsim_config_test_town_12_runmatsim_config.xml
drwxrwxrwx 2 1000 1000 4.0K Oct 27 10:11 elara-config
drwxrwxrwx 2 1000 1000 4.0K Dec 19  2022 bitsim
-rwxrwxrwx 1 1000 1000 4.0K Dec 19  2022 network.xml
-rwxrwxrwx 1 1000 1000  27K Dec 19  2022 test_multimodal_config.xml
-rwxrwxrwx 1 1000 1000 2.1K Dec 19  2022 all_vehicles.xml
-rwxrwxrwx 1 1000 1000 2.5K Oct 11  2021 DATASET.md
-rwxrwxrwx 1 1000 1000 2.9K Aug 25  2021 population_v12.xml
-rwxrwxrwx 1 1000 1000 1.3K Apr 30  2021 attributes.xml
-rwxrwxrwx 1 1000 1000 2.3K Apr 30  2021 facilities.xml
-rwxrwxrwx 1 1000 1000 3.9K Apr 30  2021 population.xml
-rwxrwxrwx 1 1000 1000 3.8K Apr 30  2021 population_multimodal.xml
-rwxrwxrwx 1 1000 1000 3.4K Apr 30  2021 population_multimodal_no_network_links.xml
-rwxrwxrwx 1 1000 1000  359 Apr 30  2021 road-pricing.xml
-rwxrwxrwx 1 1000 1000  26K Apr 30  2021 test_config.xml
-rwxrwxrwx 1 1000 1000  26K Apr 30  2021 test_facilities_config.xml
-rwxrwxrwx 1 1000 1000  27K Apr 30  2021 test_multimodal_config_simplified_network.xml
-rwxrwxrwx 1 1000 1000  802 Apr 30  2021 transitVehicles.xml
-rwxrwxrwx 1 1000 1000 2.3K Apr 30  2021 transitschedule.xml
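
The difference between the two ls invocations comes down to how relative paths are resolved against the process's working directory. The Python sketch below illustrates this with the standard library's posixpath module. The function name resolve_in_container is hypothetical, and the /tmp working directory is an assumed example; I have not confirmed what the new image actually uses.

```python
import posixpath


def resolve_in_container(path: str, workdir: str) -> str:
    """Return the absolute path a process started in `workdir` would open."""
    # posixpath.join discards `workdir` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(workdir, path))


# Working directory "/" (consistent with the old image's behaviour): the
# relative path happens to land on the mount point.
print(resolve_in_container("matsim12-test-town/network.xml", "/"))

# Any other working directory ("/tmp" is an assumed example, not a value
# confirmed for the new image): the relative path misses the mount.
print(resolve_in_container("matsim12-test-town/network.xml", "/tmp"))

# An absolute path is unaffected by the working directory.
print(resolve_in_container("/matsim12-test-town/network.xml", "/tmp"))
```

Only the second call produces a path outside the mount point, which is why spelling out the leading slash in pipeline arguments makes them independent of whichever working directory the image sets.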

I'm closing this issue because no action needs to be taken in GeNet (but I will be fixing up some pipelines in Matesto...).