Different output between two machines

vvendi commented 11 months ago

Hello,

I'm using the Eqasim pipeline to generate a population for the city of Calais, France, and it works like a charm. However, I have a small problem, which I will do my best to describe here, hoping for a solution.

When I use the pipeline on a single computer, I always get the same output files for a given set of input files, which is perfect. The issue is that if I use the same set of input files on another computer and run the pipeline, I no longer get exactly the same output as on my first computer. To take it a step further, I created two virtual machines, installed the same version of each piece of software, and ran the pipeline using the exact same set of input files, and the outputs were slightly different. If I take one of these virtual machines and simply create a copy of it, running the pipeline on the copy gives the same output as on the original machine.

Therefore, I wonder: does the pipeline use some kind of identifier of the computer as a random seed? And is there a way to make the pipeline output identical on any computer?

Thank you in advance for your time

sebhoerl commented 11 months ago

Hi, thanks for sharing this observation. A priori, the problem sounds improbable to me, because even in the continuous integration tests on GitHub we run the pipeline tests on both a Linux instance and a Windows instance to check that we get the same results. And normally, everything depends on a single random seed that can be defined via configuration.

"I no longer get exactly the same output as on my first computer"

If you say that, what exactly do you compare? Maybe this is actually the mismatch here, because not all outputs are deterministic. Those of the population pipeline (CSV, GPKG) should be deterministic. Those of MATSim usually are not, because MATSim uses some Java structures that do not guarantee an iteration order (like Set), so it can happen that the <person> elements in the population are saved in different orders. And I can well imagine that this depends on the machine configuration rather than changing from run to run. Could it be that?
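To illustrate the ordering point, here is a minimal Python analogy (not MATSim code; the function name and the person ids are made up for the example). Like Java's HashSet, a Python set gives no ordering guarantee, so a writer that iterates over it directly can emit the same records in a different order. Sorting by a stable key before writing removes that dependency:

```python
def serialise_persons(person_ids):
    """Serialise person ids in a canonical (sorted) order.

    Hypothetical helper: the ids stand in for <person> elements.
    """
    return "\n".join(sorted(person_ids))


# The same persons, collected in two different orders.
run_a = {"person_3", "person_1", "person_2"}
run_b = {"person_2", "person_3", "person_1"}

# Sorting makes the serialised output independent of the set's iteration order.
same_output = serialise_persons(run_a) == serialise_persons(run_b)
```

With this approach the content of the file no longer depends on how the underlying container happens to iterate on a given machine.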

sebhoerl commented 11 months ago

See especially REFERENCE_CSV_HASHES, REFERENCE_GPKG_HASHES, and REFERENCE_HASHES in tests/test_determinism.py to see which files are actually tested to be deterministic.
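As a rough idea of what such a reference-hash check can look like, here is a sketch. It is not the content of tests/test_determinism.py: the function names and the dictionary layout (file name mapped to an MD5 hex digest) are assumptions made for the example.

```python
import hashlib
import os


def file_md5(path):
    """Return the MD5 hex digest of a file's raw bytes."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def mismatching_files(output_dir, reference_hashes):
    """Return the file names whose hash differs from the reference.

    `reference_hashes` is a hypothetical dict: file name -> expected MD5.
    """
    return [
        name
        for name, expected in sorted(reference_hashes.items())
        if file_md5(os.path.join(output_dir, name)) != expected
    ]
```

An empty list from `mismatching_files` would mean that every listed output matches its reference on the current machine.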

vvendi commented 11 months ago

Hi, thank you for your answer.

The first thing I compare when I run the full pipeline + simulation process is the variable of interest for my work: the average travel time of people using the bus. On a single machine, this value is nearly constant, with only slight noise that doesn't change the average travel time by more than 0.5 s. On another machine, however, the value differs by several seconds. Even if I go back one step and leave out the simulation (as you said, MATSim is not necessarily deterministic), I see differences. By simply running the pipeline, I get differences between the CSV, GPKG and XML.GZ output files. Here is a very simplified way to look at it: I ran "ls -l" in the output folders of the pipeline and compared the sizes of the output files:

Machine 1 & its clone gave me:

-rw-rw-r-- 1 user user 91490994 oct. 12 14:17 npdc_activities.csv
-rw-r--r-- 1 user user 208384000 oct. 12 14:21 npdc_activities.gpkg
-rw-r--r-- 1 user user 18817024 oct. 12 14:22 npdc_commutes.gpkg
-rw-rw-r-- 1 user user 56493 oct. 12 14:41 npdc_config.xml
-rw-rw-r-- 1 user user 5367106 oct. 12 14:41 npdc_facilities.xml.gz
-rw-r--r-- 1 user user 16822272 oct. 12 14:21 npdc_homes.gpkg
-rw-rw-r-- 1 user user 8125136 oct. 12 14:17 npdc_households.csv
-rw-rw-r-- 1 user user 4921067 oct. 12 14:41 npdc_households.xml.gz
-rw-rw-r-- 1 user user 206 oct. 12 11:57 npdc_meta.json
-rw-rw-r-- 1 user user 1557165 oct. 12 14:41 npdc_network.xml.gz
-rw-rw-r-- 1 user user 22160584 oct. 12 14:17 npdc_persons.csv
-rw-rw-r-- 1 user user 81911299 oct. 12 14:41 npdc_population.xml.gz
-rw-rw-r-- 1 user user 50588851 oct. 12 14:41 npdc_run.jar
-rw-rw-r-- 1 user user 110137 oct. 12 14:41 npdc_transit_schedule.xml.gz
-rw-rw-r-- 1 user user 2456 oct. 12 14:41 npdc_transit_vehicles.xml.gz
-rw-rw-r-- 1 user user 80333872 oct. 12 14:17 npdc_trips.csv
-rw-r--r-- 1 user user 248049664 oct. 12 14:26 npdc_trips.gpkg

(Only the file creation times changed between the original and the clone, obviously.)

Machine 2 gave me:

-rw-rw-r-- 1 user user 91490994 oct. 12 15:56 npdc_activities.csv
-rw-r--r-- 1 user user 208453632 oct. 12 16:00 npdc_activities.gpkg
-rw-r--r-- 1 user user 18808832 oct. 12 16:01 npdc_commutes.gpkg
-rw-rw-r-- 1 user user 56493 oct. 12 16:05 npdc_config.xml
-rw-rw-r-- 1 user user 5367568 oct. 12 16:05 npdc_facilities.xml.gz
-rw-r--r-- 1 user user 16830464 oct. 12 16:00 npdc_homes.gpkg
-rw-rw-r-- 1 user user 8125136 oct. 12 15:56 npdc_households.csv
-rw-rw-r-- 1 user user 4921067 oct. 12 16:05 npdc_households.xml.gz
-rw-rw-r-- 1 user user 206 oct. 6 17:29 npdc_meta.json
-rw-rw-r-- 1 user user 1557360 oct. 12 16:05 npdc_network.xml.gz
-rw-rw-r-- 1 user user 22160584 oct. 12 15:56 npdc_persons.csv
-rw-rw-r-- 1 user user 81842105 oct. 12 16:05 npdc_population.xml.gz
-rw-rw-r-- 1 user user 50588853 oct. 12 16:05 npdc_run.jar
-rw-rw-r-- 1 user user 105827 oct. 12 16:05 npdc_transit_schedule.xml.gz
-rw-rw-r-- 1 user user 2456 oct. 12 16:05 npdc_transit_vehicles.xml.gz
-rw-rw-r-- 1 user user 80333872 oct. 12 15:56 npdc_trips.csv
-rw-r--r-- 1 user user 247967744 oct. 12 16:05 npdc_trips.gpkg

We can observe differences in size (and therefore in content, which should be deterministic) in the following files:

npdc_activities.gpkg
npdc_commutes.gpkg
npdc_facilities.xml.gz
npdc_homes.gpkg
npdc_network.xml.gz
npdc_population.xml.gz
npdc_run.jar
npdc_trips.gpkg

Let me know if it makes sense to you. Maybe there is a test I could run to verify the deterministic behaviour on my setup. Or maybe I made a mistake in the configuration of the pipeline.

sebhoerl commented 11 months ago

Yes, that makes sense. Both GPKG and GZ add timestamps to the packaged content, so every time you create the file the size will be slightly different. In the unit tests, we fix this by temporarily resetting this information. The important thing is that the CSV files are deterministic.
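The gzip case can be reproduced in a few lines of Python. This is only an illustration of the header timestamp using the standard library (the XML payload is made up for the example) and not the pipeline's own test code: the same payload compressed with two different modification times gives different bytes, while the decompressed content is identical.

```python
import gzip

# Made-up payload standing in for a MATSim XML file.
payload = b"<population><person id='1'/></population>"

# gzip stores a modification time in its header; mtime is set explicitly
# here so that the two archives differ in exactly that field.
archive_a = gzip.compress(payload, mtime=1)
archive_b = gzip.compress(payload, mtime=2)

raw_bytes_equal = archive_a == archive_b
content_equal = gzip.decompress(archive_a) == gzip.decompress(archive_b)
```

Here `raw_bytes_equal` is False and `content_equal` is True, which is why comparisons of .gz outputs are more meaningful on the unpacked content.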

For the MATSim inputs we know that there may be problems with different orderings of the persons, for instance. But we have never checked determinism of the MATSim simulations themselves. Technically speaking, this is a MATSim problem, not a pipeline demand generation problem :)

It would be nice to know whether you see differences when you run only the MATSim simulations with the same input data on different machines. If the outputs are different, this probably has something to do with random number generation (or with the order of the input information not being preserved when it is read).

vvendi commented 11 months ago

Thank you very much for this answer. I'll try what you just suggested by running the MATSim simulation on two machines with the same pipeline-generated input. I'll let you know the result ASAP.

vvendi commented 11 months ago

Hello again, here is an update from the previous experiment. I took both outputs of the pipeline from my two different machines (below, I refer to the first machine as Machine 1, and to the second machine and its clone as Machine 2 and Machine 3) and used them as inputs for a MATSim simulation on every machine that I have. I again took the metric that is important to me, the average travel time of people using the bus, as it's a quick way to check whether the simulation produced different stats.

Simulation on Machine 1 - Input generated from Machine 1 - average pt travel time : 2755.18
Simulation on Machine 2 - Input generated from Machine 1 - average pt travel time : 2755.18
Simulation on Machine 3 - Input generated from Machine 1 - average pt travel time : 2755.18

Simulation on Machine 1 - Input generated from Machine 2 - average pt travel time : 2744.21
Simulation on Machine 2 - Input generated from Machine 2 - average pt travel time : 2744.21
Simulation on Machine 3 - Input generated from Machine 2 - average pt travel time : 2744.21

As we can see in the previous log, a single pipeline output produces the same stats when used as an input for a MATSim simulation regardless of the machine the simulation runs on. The difference seems to come from the files produced by the pipeline then. I don't really know what to think anymore to be honest 😕
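The reasoning can be made explicit with a small Python sketch that groups the six reported values by each factor. The numbers are the ones from the log above; the helper name is only for illustration.

```python
from collections import defaultdict

# (simulation machine, machine that generated the input) -> average pt travel time
results = {
    (1, 1): 2755.18, (2, 1): 2755.18, (3, 1): 2755.18,
    (1, 2): 2744.21, (2, 2): 2744.21, (3, 2): 2744.21,
}


def distinct_values_by(results, position):
    """Group the metric by one element of the key and keep the distinct values."""
    groups = defaultdict(set)
    for key, value in results.items():
        groups[key[position]].add(value)
    return dict(groups)


by_simulation_machine = distinct_values_by(results, 0)
by_input_origin = distinct_values_by(results, 1)
```

Each input origin maps to a single value, while each simulation machine maps to two values, so the metric varies with where the input was generated and not with where the simulation runs.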

sebhoerl commented 11 months ago

Ok, so this means that up to generating the CSV everything is deterministic (like we cover in the unit tests), but then generating the MATSim files seems to introduce changes. This could be for instance here where the population / households etc. are created: https://github.com/eqasim-org/ile-de-france/tree/develop/matsim/scenario

Or here, where everything is routed (prepare): https://github.com/eqasim-org/ile-de-france/tree/develop/matsim/simulation

The pipeline generates temporary files in whatever is configured as the working_directory. There you will find the preliminary files generated in each stage (for instance, synthesis.population.matched__5f1668f4aba0122d7b75a5d01b3d2952.cache). You could check these files to see where the outputs start diverging (but keep in mind that the *.gz files contain the timestamp, so it would be better to first unpack them using gunzip and then compare, ideally using md5sum rather than just the file size).
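The suggested comparison could be scripted roughly as follows. This is a sketch under assumptions: the function names are made up, files ending in .gz are hashed on their decompressed content (the Python equivalent of running gunzip first), and everything else is hashed as-is.

```python
import gzip
import hashlib
import os


def content_md5(path):
    """MD5 of a file's content; .gz files are decompressed before hashing."""
    opener = gzip.open if path.endswith(".gz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file's path relative to `root` to its content hash."""
    hashes = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            hashes[os.path.relpath(path, root)] = content_md5(path)
    return hashes


def differing_files(hashes_a, hashes_b):
    """Return the sorted relative paths that are missing or differ on one side."""
    names = set(hashes_a) | set(hashes_b)
    return sorted(n for n in names if hashes_a.get(n) != hashes_b.get(n))
```

One way to use it would be to run `hash_tree` on the working directory of each machine, save the resulting dictionaries (for example as JSON), and pass both to `differing_files`.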

Thanks for doing these tests, this is quite valuable! May I ask what institution you are working in? You can also contact me via email (sebastian [dot] horl [at] irt-systemx.fr)

vvendi commented 11 months ago

I compared the md5sum of each temporary file in the "working_directory". The following files appear to be different on my two machines:

data.ban.rawe6a0a181b0beb2bae5318d14849a3447.p
data.bdtopo.raw66a98cf0847208fc6c91e8f2c0065c3e.p
pipeline.json
pipeline.json.bk
synthesis.locations.home.addressesd8eb9e7dddde45017e98d01085964f16.p
synthesis.locations.home.locationsd8eb9e7dddde45017e98d01085964f16.p
synthesis.population.spatial.home.locations27e2ede7a8e8b98318d4189329436654.p
synthesis.population.spatial.locations4038ac4db7f9d81a45bec998f913dfc8.p
synthesis.population.spatial.primary.locationsf1d279e3ec4815d244ad32e9c52bd4fa.p
synthesis.population.spatial.secondary.locations4038ac4db7f9d81a45bec998f913dfc8.p

There are also a number of files in the matsim.runtime.xxxxx.cache, matsim.scenario.xxxxx.cache and matsim.simulation.xxxxx.cache folders that are different, but I assume that they are created after the files above and are not really worth mentioning.

Could the differences come from the parsing of the two files "ban" & "bdtopo" then?

sebhoerl commented 11 months ago

Thanks a lot for this analysis! So basically it looks like loading the BDTOPO and BAN data is not deterministic. We will look into that, unless you want to have a look on your own :)

These are two data sets that have only recently been integrated, so this looks very coherent to me (and it explains why we don't see it in the unit tests -> there we always have a constant input).
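The thread does not say what made the loading non-deterministic. One common cause in data loaders in general, given here purely as an assumption and not as a diagnosis of this pipeline, is that directory listings come back in an arbitrary, filesystem-dependent order, so concatenated records end up in a different order on different machines. A minimal Python sketch of guarding against that (all names hypothetical):

```python
import glob
import os


def list_input_files(folder, pattern="*.csv"):
    """Return matching files in sorted order; glob's own order is arbitrary."""
    return sorted(glob.glob(os.path.join(folder, pattern)))


def load_records(paths):
    """Hypothetical loader: read all lines, then sort them by content.

    Sorting the merged records gives a result that does not depend on
    the order in which the files were read.
    """
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            records.extend(line.rstrip("\n") for line in handle)
    return sorted(records)
```

Sorting both the file list and the merged records is somewhat redundant, but it keeps any later step that relies on row order independent of the machine.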

sebhoerl commented 11 months ago

I found a potential fix with #203. Could you try again with this PR? Or with develop once it is merged (should happen automatically in a couple of minutes).

vvendi commented 11 months ago

It seems to work with this last update! I tried on 3 machines and all of them gave me the same metrics. Thank you very much! I'm closing the issue 😃

eqasim-org / ile-de-france

Different output between two machines #202