EC2 instance not using available memory?

shriv commented 6 months ago

Hi guys, I'm running gelato over a baseline MATSim simulation with Docker on an EC2 instance (r5a.4xlarge). It's been running for several hours but CPU usage is only 7% and memory usage is only around 13GB (far less than the 128GB available). I used the following Docker command and possibly gone awry somewhere! 😓

docker run \
-v ${PWD}/monty-data/:/my-model-outputs \
-v ${PWD}/monty-data/:/gelato-out \
-e JVM_OPTS="-Xmx100G" \
-e JVM_OPTS="-XX:MaxRAMPercentage=80.0" \
gelato \
-mc /my-model-outputs/output_config.xml \
-mo /my-model-outputs \
-o /gelato-out

mfitz commented 6 months ago

Hi @shriv!

Long time, no speak 😄

Thanks for this. I don't think there's anything wrong with your Docker command. The two JVM params are fairly much either/or (an absolute value for max heap size or a percentage of the memory available to the container), but I doubt that is causing a problem.

I think it's much more likely that Gelato is just struggling with your particular model for some reason.

A few questions:

Which version of Gelato are you using? Presumably, you cloned the repo and built the Docker image locally? What does docker run gelato --version give you?
What does the log output look like? Is it progressing at all, or does it seem to hang? Could you share the log?
What does your model look like?
- How many events are in the events file?
- Does your model include road pricing? Are there person money events in the events file?
- How many agents in the sim?
- How many nodes and links are in the network file?
- Which version of MATSim did you use?
- Could you share the MATSim output files so I can run it myself?

mfitz commented 6 months ago

One minor thing:

-e JVM_OPTS="-Xmx100G" \ -e JVM_OPTS="-XX:MaxRAMPercentage=80.0" \

The JVM_OPTS env var should contain all of your JVM params in a single place. If you need to set more than one, use a space-separated list inside the same string, as in this example:

docker run \
-v /path/to/my-model/outputs/:/my-model-outputs \
-v /path/to/gelato-outputs/my-model:/gelato-out \
-e JVM_OPTS="-Xmx100G -XX:+PrintFlagsFinal" \
gelato \
-mc /my-model-outputs/output_config.xml \
-mo /my-model-outputs \
-o /gelato-out

I am not sure what happens when you set two different values for the same env var, which is what I think you're doing here. Maybe the second value overrides the first. In any case, even if you successfully set both of those JVM params, you would potentially have set two different values for max heap size (unless 80% of available RAM works out to 100GB in your environment, in which case you would be setting the same max heap size twice).

Again, though, I don't think this is likely to be causing a problem.

To be fair, the docs don't make this clear, so I created #86 to fix that.

shriv commented 6 months ago

Hi @mfitz Thanks heaps for your suggestions :-) Some notes below on my previous and current runs. I've now changed the Docker run command to just have one JVM opts and running on a different machine (r5a.2xlarge) to see if it makes a difference.

Previous run on 128GB RAM

Gelato version is 0.0.4-alpha
I was running it on a 1.6GB events file on instance type: r5a.4xlarge # 128 gb (running ubuntu 22.04)
The following files were not in the sim outputs
- output_links.csv.gz
- output_allVehicles.xml.gz
- output_activities.csv.gz

I see tolls pricing in the config

<module name="montyPricing" >
<param name="tollCap" value="-50.0" />
<param name="tollLinksFile" value="/mot-non-production-simulations/applications/pt_variations_2018/inputs/tolls.xml.gz" />
</module>

Current run on 64GB RAM

In progress

docker run \
-v ${PWD}/monty-data:/my-model-outputs \
-v ${PWD}/gelato-output:/gelato-out \
-e JVM_OPTS="-Xmx60G -XX:+PrintFlagsFinal" \
gelato \
-mc /my-model-outputs/output_config.xml \
-mo /my-model-outputs \
-o /gelato-out

shriv commented 6 months ago

Had another go (as per command in previous comment) and got RAM to use up to 20G of the 64GB available and ~15% CPU but no outputs even after running for 8hours. I had the following output during the processing re. your question on number of events being processed. Let me know if I just need to kick it off and run it overnight? If I do that, I assume I can just run Docker as a detached process with docker -d ... ?

2024-05-12T21:24:41,594  INFO MatsimKpiGenerator:119 Recorded 104,249,952 relevant MATSim events from /my-model-outputs/output_events.xml.gz
2024-05-12T21:24:41,637  INFO MatsimKpiGenerator:121 {
  "vehicle enters traffic" : 1139673,
  "PersonLeavesVehicle" : 1229877,
  "entered link" : 49755426,
  "PersonEntersVehicle" : 1229894,
  "vehicle leaves traffic" : 1139656,
  "left link" : 49755426
}
2024-05-12T21:24:41,638  INFO MatsimKpiGenerator:119 Recorded 598,352 relevant MATSim events from /my-model-outputs/output_events.xml.gz
2024-05-12T21:24:41,646  INFO MatsimKpiGenerator:121 {
  "personMoney" : 598352
}

mfitz commented 6 months ago

Let me know if I just need to kick it off and run it overnight? If I do that, I assume I can just run Docker as a detached process with docker -d ... ?

By all means, leave it running - it will be interesting to see what happens.

What was the last thing logged to the console? That will give us a big clue about what it's doing and where it is spending the most time. We've successfully run Gelato on similar-sized models before now, but that was before we introduced the support for person money events, IIRC.

If you do leave it running, could you capture the log output in a file so we can analyse it later? A low-tech way to do that is to pipe the output of the Docker command to a file (> /path/to/file/gelato.log)

I have what I assume is a subset of a NZ model (not sure which) that came to me via @gac55. I think it was originally from Patrick. Gelato handles that comfortably in around 3 minutes, but the events file is a lot smaller than yours:

2024-05-10T14:52:25,773  INFO MatsimKpiGenerator:119 Recorded 10,740,360 relevant MATSim events from /nz-model-outputs/output_events.xml.gz
2024-05-10T14:52:25,800  INFO MatsimKpiGenerator:121 {
  "vehicle enters traffic" : 53733,
  "PersonLeavesVehicle" : 54626,
  "entered link" : 5261821,
  "PersonEntersVehicle" : 54642,
  "vehicle leaves traffic" : 53717,
  "left link" : 5261821
}
2024-05-10T14:52:25,801  INFO MatsimKpiGenerator:119 Recorded 6,192 relevant MATSim events from /nz-model-outputs/output_events.xml.gz
2024-05-10T14:52:25,804  INFO MatsimKpiGenerator:121 {
  "personMoney" : 6192
}

Are you able to share all of your MATSim output files? If we can get our hands on those, we can run Gelato against them and try and debug it locally.

Also, if you're able to run Gelato in the JVM directly, rather than via Docker, I'd be interested to know if the behaviour is the same. That might be worth a try, just to give us another data point.

shriv commented 6 months ago

YAY! The gelato run finished last night around 2am. I've saved the logs and outputs to S3 for analysis. I'll be taking a look later this week or early next. I've uploaded the log if you want to take a look at it @mfitz :-)

gelato.log

mfitz commented 6 months ago

gelato.log

Great stuff @shriv - thanks!

From a quick look at the log:

1 - the entire process was very slow 2 - it looks like building the tables for the road charges was the dominant factor

I'll do a little more analysis later, but this is very helpful 👍

mfitz commented 5 months ago

Hi Shriv

When you get a minute, could you fill in this short Gelato feedback form?

arup-group / gelato

EC2 instance not using available memory? #85

Previous run on 128GB RAM

Current run on 64GB RAM