NASA-AMMOS / aerie

A software framework for modeling spacecraft.
https://nasa-ammos.github.io/aerie-docs/
MIT License

Sequencing service crashes attempting to process command expansion #1602

Open parkerabercrombie opened 2 weeks ago

parkerabercrombie commented 2 weeks ago

Checked for duplicates

No - I haven't checked

Is this a regression?

No - This is a new bug

Version

2.11.2

Describe the bug

The Aerie sequencing service is crashing when attempting to expand one of our plans. We attempted to expand the plan and observed that the request appeared to hang while server memory was pegged at 93%. We restarted the service and resubmitted the expansion request. Again the request appeared to hang, and the service appeared to crash and restart itself. On the third attempt the expansion succeeded.
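To confirm whether memory grows during an expansion, a small logger inside a Node.js process can help. This is a minimal sketch, not Aerie code: `formatMemory` and `startMemoryLogging` are hypothetical helper names, and only `process.memoryUsage()` is a real Node.js API.

```typescript
// Hypothetical helpers (not part of Aerie): summarize Node.js memory usage in MiB.
interface MemorySample {
  rss: number; // resident set size, in bytes
  heapUsed: number; // V8 heap in use, in bytes
  heapTotal: number; // V8 heap reserved, in bytes
}

// Format a sample as a single log line.
function formatMemory(sample: MemorySample): string {
  const mib = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(1);
  return `rss=${mib(sample.rss)}MiB heapUsed=${mib(sample.heapUsed)}MiB heapTotal=${mib(sample.heapTotal)}MiB`;
}

// Log one sample every intervalMs; returns a function that stops the logging.
function startMemoryLogging(
  intervalMs: number,
  log: (line: string) => void = console.log,
): () => void {
  const timer = setInterval(() => log(formatMemory(process.memoryUsage())), intervalMs);
  return () => clearInterval(timer);
}
```

Calling `const stop = startMemoryLogging(5000)` before an expansion and `stop()` afterwards would show whether `rss` keeps climbing between requests.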

Reproduction

Occurred on 2 of 3 attempts to expand our cruise002 plan.

Logs

This error appears in logs (not sure if related to crash):

2024-11-12 14:48:33.813     at trimPrefix (/app/node_modules/router/index.js:330:13)
2024-11-12 14:48:33.813     at Layer.handleRequest (/app/node_modules/router/lib/layer.js:101:15)
2024-11-12 14:48:33.813     at jsonParser (/app/node_modules/body-parser/lib/types/json.js:110:7)
2024-11-12 14:48:33.813     at next (/app/node_modules/router/index.js:282:5)
2024-11-12 14:48:33.813     at processParams (/app/node_modules/router/index.js:568:12)
2024-11-12 14:48:33.813     at /app/node_modules/router/index.js:291:7
2024-11-12 14:48:33.813     at trimPrefix (/app/node_modules/router/index.js:330:13)
2024-11-12 14:48:33.813     at Layer.handleRequest (/app/node_modules/router/lib/layer.js:101:15)
2024-11-12 14:48:33.813     at file:///app/build/app.js:54:62
2024-11-12 14:48:33.813     at getHasuraSession (file:///app/build/utils/hasura.js:57:11)
2024-11-12 14:48:33.813 Error: Could not determine the user sending the request
2024-11-12 14:48:33.813 ERROR app - Could not determine the user sending the request

Full log: Aerie Sequencing Logs-data-2024-11-12 15_31_27.csv


System Info

Chrome

Severity

Critical

goetzrrGit commented 1 week ago

Aerie version 2.11.2 did not use the jemalloc memory allocator and instead relied on the default memory allocator used by Node.js. This was addressed in Aerie version 2.14, where jemalloc was added to the sequencing server Dockerfile. Here is the PR where this change was added:

https://github.com/NASA-AMMOS/aerie/pull/1487

Aerie 2.14 and later include this update, which fixes the memory leak problem. I verified this with the Clipper data that @parkerabercrombie sent to the Aerie team. After 32 expansion runs, memory held steady and was cleaned up at 4 GB, with no server crashes.
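After upgrading, one way to check that jemalloc is actually in use is to look for it among the libraries mapped into the running Node.js process. The sketch below assumes a Linux container and assumes jemalloc is loaded as a shared library (for example through `LD_PRELOAD`); the PR above is the authority on how the Dockerfile really enables it. `mapsMentionJemalloc` and `isJemallocLoaded` are hypothetical helper names.

```typescript
import { readFileSync } from "node:fs";

// True if any line of a /proc/<pid>/maps listing mentions a jemalloc library.
function mapsMentionJemalloc(mapsText: string): boolean {
  return mapsText.split("\n").some((line) => line.includes("jemalloc"));
}

// Linux only: inspect the current process.
// Returns undefined where /proc/self/maps cannot be read (e.g. macOS, Windows).
function isJemallocLoaded(): boolean | undefined {
  try {
    return mapsMentionJemalloc(readFileSync("/proc/self/maps", "utf8"));
  } catch {
    return undefined;
  }
}
```

A result of `false` inside the sequencing container would suggest the image predates the allocator change or that the preload is not taking effect.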

FYI

For an optimal experience with sequence expansion, we recommend upgrading your virtual machine (VM) configuration to include a more powerful CPU and increased RAM. This will help ensure smoother performance and faster processing times.

Keep in mind that there is a specific bottleneck on the CPU. From our logs, it appears that you'll need to wait approximately 13 minutes after a server restart before expanding your plan into sequences, as the server needs time to transpile the expansion logic files. For reference, on a Mac M1 or above, this processing time can be reduced to around 2 minutes.
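If expansion requests are scripted, the wait after a restart can be expressed as a polling loop rather than a fixed sleep. This is only a sketch: `waitUntilReady` is a hypothetical helper, and the thread does not say how to detect that transpilation has finished, so the `isReady` check is left to the caller.

```typescript
// Hypothetical helper: poll `isReady` until it resolves to true or the timeout elapses.
// A rejected check (e.g. the server is still restarting) counts as "not ready yet".
async function waitUntilReady(
  isReady: () => Promise<boolean>,
  intervalMs: number,
  timeoutMs: number,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const ready = await isReady().catch(() => false);
    if (ready) return true;
    // Give up if another full interval would run past the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

For the timings mentioned above, something like `waitUntilReady(check, 30_000, 15 * 60_000)` would poll every 30 seconds for up to 15 minutes, where `check` is whatever readiness signal is available in your deployment.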

parkerabercrombie commented 6 days ago

Thanks @goetzrrGit. We'll try increasing the resources on that server.