chaoran-chen opened this issue 4 weeks ago (status: Open)
What I found so far: on s0, SILO restarts when new data with a new data version becomes available. While the new data version is loading, SILO returns a 503 for all requests. Locally, the new version is loaded without a restart.
The corresponding logs. Local:
silo-1 | [2024-06-13 06:45:22.970] [logger] [info] [logging_request_handler.cpp:21] Request Id [f54bb0b4-a8d0-43e5-8d66-639edc02a8fd] - Responding with status code 200
silo-1 | [2024-06-13 06:45:23.963] [logger] [info] [logging_request_handler.cpp:15] Request Id [c6f964d0-ac57-491e-9a0f-d6081dbd0744] - Handling GET /info
silo-1 | [2024-06-13 06:45:23.971] [logger] [info] [logging_request_handler.cpp:21] Request Id [c6f964d0-ac57-491e-9a0f-d6081dbd0744] - Responding with status code 200
silo-1 | [2024-06-13 06:45:24.315] [logger] [info] [database_directory_watcher.cpp:138] New data version detected: /data/1718261119
silo-1 | [2024-06-13 06:45:24.315] [logger] [info] [database_config.cpp:202] Reading database config from /data/1718261119/database_config.yaml
silo-1 | [2024-06-13 06:45:24.317] [logger] [info] [database.cpp:534] Finished loading partitions from /data/1718261119/aa_sequences.silo
...
silo-1 | [2024-06-13 06:45:29.668] [logger] [info] [database.cpp:564] Finished loading partition data
silo-1 | [2024-06-13 06:45:29.668] [logger] [info] [database.cpp:567] Finished loading data_version from /data/1718261119/data_version.silo
silo-1 | [2024-06-13 06:45:29.673] [logger] [info] [database.cpp:570] Database info after loading: sequence count: 100, total size: 26'585'860, N bitmaps size: 3'370
silo-1 | [2024-06-13 06:45:29.737] [logger] [info] [database_directory_watcher.cpp:154] New database with version /data/1718261119 successfully loaded.
silo-1 | [2024-06-13 06:45:29.965] [logger] [info] [logging_request_handler.cpp:15] Request Id [51ea6cbe-c9f6-4478-9456-77dc14075633] - Handling GET /info
silo-1 | [2024-06-13 06:45:29.973] [logger] [info] [logging_request_handler.cpp:21] Request Id [51ea6cbe-c9f6-4478-9456-77dc14075633] - Responding with status code 200
silo-1 | [2024-06-13 06:45:30.546] [logger] [info] [logging_request_handler.cpp:15] Request Id [c9e42893-0789-4e3e-aa85-91095513b79f] - Handling POST /query
On s0:
[2024-06-12 20:26:53.094] [logger] [info] [logging_request_handler.cpp:15] Request Id [c3af7091-0101-4c46-96f8-da43e5afe8f1] - Handling POST /query
[2024-06-12 20:26:53.094] [logger] [info] [query_handler.cpp:32] Request Id [c3af7091-0101-4c46-96f8-da43e5afe8f1] - received query: {"action":{"groupByFields":["age","hospitalized","died"],"randomize":false,"type":"Aggregated"},"filterExpression":{"children":[{"column":"date","from":"2020-01-06","to":"2024-06-12","type":"DateBetween"},{"children":[{"column":"host","value":"Human","type":"StringEquals"}],"type":"Or"}],"type":"And"}}
[2024-06-12 20:26:53.094] [logger] [info] [logging_request_handler.cpp:15] Request Id [fc3686d7-02a6-4628-a9f2-e643c871a17b] - Handling POST /query
[2024-06-12 20:26:53.094] [logger] [info] [query_handler.cpp:32] Request Id [fc3686d7-02a6-4628-a9f2-e643c871a17b] - received query: {"action":{"groupByFields":["age"],"randomize":false,"type":"Aggregated"},"filterExpression":{"children":[{"column":"date","from":"2020-01-06","to":"2024-06-12","type":"DateBetween"},{"children":[{"column":"nextcladePangoLineage","value":"KP.3","includeSublineages":false,"type":"PangoLineage"}],"type":"Or"},{"children":[{"column":"host","value":"Human","type":"StringEquals"}],"type":"Or"},{"sequenceName":"S","position":346,"symbol":"T","type":"AminoAcidEquals"}],"type":"And"}}
[2024-06-12 20:27:08.603] [logger] [info] [api.cpp:312] Starting SILO
[2024-06-12 20:27:08.603] [logger] [info] [api.cpp:209] Starting SILO API
[2024-06-12 20:27:08.603] [logger] [info] [yaml_file.cpp:11] Reading config from ./runtime_config.yaml
[2024-06-12 20:27:08.604] [logger] [info] [api.cpp:229] Using 10000 queued http connections
[2024-06-12 20:27:08.604] [logger] [info] [api.cpp:232] Using 32 threads for http connections
[2024-06-12 20:27:08.604] [logger] [info] [api.cpp:243] Listening on port 8081
[2024-06-12 20:27:08.625] [logger] [info] [database_directory_watcher.cpp:138] New data version detected: /data/1718186168
[2024-06-12 20:27:08.625] [logger] [info] [database_config.cpp:212] Reading database config from /data/1718186168/database_config.yaml
[2024-06-12 20:27:20.155] [logger] [info] [database.cpp:534] Finished loading partitions from /data/1718186168/aa_sequences.silo
[2024-06-12 20:27:38.647] [logger] [info] [logging_request_handler.cpp:15] Request Id [967e363d-7555-44b9-b9c0-2dff7d7855c1] - Handling GET /info
[2024-06-12 20:27:38.647] [logger] [info] [error_request_handler.cpp:30] Caught exception: Database not initialized yet
[2024-06-12 20:27:38.650] [logger] [info] [logging_request_handler.cpp:21] Request Id [967e363d-7555-44b9-b9c0-2dff7d7855c1] - Responding with status code 503
We certainly have some sort of memory leak in SILO when loading new data versions. I took the Ebola dataset from this LAPIS branch and imported it into SILO. Then I re-ran the SILO preprocessing on the same dataset a couple of times so that SILO loads a new data version each time. I observed the memory usage; these are the results:
| Reload | firstTry | secondTry |
|-------:|---------:|----------:|
| 0 | 146 MB | 146 MB |
| 1 | 294 MB | 310 MB |
| 2 | 417 MB | 461 MB |
| 3 | 562 MB | 568 MB |
| 4 | 553 MB | 560 MB |
| 5 | 594 MB | 579 MB |
| 6 | 669 MB | |
| 7 | 664 MB | |
| 8 | 629 MB | |
(Note that the memory usage of SILO fluctuates a bit anyway. The time interval between the reprocessings was not constant.)
It seems that memory usage always increases initially but then reaches a plateau. I was also able to observe the initial ramp-up with the small dummy dataset.
The COVID LAPIS is down for 10-15 minutes shortly after every run of the ingest pipeline.