GenSpectrum / LAPIS-SILO

Sequence Indexing engine for Large Order of genomic data
GNU Affero General Public License v3.0
10 stars 3 forks source link

SILO (or LAPIS) is down after loading new data #470

Open chaoran-chen opened 4 weeks ago

chaoran-chen commented 4 weeks ago

The COVID LAPIS is down for 10-15 minutes every time shortly after the ingest pipeline has finished.

JonasKellerer commented 3 weeks ago

What I found so far: SILO restarts when there is new data with a new data version on s0. This causes SILO to return a 503 until the data is loaded. This does not happen locally.

The corresponding logs: local:

silo-1  | [2024-06-13 06:45:22.970] [logger] [info] [logging_request_handler.cpp:21] Request Id [f54bb0b4-a8d0-43e5-8d66-639edc02a8fd] - Responding with status code 200
silo-1  | [2024-06-13 06:45:23.963] [logger] [info] [logging_request_handler.cpp:15] Request Id [c6f964d0-ac57-491e-9a0f-d6081dbd0744] - Handling GET /info
silo-1  | [2024-06-13 06:45:23.971] [logger] [info] [logging_request_handler.cpp:21] Request Id [c6f964d0-ac57-491e-9a0f-d6081dbd0744] - Responding with status code 200
silo-1  | [2024-06-13 06:45:24.315] [logger] [info] [database_directory_watcher.cpp:138] New data version detected: /data/1718261119
silo-1  | [2024-06-13 06:45:24.315] [logger] [info] [database_config.cpp:202] Reading database config from /data/1718261119/database_config.yaml
silo-1  | [2024-06-13 06:45:24.317] [logger] [info] [database.cpp:534] Finished loading partitions from /data/1718261119/aa_sequences.silo
...
silo-1  | [2024-06-13 06:45:29.668] [logger] [info] [database.cpp:564] Finished loading partition data
silo-1  | [2024-06-13 06:45:29.668] [logger] [info] [database.cpp:567] Finished loading data_version from /data/1718261119/data_version.silo
silo-1  | [2024-06-13 06:45:29.673] [logger] [info] [database.cpp:570] Database info after loading: sequence count: 100, total size: 26'585'860, N bitmaps size: 3'370
silo-1  | [2024-06-13 06:45:29.737] [logger] [info] [database_directory_watcher.cpp:154] New database with version /data/1718261119 successfully loaded.
silo-1  | [2024-06-13 06:45:29.965] [logger] [info] [logging_request_handler.cpp:15] Request Id [51ea6cbe-c9f6-4478-9456-77dc14075633] - Handling GET /info
silo-1  | [2024-06-13 06:45:29.973] [logger] [info] [logging_request_handler.cpp:21] Request Id [51ea6cbe-c9f6-4478-9456-77dc14075633] - Responding with status code 200
silo-1  | [2024-06-13 06:45:30.546] [logger] [info] [logging_request_handler.cpp:15] Request Id [c9e42893-0789-4e3e-aa85-91095513b79f] - Handling POST /query

On s0:

[2024-06-12 20:26:53.094] [logger] [info] [logging_request_handler.cpp:15] Request Id [c3af7091-0101-4c46-96f8-da43e5afe8f1] - Handling POST /query
[2024-06-12 20:26:53.094] [logger] [info] [query_handler.cpp:32] Request Id [c3af7091-0101-4c46-96f8-da43e5afe8f1] - received query: {"action":{"groupByFields":["age","hospitalized","died"],"randomize":false,"type":"Aggregated"},"filterExpression":{"children":[{"column":"date","from":"2020-01-06","to":"2024-06-12","type":"DateBetween"},{"children":[{"column":"host","value":"Human","type":"StringEquals"}],"type":"Or"}],"type":"And"}}
[2024-06-12 20:26:53.094] [logger] [info] [logging_request_handler.cpp:15] Request Id [fc3686d7-02a6-4628-a9f2-e643c871a17b] - Handling POST /query
[2024-06-12 20:26:53.094] [logger] [info] [query_handler.cpp:32] Request Id [fc3686d7-02a6-4628-a9f2-e643c871a17b] - received query: {"action":{"groupByFields":["age"],"randomize":false,"type":"Aggregated"},"filterExpression":{"children":[{"column":"date","from":"2020-01-06","to":"2024-06-12","type":"DateBetween"},{"children":[{"column":"nextcladePangoLineage","value":"KP.3","includeSublineages":false,"type":"PangoLineage"}],"type":"Or"},{"children":[{"column":"host","value":"Human","type":"StringEquals"}],"type":"Or"},{"sequenceName":"S","position":346,"symbol":"T","type":"AminoAcidEquals"}],"type":"And"}}
[2024-06-12 20:27:08.603] [logger] [info] [api.cpp:312] Starting SILO
[2024-06-12 20:27:08.603] [logger] [info] [api.cpp:209] Starting SILO API
[2024-06-12 20:27:08.603] [logger] [info] [yaml_file.cpp:11] Reading config from ./runtime_config.yaml
[2024-06-12 20:27:08.604] [logger] [info] [api.cpp:229] Using 10000 queued http connections
[2024-06-12 20:27:08.604] [logger] [info] [api.cpp:232] Using 32 threads for http connections
[2024-06-12 20:27:08.604] [logger] [info] [api.cpp:243] Listening on port 8081
[2024-06-12 20:27:08.625] [logger] [info] [database_directory_watcher.cpp:138] New data version detected: /data/1718186168
[2024-06-12 20:27:08.625] [logger] [info] [database_config.cpp:212] Reading database config from /data/1718186168/database_config.yaml
[2024-06-12 20:27:20.155] [logger] [info] [database.cpp:534] Finished loading partitions from /data/1718186168/aa_sequences.silo
[2024-06-12 20:27:38.647] [logger] [info] [logging_request_handler.cpp:15] Request Id [967e363d-7555-44b9-b9c0-2dff7d7855c1] - Handling GET /info
[2024-06-12 20:27:38.647] [logger] [info] [error_request_handler.cpp:30] Caught exception: Database not initialized yet
[2024-06-12 20:27:38.650] [logger] [info] [logging_request_handler.cpp:21] Request Id [967e363d-7555-44b9-b9c0-2dff7d7855c1] - Responding with status code 503
fengelniederhammer commented 2 weeks ago

We certainly have some sort of memory leak in SILO when loading new data versions. I took the Ebola dataset from this LAPIS branch and imported it in SILO. Then I restarted the SILO preprocessing on the same dataset a couple times so that SILO loads a new data version. I observed the memory, those are the results:

   firstTry secondTry
0  146 MB   146 MB
1  294 MB   310 MB
2  417 MB   461 MB
3  562 MB   568 MB
4  553 MB   560 MB
5  594 MB   579 MB
6  669 MB
7  664 MB
8  629 MB

(Note that the memory usage of SILO fluctuates a bit anyway. The time interval between the reprocessings was not constant.)

I seems that initially, it always increases the memory usage, but reaches a plateau then. I was also able to observe the initial ramp up with the small dummy dataset.