Out of memory error on merge step

Fiwx commented 2 months ago

Description of the bug

pgsc_calc repeatedly fails with out-of-memory errors during the pgscatalog-matchmerge step. The process is killed while filtering best_match variants and calculating the overlap between the target genome and the scoring file. This crashes the pipeline, and resuming requires manual intervention, which occasionally works but usually does not. With 32 GB of RAM, the pgscatalog-matchmerge process uses up to 68 GB of virtual memory and about 30 GB of physical memory, spills into swap, and crashes. It also does not run to completion with 62 GB of RAM. I am using 63 (local) scorefiles in the run.

Apologies for the multiple opened issues; they appear to be different issues on different steps.

Here is the issue:

/home/user/org/runner/test/test1_file8270_yofsample_uk_s_uk.23andme/work/b6/78b553f872176c7c50419d0f1bcda6/.command.sh: line 9: 1357 Killed pgscatalog-matchmerge --dataset test1file8270yofsampleuksuk --scorefile scorefiles.txt.gz --matches *.ipc.zst --min_overlap 0.0 --filter_IDs filter_ids.txt.gz --outdir $PWD --combined -v

After each kill, the system attempted to reclaim memory, as indicated by "oom_reaper" messages. The pgscatalog-matchmerge command was consistently failing at the stage of "Filtering to best_match variants" and "Calculating overlap between target genome and scoring file".
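
For anyone trying to reproduce the memory figures above, here is a minimal sketch of one way to record the peak resident memory of a command from Python. It only uses the standard library (`resource`, `subprocess`); the function name `peak_child_rss` is made up for illustration, and the unit of `ru_maxrss` is platform-dependent (kibibytes on Linux, bytes on macOS). It is not part of pgsc_calc.

```python
import resource
import subprocess
import sys
from typing import List


def peak_child_rss(command: List[str]) -> int:
    """Run a command to completion and return the peak resident set size
    recorded for this process's waited-for children.

    The value is the maximum over all children waited for so far, so call
    this from a fresh interpreter when measuring a single command.
    Units: kibibytes on Linux, bytes on macOS.
    """
    # check=False: a command killed by the OOM killer exits non-zero,
    # and we still want to read its resource usage afterwards.
    subprocess.run(command, check=False)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Hypothetical stand-in command; replace with the command under test.
    peak = peak_child_rss([sys.executable, "-c", "x = bytearray(10_000_000)"])
    print(f"peak child RSS: {peak}")
```

Note that this reports physical (resident) memory only, not virtual memory, so it corresponds to the "about 30 GB" figure rather than the 68 GB one.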

Command used and terminal output

Command.log output (before the crash):

pgscatalog.core.cli.combine_cli: 2024-08-23 12:08:55 DEBUG    Verbose logging enabled
pgscatalog.core.cli.combine_cli: 2024-08-23 12:08:55 DEBUG    Compressing output with gzip
  0%|          | 0/66 [00:00<?, ?it/s]pgscatalog.core.cli.combine_cli: 2024-08-23 12:08:56 INFO     Processing PGS000014
  2%|▏         | 1/66 [01:48<1:57:35, 108.54s/it]pgscatalog.core.cli.combine_cli: 2024-08-23 12:10:44 INFO     Processing PGS000016
  3%|▎         | 2/66 [03:37<1:55:50, 108.60s/it]pgscatalog.core.cli.combine_cli: 2024-08-23 12:12:33 INFO     Processing PGS000017
[...]
pgscatalog.core.lib._normalise: 2024-08-23 12:24:22 WARNING  76 of 1059939 variants are duplicated in: PGS004516_hmPOS_GRCh37
 97%|█████████▋| 64/66 [15:32<00:29, 14.98s/it]pgscatalog.core.cli.combine_cli: 2024-08-23 12:24:28 INFO     Processing PGS004688
 98%|█████████▊| 65/66 [15:51<00:15, 15.95s/it]pgscatalog.core.cli.combine_cli: 2024-08-23 12:24:47 INFO     Processing PGS004696
100%|██████████| 66/66 [16:13<00:00, 14.75s/it]
pgscatalog.core.lib.scorefiles: 2024-08-23 12:25:09 WARNING  Mismatch between header (30) and output row count (33) for PGS000021
pgscatalog.core.lib.scorefiles: 2024-08-23 12:25:09 WARNING  This can happen with older scoring files in the PGS Catalog (e.g. PGS000028)
pgscatalog.core.lib.scorefiles: 2024-08-23 12:25:09 WARNING  Mismatch between header (67) and output row count (85) for PGS000024
pgscatalog.core.lib.scorefiles: 2024-08-23 12:25:09 WARNING  This can happen with older scoring files in the PGS Catalog (e.g. PGS000028)
pgscatalog.core.lib.scorefiles: 2024-08-23 12:25:09 WARNING  Mismatch between header (31) and output row count (33) for PGS000026
pgscatalog.core.lib.scorefiles: 2024-08-23 12:25:09 WARNING  This can happen with older scoring files in the PGS Catalog (e.g. PGS000028)
pgscatalog.core.lib.scorefiles: 2024-08-23 12:25:09 WARNING  Mismatch between header (165) and output row count (169) for PGS003405
pgscatalog.core.lib.scorefiles: 2024-08-23 12:25:09 WARNING  This can happen with older scoring files in the PGS Catalog (e.g. PGS000028)
pgscatalog.core.cli.combine_cli: 2024-08-23 12:25:09 INFO     Writing log to log_scorefiles.json
pgscatalog.core.cli.combine_cli: 2024-08-23 12:25:09 INFO     Combining complete

[This step worked.]

Process > PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (user10006file8271yofsampleuksuk chromosome ALL) - Completed

Command.log output:

pgscatalog.match.cli.merge_cli: 2024-08-23 12:52:06 DEBUG    Verbose logging enabled
pgscatalog.match.cli.merge_cli: 2024-08-23 12:52:06 INFO     --cleanup set (default), temporary files will be deleted
pgscatalog.match.lib.scoringfileframe: 2024-08-23 12:52:06 DEBUG    Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
pgscatalog.match.lib.scoringfileframe: 2024-08-23 12:52:38 DEBUG    ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
pgscatalog.match.lib._match.preprocess: 2024-08-23 12:52:38 DEBUG    Complementing column effect_allele
pgscatalog.match.lib._match.preprocess: 2024-08-23 12:52:38 DEBUG    Complementing column other_allele
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Labelling best match type (refalt > altref > ...)
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Labelling duplicated best match: keeping first instance as best_match = True
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Labelling multiple scoring file lines (accession/row_nr) that best_match to the same variant
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Labelling all duplicates with exclude flag
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Labelling ambiguous variants
pgscatalog.match.lib._match.preprocess: 2024-08-23 12:52:38 DEBUG    Complementing column REF
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Labelling ambiguous variants with exclude flag
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Labelling multiallelic matches with exclude flag
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Not excluding flipped matches
pgscatalog.match.lib._match.label: 2024-08-23 12:52:38 DEBUG    Reading filter file (variant IDs)
pgscatalog.match.lib._match.label: 2024-08-23 12:52:48 DEBUG    Excluding variants that are not in ID list (read 27904792 IDs)
pgscatalog.match.lib._match.filter: 2024-08-23 12:52:48 DEBUG    Filtering to best_match variants (with exclude flag = False)
pgscatalog.match.lib._match.filter: 2024-08-23 12:52:48 DEBUG    Calculating overlap between target genome and scoring file
/home/user/org/runner/test/test1_file8270_yofsample_uk_s_uk.23andme/work/b6/78b553f872176c7c50419d0f1bcda6/.command.sh: line 9:  1357 Killed                  pgscatalog-matchmerge --dataset test1file8270yofsampleuksuk --scorefile scorefiles.txt.gz --matches *.ipc.zst --min_overlap 0.0 --filter_IDs filter_ids.txt.gz --outdir $PWD --combined -v
Output of command.log after retrying:

pgscatalog.match.cli.merge_cli: 2024-08-23 13:08:35 DEBUG    Verbose logging enabled
pgscatalog.match.cli.merge_cli: 2024-08-23 13:08:35 INFO     --cleanup set (default), temporary files will be deleted
pgscatalog.match.lib.scoringfileframe: 2024-08-23 13:08:35 DEBUG    Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
pgscatalog.match.lib.scoringfileframe: 2024-08-23 13:09:03 DEBUG    ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
pgscatalog.match.lib._match.preprocess: 2024-08-23 13:09:03 DEBUG    Complementing column effect_allele
pgscatalog.match.lib._match.preprocess: 2024-08-23 13:09:03 DEBUG    Complementing column other_allele
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Labelling best match type (refalt > altref > ...)
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Labelling duplicated best match: keeping first instance as best_match = True
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Labelling multiple scoring file lines (accession/row_nr) that best_match to the same variant
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Labelling all duplicates with exclude flag
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Labelling ambiguous variants
pgscatalog.match.lib._match.preprocess: 2024-08-23 13:09:03 DEBUG    Complementing column REF
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Labelling ambiguous variants with exclude flag
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Labelling multiallelic matches with exclude flag
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Not excluding flipped matches
pgscatalog.match.lib._match.label: 2024-08-23 13:09:03 DEBUG    Reading filter file (variant IDs)
pgscatalog.match.lib._match.label: 2024-08-23 13:09:11 DEBUG    Excluding variants that are not in ID list (read 27904792 IDs)
pgscatalog.match.lib._match.filter: 2024-08-23 13:09:11 DEBUG    Filtering to best_match variants (with exclude flag = False)
pgscatalog.match.lib._match.filter: 2024-08-23 13:09:11 DEBUG    Calculating overlap between target genome and scoring file
/home/user/org/runner/test/test1_file8270_yofsample_uk_s_uk.23andme/work/31/61fb5b2cf28c97215dfe280e0b13ee/.command.sh: line 9:  1256 Killed                  pgscatalog-matchmerge --dataset test1file8270yofsampleuksuk --scorefile scorefiles.txt.gz --matches *.ipc.zst --min_overlap 0.0 --filter_IDs filter_ids.txt.gz --outdir $PWD --combined -v

Relevant files

No response

System information

pgscatalog/pgsc_calc: v2.0.0-beta.3
profile: Singularity
RAM: tested on both 64 GB and 32 GB
Nextflow version: 24.04.4

smlmbrt commented 1 month ago

The solution here would be to run it on fewer scores at once, or to allocate it more memory.
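
As a rough illustration of "fewer scores at once", the sketch below splits a list of local scoring files into smaller batches, each of which could then be given to a separate pipeline run. The function name, the file names, and the batch size of 16 are all assumptions made for the example; how each batch is handed to pgsc_calc (for instance, one directory per batch) is left open here.

```python
from typing import Iterator, List


def batch_scorefiles(paths: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size scoring-file paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


if __name__ == "__main__":
    # Hypothetical file names standing in for 63 local scoring files.
    scorefiles = [f"scores/score_{i:02d}.txt.gz" for i in range(63)]
    for number, batch in enumerate(batch_scorefiles(scorefiles, 16), start=1):
        print(f"batch {number}: {len(batch)} files, first is {batch[0]}")
```

A smaller batch size lowers the peak memory of each run at the cost of more runs; the right value depends on how many variants the scoring files contain.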

Fiwx commented 1 month ago

Thanks, @smlmbrt! Is the memory-intensive part likely the "Calculating overlap between target genome and scoring file" step here, or do you think that step comes right before the memory-intensive one?

smlmbrt commented 1 month ago

Perhaps, but it could also be the step right after (pivoting the df and writing it to disk) that doesn't complete. Either way, there's no getting around the fact that the more variants you have in scoring files, the more memory it will take.
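
To give a feel for why a pivot can be expensive, here is a toy sketch that compares the number of rows in a long table (one row per variant per score) with the number of cells in a wide table (one row per unique variant, one column per score). This is only an assumption about the general shape of such a pivot, not a description of the pgscatalog-matchmerge implementation, and the names and data are invented for the example.

```python
from typing import Dict, Set, Tuple


def long_rows_and_wide_cells(variants_per_score: Dict[str, Set[str]]) -> Tuple[int, int]:
    """Return (rows in long format, cells in wide format).

    Long format: one row for every (score, variant) pair that has a weight.
    Wide format: one row per unique variant and one column per score, so
    variants missing from a score still occupy a cell.
    """
    long_rows = sum(len(variants) for variants in variants_per_score.values())
    unique_variants: Set[str] = set().union(*variants_per_score.values())
    wide_cells = len(unique_variants) * len(variants_per_score)
    return long_rows, wide_cells


if __name__ == "__main__":
    # Toy data: three scores with little overlap between their variants.
    toy = {
        "score_a": {"v1", "v2", "v3"},
        "score_b": {"v3", "v4"},
        "score_c": {"v5"},
    }
    rows, cells = long_rows_and_wide_cells(toy)
    print(f"long rows: {rows}, wide cells: {cells}")
```

When scores share few variants, the wide table grows with (unique variants x number of scores), which is one reason that running fewer scores per batch reduces memory use.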
