Open Fiwx opened 1 month ago
I believe there may be a problem with heapq or something similar in intersect_cli.py. Running the same command with ~20 scores worked fine.
This step with intersect_cli.py shouldn't be dependent on the number of scores. It sounds like it was just a random error or an out-of-memory bug?
When attempting to merge these temporary files and write to the final output, the system ran out of available RAM. This memory exhaustion caused the heapq.merge operation to fail silently, right after the file is opened and the header is written. As a result, only the header was written to the reference_variants.txt.gz file before the process was interrupted. Inspecting the Nextflow run did not confirm or reject this idea. Perhaps it needs more than 4 GB?
We will look into that (cc @nebfield)
I don't know why, but I can replicate the error when running many scores, and it goes away with fewer scores. If this step is truly not dependent on the number of scores, perhaps some other step is using more resources in the background and this task is the one that fails?
If you run the .command.run script in the failed job's work directory alone, does it run to completion or fail?
The same error/failure occurs with .command.run.
If you edit that script to request more memory does it solve the problem?
Is there a way to do this at the beginning of the run? Such as a configuration file that can be modified?
Could replace the process_low label with process_high_memory.
Thank you; I will try that. I see the memory label in conf/base.config.
Then, in modules/local/ancestry/intersect_variants.nf, I will change:
process INTERSECT_VARIANTS {
// labels are defined in conf/modules.config
label 'process_single'
label 'pgscatalog_utils' // controls conda, docker, + singularity options
to:
process INTERSECT_VARIANTS {
// labels are defined in conf/modules.config
label 'process_single'
label 'pgscatalog_utils' // controls conda, docker, + singularity options
label 'process_high_memory'
Description of the bug
reference_variants.txt.gz is effectively empty: it contains only the header. This problem did not occur when running ~30 scores, but it does occur when running ~100 scores. This causes the pgscatalog-intersect step to crash.
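A quick way to confirm the symptom is to count the data rows in the compressed file. The helper below is only a sketch: count_data_rows and demo_header_only are hypothetical names, and it assumes the file has a single header line as shown in this report.

```python
import gzip
import os
import tempfile


def count_data_rows(path, n_header_lines=1):
    """Count the lines that follow the header in a gzip-compressed text file."""
    with gzip.open(path, "rt") as f:
        total = sum(1 for _ in f)  # streams the file line by line
    return max(total - n_header_lines, 0)


def demo_header_only():
    """Write a header-only file like the one reported and count its rows."""
    header = "CHR:POS:A0:A1\tID_REF\tREF_REF\tIS_INDEL\tSTRANDAMB\tIS_MA_REF\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reference_variants.txt.gz")
        with gzip.open(path, "wt") as f:
            f.write(header)
        return count_data_rows(path)
```

A result of 0 from count_data_rows on the real reference_variants.txt.gz would match the behaviour described here.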
Command used and terminal output
Code: https://github.com/PGScatalog/pygscatalog/blob/main/pgscatalog.match/src/pgscatalog/match/cli/intersect_cli.py
Since reference_variants is empty, the error makes sense.
Data Processing: The script successfully processed 84,805,772 reference variants and wrote them to temporary files. The temporary files were correctly written with the expected structure, including the "CHR:POS:A0:A1" column.
Idea: When attempting to merge these temporary files and write to the final output, the system ran out of available RAM. This memory exhaustion caused the heapq.merge operation to fail silently, right after the file is opened and the header is written. As a result, only the header was written to the reference_variants.txt.gz file before the process was interrupted. Inspecting the Nextflow run did not confirm or reject this idea. Perhaps it needs more than 4 GB?
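To make the idea above concrete, here is a minimal sketch of the general pattern it describes: sorted temporary files merged with heapq.merge and streamed into a gzip file after a header. This is not the code from intersect_cli.py; merge_sorted_chunks and demo are hypothetical names, and the chunk files here carry no header of their own. In this pattern heapq.merge is lazy and holds one pending line per input file, so the merge loop itself needs memory roughly proportional to the number of temporary files rather than to the number of variants.

```python
import gzip
import heapq
import os
import tempfile
from contextlib import ExitStack

HEADER = "CHR:POS:A0:A1\tID_REF\tREF_REF\tIS_INDEL\tSTRANDAMB\tIS_MA_REF\n"


def merge_sorted_chunks(chunk_paths, out_path, header=HEADER):
    """Merge already-sorted text files into one gzip file; return rows written.

    Hypothetical helper for illustration, not the pgscatalog implementation.
    """
    n_rows = 0
    with ExitStack() as stack:
        chunks = [stack.enter_context(open(p, "rt")) for p in chunk_paths]
        out = stack.enter_context(gzip.open(out_path, "wt"))
        out.write(header)  # the header is written before any merged row
        # heapq.merge yields lines lazily, one pending line per input file
        for line in heapq.merge(*chunks):
            out.write(line)
            n_rows += 1
    return n_rows


def demo():
    """Write two small sorted chunks, merge them, and return the output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, ids in enumerate([["1:100:A:G", "1:300:C:T"], ["1:200:A:T"]]):
            path = os.path.join(tmp, f"chunk{i}.txt")
            with open(path, "wt") as f:
                f.writelines(f"{v}\t{v}\tA\tFalse\tFalse\tFalse\n" for v in ids)
            paths.append(path)
        out_path = os.path.join(tmp, "reference_variants.txt.gz")
        n_rows = merge_sorted_chunks(paths, out_path)
        with gzip.open(out_path, "rt") as f:
            return n_rows, f.read().splitlines()
```

If the real merge follows this streaming pattern, comparing the row count it reports against the number of rows in the temporary files would show whether the merge loop ran at all.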
PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_VCF task:
PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:EXTRACT_DATABASE task:
PGSCATALOG_PGSCCALC:PGSCCALC:INPUT_CHECK:COMBINE_SCOREFILES task:
PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS task:
Relevant files
reference_variants.txt contains only the following:
=== Contents of reference_variants.txt ===
CHR:POS:A0:A1 ID_REF REF_REF IS_INDEL STRANDAMB IS_MA_REF
=== Contents of GRCh37_1000G_ALL.psam ===
IID PAT MAT SEX SuperPop Population
HG00096 0 0 1 EUR GBR
HG00097 0 0 2 EUR GBR
HG00099 0 0 2 EUR GBR
HG00100 0 0 2 EUR GBR
=== Contents of GRCh37_file127_ALL.afreq.gz ===
CHROM ID REF ALT ALT_FREQS OBS_CT
1 1:10642:G:A G A 0 2
1 1:11008:C:G C G 0 2
1 1:11012:C:G C G 0 2
1 1:11063:T:G T G 0 2
=== Contents of GRCh37_file127_ALL.vmiss.gz ===
ID F_MISS_DOSAGE F_MISS
1:10642:G:A 0 0
1:11008:C:G 0 0
1:11012:C:G 0 0
1:11063:T:G 0 0
=== Contents of GRCh37_1000G_ALL.pvar.zst ===
reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
contig=
contig=
contig=
contig=
=== Contents of GRCh37_file127_ALL.pvar.zst ===
CHROM POS ID REF ALT
1 10642 1:10642:G:A G A
1 11008 1:11008:C:G C G
1 11012 1:11012:C:G C G
1 11063 1:11063:T:G T G
--- Contents of GRCh37_1000G_ALL.psam ---
IID PAT MAT SEX SuperPop Population
HG00096 0 0 1 EUR GBR
HG00097 0 0 2 EUR GBR
HG00099 0 0 2 EUR GBR
HG00100 0 0 2 EUR GBR
--- Contents of GRCh37_file127_ALL.psam ---
IID SEX
file_127.ancestry.txt NA
=== Contents of /tmp/tmpjo6asoom/tmpchbm3n41 ===
CHR:POS:A0:A1 ID_REF REF_REF IS_INDEL STRANDAMB IS_MA_REF
1:10000006:A:G 1:10000006:G:A G False False False
1:10000020:A:T 1:10000020:T:A T False True False
1:10000072:C:T 1:10000072:C:T C False False False
1:10000143:C:T 1:10000143:C:T C False False False
1:10000160:C:G 1:10000160:G:C G False True False
1:10000179:A:AAAAAAAC 1:10000179:AAAAAAAC:A AAAAAAAC True False False
1:10000185:A:C 1:10000185:A:C A False False False
1:10000186:C:G 1:10000186:C:G C False True False
1:10000228:C:T 1:10000228:T:C T False False False
1:10000236:C:T 1:10000236:T:C T False False False
1:10000283:A:G 1:10000283:G:A G False False False
1:10000302:A:T 1:10000302:T:A T False True False
1:10000320:C:T 1:10000320:C:T C False False False
1:10000327:C:T 1:10000327:C:T C False False False
1:10000354:C:T 1:10000354:C:T C False False False
1:10000371:A:T 1:10000371:A:T A False True False
1:1000037:A:G 1:1000037:A:G A False False False
1:10000396:A:G 1:10000396:A:G A False False False
1:10000400:A:T 1:10000400:T:A T False True False
System information
Information: pgscatalog/pgsc_calc v2.0.0-beta.3
profile: singularity
CPUs: 4 - Mem: 31 GB (3.3 GB) - Swap: 0 (0)
Nextflow version: 24.04.4