PGScatalog / pygscatalog

Python applications and libraries for working with PGS data and the PGS Catalog
https://pygscatalog.readthedocs.io/en/latest/
Apache License 2.0

Combine scorefiles scalability/reliability? #14

Closed smlmbrt closed 6 months ago

smlmbrt commented 6 months ago

I found out that PGS000192 seems to lead to problems. Without it, there are no error messages for now.

However, the process has now been running for 5 days, and with a current memory usage of 190 GB it is spilling into swap. I am afraid it will not finish in a reasonable time. Is there a way to reduce the resource requirements for combining score files?

_Originally posted by @mfasold in https://github.com/PGScatalog/pgscatalog_utils/issues/85#issuecomment-2031218446_

smlmbrt commented 6 months ago
Yes, `pgscatalog-combine` is what uses up the resources, but I don't know which part of it. All I see is console output like this (snippet):
pgscatalog.core.lib._normalise: 2024-04-03 09:17:48 WARNING  173 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:17:52 WARNING  161 of 2209179 variants are duplicated in: PGS000644
pgscatalog.core.lib._normalise: 2024-04-03 09:18:13 WARNING  134 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:18:13 WARNING  126 of 2100302 variants are duplicated in: PGS000027
pgscatalog.core.lib._normalise: 2024-04-03 09:19:05 WARNING  Multiple other_alleles detected in 18 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:19:05 WARNING  Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:20:28 WARNING  10 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:20:30 WARNING  4 of 1155382 variants are duplicated in: PGS003806
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING  11 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING  Multiple other_alleles detected in 1266 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING  Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING  8 of 833480 variants are duplicated in: PGS003429
pgscatalog.core.lib._normalise: 2024-04-03 09:22:29 WARNING  4 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING  2 variants have invalid effect alleles (not ACTG)
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING  Complex scoring file detected
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING  Complex files are difficult to calculate properly and may require manual intervention
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING  1 of 2168 variants are duplicated in: PGS000962
pgscatalog.core.lib._normalise: 2024-04-03 09:23:48 WARNING  10 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:23:48 WARNING  7 of 83000 variants are duplicated in: PGS003721
pgscatalog.core.lib._normalise: 2024-04-03 09:23:50 WARNING  Complex scoring file detected
pgscatalog.core.lib._normalise: 2024-04-03 09:23:50 WARNING  Complex files are difficult to calculate properly and may require manual intervention
pgscatalog.core.lib._normalise: 2024-04-03 09:25:07 WARNING  14 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:25:07 WARNING  7 of 1117400 variants are duplicated in: PGS004613
pgscatalog.core.lib._normalise: 2024-04-03 09:36:07 WARNING  93 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING  Multiple other_alleles detected in 212909 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING  Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING  89 of 1059939 variants are duplicated in: PGS004359
pgscatalog.core.lib._normalise: 2024-04-03 09:37:29 WARNING  22 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:37:29 WARNING  6 of 30603 variants are duplicated in: PGS002452

It would be useful to know whether it is worth waiting. A few screens further up, I see the following progress message:

1%|██ | 61/4488 [41:14:47<49066:42:55, 39900.65s/it]

Together with the fact that memory usage is only increasing, there is not much hope, is there?

_Originally posted by @mfasold in https://github.com/PGScatalog/pgscatalog_utils/issues/85#issuecomment-2033827381_
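For a rough sense of scale, the remaining-time estimate in the progress bar above can be reproduced from its own numbers (61 of 4488 items done at about 39900.65 seconds per item). This is only a sketch of the arithmetic; the helper name `remaining_time` is made up for illustration and is not part of any tool mentioned in this thread. It gives roughly 49,000 hours, or about 5.6 years, which agrees with the bar's estimate.

```python
from datetime import timedelta


def remaining_time(done: int, total: int, seconds_per_item: float) -> timedelta:
    """Estimate time left, assuming the per-item rate stays constant."""
    return timedelta(seconds=(total - done) * seconds_per_item)


# Numbers copied from the progress bar: 61/4488 at 39900.65 s/it
eta = remaining_time(done=61, total=4488, seconds_per_item=39900.65)
hours = eta.total_seconds() / 3600
print(f"about {hours:.0f} hours, or {eta.days / 365:.1f} years")
```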

nebfield commented 6 months ago

This was a great stress test of the new code 🤠 I only tested combining 100 - 200 randomly selected scoring files at a time during development.

The simplest fix for memory problems is to disable concurrency for now. It's slower but it works and memory usage is minimal when working with ~4000 files. In the future we may want to think about enabling concurrency and writing output a different way (a binary format? one output file per input file?).
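To illustrate why processing without concurrency keeps memory low, here is a minimal sketch of the general idea: read one file at a time, row by row, and write each row to the output immediately so nothing accumulates in memory. This is not the actual `pgscatalog-combine` implementation. The function names, the `source` column, and the assumption that inputs are gzipped tab-separated files with optional `#` comment lines are all hypothetical.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterable, Iterator


def iter_rows(path: Path) -> Iterator[dict]:
    """Yield one row at a time from a gzipped, tab-separated file."""
    with gzip.open(path, "rt", newline="") as handle:
        # Assumption: lines starting with '#' are comments before the column header
        data_lines = (line for line in handle if not line.startswith("#"))
        yield from csv.DictReader(data_lines, delimiter="\t")


def combine_sequentially(
    paths: Iterable[Path], out_path: Path, columns: list[str]
) -> int:
    """Stream rows from each input into one gzipped output, one file at a time."""
    n_rows = 0
    with gzip.open(out_path, "wt", newline="") as out:
        writer = csv.DictWriter(
            out,
            fieldnames=["source", *columns],
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        for path in paths:
            for row in iter_rows(path):
                # Only the current row is held in memory; missing columns become ""
                record = {c: row.get(c, "") for c in columns}
                record["source"] = path.name
                writer.writerow(record)
                n_rows += 1
    return n_rows
```

With this pattern, memory use depends on the size of a single row rather than on the number or size of input files, at the cost of processing the files one after another.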

mfasold commented 6 months ago

Glad to help...

There is one problem case not mentioned so far, in the range PGS003416 to PGS003433.

nebfield commented 6 months ago

That range is OK now with v1.0.1:

$ time pgscatalog-combine -s  PGS003416_hmPOS_GRCh38.txt.gz PGS003417_hmPOS_GRCh38.txt.gz PGS003418_hmPOS_GRCh38.txt.gz PGS003419_hmPOS_GRCh38.txt.gz PGS003420_hmPOS_GRCh38.txt.gz PGS003421_hmPOS_GRCh38.txt.gz PGS003422_hmPOS_GRCh38.txt.gz PGS003423_hmPOS_GRCh38.txt.gz PGS003424_hmPOS_GRCh38.txt.gz PGS003427_hmPOS_GRCh38.txt.gz PGS003428_hmPOS_GRCh38.txt.gz PGS003430_hmPOS_GRCh38.txt.gz PGS003431_hmPOS_GRCh38.txt.gz PGS003432_hmPOS_GRCh38.txt.gz PGS003433_hmPOS_GRCh38.txt.gz PGS003434_hmPOS_GRCh38.txt.gz PGS003435_hmPOS_GRCh38.txt.gz PGS003436_hmPOS_GRCh38.txt.gz PGS003437_hmPOS_GRCh38.txt.gz PGS003438_hmPOS_GRCh38.txt.gz PGS003439_hmPOS_GRCh38.txt.gz PGS003440_hmPOS_GRCh38.txt.gz PGS003441_hmPOS_GRCh38.txt.gz PGS003442_hmPOS_GRCh38.txt.gz PGS003443_hmPOS_GRCh38.txt.gz -t GRCh38 -o test.txt.gz -v
58.71s user 1.45s system 101% cpu 59.464 total

I was also able to combine all scoring files in the range PGS000001 through PGS003900.

Thanks for your help @mfasold 🥳 Please let us know if you run into any more problems.