Closed by smlmbrt 6 months ago
Yes, `pgscatalog-combine` is what uses up the resources. I don't know which part of it is responsible. All I see is console output like this (snippet):
pgscatalog.core.lib._normalise: 2024-04-03 09:17:48 WARNING 173 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:17:52 WARNING 161 of 2209179 variants are duplicated in: PGS000644
pgscatalog.core.lib._normalise: 2024-04-03 09:18:13 WARNING 134 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:18:13 WARNING 126 of 2100302 variants are duplicated in: PGS000027
pgscatalog.core.lib._normalise: 2024-04-03 09:19:05 WARNING Multiple other_alleles detected in 18 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:19:05 WARNING Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:20:28 WARNING 10 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:20:30 WARNING 4 of 1155382 variants are duplicated in: PGS003806
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING 11 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING Multiple other_alleles detected in 1266 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING 8 of 833480 variants are duplicated in: PGS003429
pgscatalog.core.lib._normalise: 2024-04-03 09:22:29 WARNING 4 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING 2 variants have invalid effect alleles (not ACTG)
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING Complex scoring file detected
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING Complex files are difficult to calculate properly and may require manual intervention
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING 1 of 2168 variants are duplicated in: PGS000962
pgscatalog.core.lib._normalise: 2024-04-03 09:23:48 WARNING 10 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:23:48 WARNING 7 of 83000 variants are duplicated in: PGS003721
pgscatalog.core.lib._normalise: 2024-04-03 09:23:50 WARNING Complex scoring file detected
pgscatalog.core.lib._normalise: 2024-04-03 09:23:50 WARNING Complex files are difficult to calculate properly and may require manual intervention
pgscatalog.core.lib._normalise: 2024-04-03 09:25:07 WARNING 14 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:25:07 WARNING 7 of 1117400 variants are duplicated in: PGS004613
pgscatalog.core.lib._normalise: 2024-04-03 09:36:07 WARNING 93 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING Multiple other_alleles detected in 212909 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING 89 of 1059939 variants are duplicated in: PGS004359
pgscatalog.core.lib._normalise: 2024-04-03 09:37:29 WARNING 22 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:37:29 WARNING 6 of 30603 variants are duplicated in: PGS002452
It would be interesting to know whether it is worth waiting. A few screen pages up I see the following message:
1%|██ | 61/4488 [41:14:47<49066:42:55, 39900.65s/it]
Given that memory use also keeps increasing, there is not much hope, is there?
_Originally posted by @mfasold in https://github.com/PGScatalog/pgscatalog_utils/issues/85#issuecomment-2033827381_
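For scale, the numbers in the quoted progress line can be unpacked with a little arithmetic. This is only a sketch: it assumes the line follows the usual tqdm-style layout (`done/total [elapsed<remaining, rate]`), and the variable names are made up for illustration.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the quoted progress line.
done, total = 61, 4488
elapsed_s = hms_to_seconds("41:14:47")
recent_rate_s = 39900.65  # seconds per file, as displayed

# Average time per file over the whole run so far.
average_rate_s = elapsed_s / done

# Remaining time if every remaining file took as long as the displayed rate.
remaining_s = (total - done) * recent_rate_s
remaining_years = remaining_s / (365 * 24 * 3600)

print(f"average so far: {average_rate_s:.0f} s/file")
print(f"displayed rate: {recent_rate_s:.0f} s/file")
print(f"estimated remaining: {remaining_years:.1f} years")
```

The displayed rate is far higher than the average over the run so far, which fits the report that the job kept slowing down as memory use grew.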
This was a great stress test of the new code 🤠 I only tested combining 100 - 200 randomly selected scoring files at a time during development.
[…] (`is_recessive`) instead of two (`is_dominant`). This is valid according to PGS Catalog standards but I didn't expect it.

The simplest fix for memory problems is to disable concurrency for now. It's slower but it works and memory usage is minimal when working with ~4000 files. In the future we may want to think about enabling concurrency and writing output a different way (a binary format? one output file per input file?).
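To illustrate why dropping concurrency keeps memory flat, here is a minimal sketch of the general idea: read one file at a time and stream each row straight to the output, so only one row is held in memory at once. This is not the `pgscatalog-combine` implementation. The function names, the `source` column, and the simplified handling of `#` header lines are all assumptions made for the example.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterable, Iterator


def read_rows(path: Path) -> Iterator[dict]:
    """Lazily yield data rows from one gzipped, tab-separated scoring file."""
    with gzip.open(path, "rt", newline="") as handle:
        # Assumption: metadata lines start with '#' and can be skipped.
        data_lines = (line for line in handle if not line.startswith("#"))
        yield from csv.DictReader(data_lines, delimiter="\t")


def combine_sequentially(
    paths: Iterable[Path], out_path: Path, fields: list[str]
) -> int:
    """Append rows from each input to one gzipped output, one file at a time.

    Returns the number of rows written. Nothing is accumulated in memory:
    each row is written as soon as it is read.
    """
    written = 0
    with gzip.open(out_path, "wt", newline="") as out:
        writer = csv.DictWriter(
            out,
            fieldnames=["source", *fields],
            delimiter="\t",
            extrasaction="ignore",  # drop columns not listed in `fields`
            restval="",  # leave absent columns empty
            lineterminator="\n",
        )
        writer.writeheader()
        for path in paths:  # no worker pool: strictly one file after another
            for row in read_rows(path):
                row["source"] = path.name
                writer.writerow(row)
                written += 1
    return written
```

A concurrent version would need either to buffer results from several workers before writing or to coordinate access to the shared output, which is where one output file per input file could help.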
Glad to help...
There is one unmentioned problem case in the range PGS003416 .. PGS003433
That range is OK now with v1.0.1:
$ time pgscatalog-combine -s PGS003416_hmPOS_GRCh38.txt.gz PGS003417_hmPOS_GRCh38.txt.gz PGS003418_hmPOS_GRCh38.txt.gz PGS003419_hmPOS_GRCh38.txt.gz PGS003420_hmPOS_GRCh38.txt.gz PGS003421_hmPOS_GRCh38.txt.gz PGS003422_hmPOS_GRCh38.txt.gz PGS003423_hmPOS_GRCh38.txt.gz PGS003424_hmPOS_GRCh38.txt.gz PGS003427_hmPOS_GRCh38.txt.gz PGS003428_hmPOS_GRCh38.txt.gz PGS003430_hmPOS_GRCh38.txt.gz PGS003431_hmPOS_GRCh38.txt.gz PGS003432_hmPOS_GRCh38.txt.gz PGS003433_hmPOS_GRCh38.txt.gz PGS003434_hmPOS_GRCh38.txt.gz PGS003435_hmPOS_GRCh38.txt.gz PGS003436_hmPOS_GRCh38.txt.gz PGS003437_hmPOS_GRCh38.txt.gz PGS003438_hmPOS_GRCh38.txt.gz PGS003439_hmPOS_GRCh38.txt.gz PGS003440_hmPOS_GRCh38.txt.gz PGS003441_hmPOS_GRCh38.txt.gz PGS003442_hmPOS_GRCh38.txt.gz PGS003443_hmPOS_GRCh38.txt.gz -t GRCh38 -o test.txt.gz -v
58.71s user 1.45s system 101% cpu 59.464 total
I was also able to combine all scoring files in the range `PGS000001` through `PGS003900`.
Thanks for your help @mfasold 🥳 Please let us know if you experience more problems
_Originally posted by @mfasold in https://github.com/PGScatalog/pgscatalog_utils/issues/85#issuecomment-2031218446_