PGScatalog / pgscatalog_utils

(superseded by pygscatalog) Utilities for working with PGS Catalog API and scoring files
Apache License 2.0

Cannot combine all scorefiles #85

Closed mfasold closed 4 months ago

mfasold commented 5 months ago

Following this discussion, I am trying to create one big scorefile which combines all scorefiles from the catalog.

Hence, I first downloaded all individual scorefiles using

for pgsid in $(cat pgs_scores_list.txt); do echo "$pgsid"; pgscatalog-download -i "$pgsid" -o . -b GRCh38; done

I then tried to combine them using

pgscatalog-combine -s PGS*.txt.gz -t GRCh38 -o combined.txt

This leads to the following error:

...
  File "XXX/venv/lib/python3.11/site-packages/pgscatalog/core/lib/genomebuild.py", line 51, in from_string
    raise ValueError(f"Can't match {build=}")
ValueError: Can't match build='NCBI35'

Any ideas on how to proceed?

mfasold commented 4 months ago

I found out that PGS000192 seems to lead to problems. Without it, there are no error messages for now.
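
One way to find such files up front is to read the genome build declared in each scoring file header before combining. The sketch below is only an illustration: it assumes the downloaded files are gzipped text whose header consists of `#key=value` lines with a `genome_build` field, and the helper names and the set of accepted builds are hypothetical, not taken from pgscatalog_utils.

```python
import gzip
from pathlib import Path


def read_header(path):
    """Return the '#key=value' header fields of a gzipped scoring file as a dict."""
    fields = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # the header ends at the first non-comment line
            key, sep, value = line.lstrip("#").strip().partition("=")
            if sep:  # skip banner lines that carry no key=value pair
                fields[key] = value
    return fields


def files_with_unexpected_build(paths, accepted_builds, key="genome_build"):
    """Return (file name, build) pairs whose declared build is not accepted.

    `key` and `accepted_builds` are assumptions made for this sketch.
    """
    flagged = []
    for path in paths:
        build = read_header(path).get(key)
        if build not in accepted_builds:
            flagged.append((Path(path).name, build))
    return flagged
```

For example, `files_with_unexpected_build(sorted(Path(".").glob("PGS*.txt.gz")), {"GRCh37", "GRCh38"})` would list every file whose declared build falls outside that (hypothetical) set, so those files can be excluded or inspected before running the combine step.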

However, the process has now been running for 5 days, and with a current memory usage of 190 GB I am running into swap. I am afraid it will not finish in a reasonable time. Is there a way to reduce the resource requirements for combining score files?

smlmbrt commented 4 months ago

Which part is using 190 GB, I assume the matching? There's not really a way to get around this - you could match smaller subsets of files and re-combine the matches to make the final scoring file, but we haven't really tested it at this scale.
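
The idea of working on smaller subsets can be sketched as follows. This only builds the command lines and does not run anything; it reuses the `-s`, `-t` and `-o` options from the command earlier in this thread, while the batch size, the output prefix and the helper names are hypothetical choices for illustration.

```python
def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def combine_commands(scorefiles, batch_size, build="GRCh38", prefix="combined_batch"):
    """Build one pgscatalog-combine argument list per batch (nothing is executed)."""
    commands = []
    chunks = batches(sorted(scorefiles), batch_size)
    for number, chunk in enumerate(chunks, start=1):
        output = f"{prefix}_{number:03d}.txt"  # hypothetical naming scheme
        commands.append(["pgscatalog-combine", "-s", *chunk, "-t", build, "-o", output])
    return commands
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)`, one batch at a time, so that peak memory is bounded by the largest batch rather than by the whole catalog.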

mfasold commented 4 months ago

Yes, pgscatalog-combine is what uses up the resources, though I don't know which part of it. All I see is a console log like this (snippet):

pgscatalog.core.lib._normalise: 2024-04-03 09:17:48 WARNING  173 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:17:52 WARNING  161 of 2209179 variants are duplicated in: PGS000644
pgscatalog.core.lib._normalise: 2024-04-03 09:18:13 WARNING  134 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:18:13 WARNING  126 of 2100302 variants are duplicated in: PGS000027
pgscatalog.core.lib._normalise: 2024-04-03 09:19:05 WARNING  Multiple other_alleles detected in 18 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:19:05 WARNING  Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:20:28 WARNING  10 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:20:30 WARNING  4 of 1155382 variants are duplicated in: PGS003806
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING  11 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING  Multiple other_alleles detected in 1266 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING  Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:22:15 WARNING  8 of 833480 variants are duplicated in: PGS003429
pgscatalog.core.lib._normalise: 2024-04-03 09:22:29 WARNING  4 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING  2 variants have invalid effect alleles (not ACTG)
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING  Complex scoring file detected
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING  Complex files are difficult to calculate properly and may require manual intervention
pgscatalog.core.lib._normalise: 2024-04-03 09:22:30 WARNING  1 of 2168 variants are duplicated in: PGS000962
pgscatalog.core.lib._normalise: 2024-04-03 09:23:48 WARNING  10 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:23:48 WARNING  7 of 83000 variants are duplicated in: PGS003721
pgscatalog.core.lib._normalise: 2024-04-03 09:23:50 WARNING  Complex scoring file detected
pgscatalog.core.lib._normalise: 2024-04-03 09:23:50 WARNING  Complex files are difficult to calculate properly and may require manual intervention
pgscatalog.core.lib._normalise: 2024-04-03 09:25:07 WARNING  14 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:25:07 WARNING  7 of 1117400 variants are duplicated in: PGS004613
pgscatalog.core.lib._normalise: 2024-04-03 09:36:07 WARNING  93 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING  Multiple other_alleles detected in 212909 variants
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING  Other allele for these variants is set to missing
pgscatalog.core.lib._normalise: 2024-04-03 09:36:08 WARNING  89 of 1059939 variants are duplicated in: PGS004359
pgscatalog.core.lib._normalise: 2024-04-03 09:37:29 WARNING  22 bad variants
pgscatalog.core.lib._normalise: 2024-04-03 09:37:29 WARNING  6 of 30603 variants are duplicated in: PGS002452

It would be interesting to know whether it is worth waiting. A few screens further up, I see the following progress message:

1%|██ | 61/4488 [41:14:47<49066:42:55, 39900.65s/it]

Together with the fact that the memory usage keeps increasing, there is not much hope, is there?
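
To put that progress line into numbers: 4427 items remain at about 39900 s each, which is roughly 49,000 hours, or more than five years. A small sketch that redoes the arithmetic from the line itself (the function name is hypothetical, and the regular expression only targets this particular progress-bar layout):

```python
import re


def remaining_time(progress_line):
    """Estimate the remaining (hours, days) from a 'done/total [..., rate s/it]' line."""
    match = re.search(r"(\d+)/(\d+) \[.*?, ([\d.]+)s/it\]", progress_line)
    if match is None:
        raise ValueError("unrecognised progress line")
    done, total, rate = int(match[1]), int(match[2]), float(match[3])
    seconds = (total - done) * rate  # remaining items times seconds per item
    return seconds / 3600, seconds / 86400


line = "1%|██ | 61/4488 [41:14:47<49066:42:55, 39900.65s/it]"
hours, days = remaining_time(line)
print(f"about {hours:.0f} hours, or {days:.0f} days, remaining")
```

The rate shown by the progress bar is itself an estimate, so the result is only an order of magnitude, but it confirms that waiting is not realistic at the current speed.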

smlmbrt commented 4 months ago

Hmmm, that's written to be memory-light so we'll have to check on that (cc @nebfield). But you can run it on batches of scores and then concatenate the outputs as they'll have the same column order.
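
A minimal sketch of that concatenation step, assuming each batch output is an uncompressed text file whose first line is an identical header row. The function name is hypothetical; the header check is there to catch a batch whose columns differ.

```python
def concatenate(inputs, output):
    """Concatenate text files that share a header row, writing the header once."""
    header = None
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as handle:
                first = handle.readline()
                if header is None:
                    header = first
                    out.write(first)
                elif first != header:
                    raise ValueError(f"{path}: header differs from the first file")
                for line in handle:
                    # guard against a missing newline at the end of a file
                    out.write(line if line.endswith("\n") else line + "\n")
```

Streaming line by line keeps memory use flat regardless of how large the individual batch outputs are.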

smlmbrt commented 4 months ago

You shouldn't have the memory problem in v0.5.3, it looks like you're using v1.

mfasold commented 4 months ago

Yes, I was using v1.0. Thanks for the advice, I am trying v0.5.3 now. At first it seemed like a nice performance upgrade, reaching item 65/4488 very quickly. But two hours in, at item 78/4488, it seems to be slowing down, and memory usage is up at 120 GB again.

mfasold commented 4 months ago

> But you can run it on batches of scores and then concatenate the outputs as they'll have the same column order.

I tried it with batches of 70 scores. This works for all but one batch (PGS003364 to PGS003433), which stalls indefinitely.

nebfield commented 4 months ago

Fixed in the new release:

https://github.com/PGScatalog/pygscatalog/pull/15

Thanks for the bug report 🚀

mfasold commented 4 months ago

Is it already possible to use this release in pgsc_calc? AFAICS, it is currently using the docker image ghcr.io/pgscatalog/pgscatalog_utils:v0.4.3

nebfield commented 4 months ago

We're planning to integrate pygscatalog with pgsc_calc in the next release of the calculator.

The release should be fairly soon (within a few weeks). The largest amount of work will be setting up some automatic correlation tests to make sure calculated scores remain stable across releases. We normally do slow manual testing 😅

mfasold commented 4 months ago

Ok, thank you for the info. I was trying to run pgsc_calc with all the scores (stress test?) and it failed after 6h in the COMBINE_SCOREFILES step. So I am hoping that it will be possible with those fixes included in the next release.

mfasold commented 4 months ago

Remember that I had previously combined scorefiles in batches of 70. I want to obtain a list of all positions that carry effect alleles across all studies, in order to do variant calling at those positions. If I collect those positions from the batch-combined scorefiles, I get 17M positions. If I take them from the single combined scorefile produced by the new release, I obtain only 2.6M positions. Is there any reasonable explanation for such a loss of variant positions after combining scores? Unfortunately, I deleted my old parallel processing results.
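
If the batched outputs were regenerated, the two sets of positions could be compared directly to see which ones go missing. The sketch below assumes the combined files are tab-separated with a header row and that the chromosome and position columns are called `chr_name` and `chr_position`; those column names and the function names are assumptions for illustration and should be checked against the actual output.

```python
import csv


def positions(path, chrom_col="chr_name", pos_col="chr_position"):
    """Collect the unique (chromosome, position) pairs from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {
            (row[chrom_col], row[pos_col])
            for row in reader
            if row[chrom_col] and row[pos_col]  # skip rows without coordinates
        }


def compare_positions(batched_paths, single_path):
    """Compare positions from several batched files with one combined file."""
    batched = set()
    for path in batched_paths:
        batched |= positions(path)
    single = positions(single_path)
    return {
        "only_batched": batched - single,
        "only_single": single - batched,
        "shared": batched & single,
    }
```

Looking at a sample of `only_batched` (for instance, whether those positions cluster in particular scores or chromosomes) would help narrow down where the difference comes from.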