nebfield closed 10 months ago
@AWS-crafter I created a new issue from your comment to help investigate this specific problem you're experiencing
When you run the workflow without the `--run_ancestry` parameter, how well do your genomes match the input scoring files? Very low match rates could cause an error like this.
@smlmbrt will be able to help more than me for this specific issue because he wrote the ancestry analysis code 🧙
@AWS-crafter are you by any chance running the pipeline on a single sample?
@smlmbrt Yes, I'm running it on a single sample. I'm just testing for now, so I'm using non-imputed single-sample WGS files. In the future I will only be using imputed data (probably from BEAGLE). I will run without ancestry to determine the match rate and update this comment. A similar non-imputed WGS file, which completed successfully, had this match rate:
“Reference matching summary:” % matched: 6.04
Then, under “Summary” and the sampleset for the WGS file: Match %: 46.9
By any chance, might this happen? I have seen something similar in some other tools (e.g., Michigan Imputation Server).
When running a single sample, an ideal approach might be to drop a site if it is monomorphic in the sample (i.e. the sampleset) and the reference panel.
Hi there, any updates on fixes for this issue?
@kmuenzen are you referring to it working on a single sample or the low-match % when using non-imputed genotypes? I will look into the first one soon.
@smlmbrt the first one. Thanks so much!
@kmuenzen could you share the error you get when `ancestry_analysis` fails? I just tried running the pipeline with a single sample and it doesn't seem to fail.
@smlmbrt
Execution cancelled -- Finishing pending tasks before exit
Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)'
Caused by:
Process `PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)` terminated with an error exit status (1)
Command executed:
# TODO: --ref_pcs is a horrible hack to select the first duplicate
ancestry_analysis -d biome-test -r reference --psam GRCh38_1000G_ALL.psam --ref_pcs ref_pcs/1.pcs --target_pcs target_pcs/*.pcs -x
cat <<-END_VERSIONS > versions.yml
ANCESTRY_ANALYSIS:
pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
root: 2023-10-20 23:55:42 DEBUG Verbose logging enabled
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG Reading PCA projection: ref_pcs/1.pcs
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG Initialising combined DF
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG Filtering to relevant PCs
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG Flagging related samples with: GRCh38_1000G.king.cutoff.out.id
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG Reading PCA projection: target_pcs/001.pcs
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG Initialising combined DF
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG Reading PCA projection: target_pcs/002.pcs
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG Appending to combined DF
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG Filtering to relevant PCs
pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG Reading aggregated score data: aggregated_scores.txt.gz
Traceback (most recent call last):
File "/venv/bin/ancestry_analysis", line 8, in <module>
sys.exit(ancestry_analysis())
File "/venv/lib/python3.10/site-packages/pgscatalog_utils/ancestry/ancestry_analysis.py", line 42, in ancestry_analysis
ancestry_ref, ancestry_target, compare_info = compare_ancestry(ref_df=reference_df,
File "/venv/lib/python3.10/site-packages/pgscatalog_utils/ancestry/tools.py", line 79, in compare_ancestry
mwu_pc = mannwhitneyu(ref_train_df[col_pc], target_df[col_pc])
File "/venv/lib/python3.10/site-packages/scipy/stats/_axis_nan_policy.py", line 503, in axis_nan_policy_wrapper
res = hypotest_fun_out(*samples, **kwds)
File "/venv/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py", line 460, in mannwhitneyu
_mwu_input_validation(x, y, use_continuity, alternative, axis, method))
File "/venv/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py", line 203, in _mwu_input_validation
raise ValueError('`x` and `y` must be of nonzero size.')
ValueError: `x` and `y` must be of nonzero size.
Work dir:
/sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
ERROR: No results report written!
@smlmbrt @kmuenzen
I am getting the same error. Could you let me know if you resolved this issue?
Could you run:
head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/ref_pcs/1.pcs
head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/target_pcs/001.pcs
head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/target_pcs/002.pcs
gzcat aggregated_scores.txt.gz | head
It would be helpful to see what the files look like. It seems like your files have more than 1 sample, so it may be that the PCA calculation is going wrong and returning some empty dfs.
Sure thing--here you go! Thank you!
[muenzk01@regen2 ~]$ head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/ref_pcs/1.pcs
IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
HG00096 -22.9793 -50.2136 13.6757 18.6205 -1.0675 3.5750 -1.7383 0.5516 0.6388 -0.9387
HG00097 -23.5658 -49.7249 13.1469 17.2915 -0.3716 5.0207 -1.2759 1.1439 1.1458 -5.5456
HG00099 -23.9904 -50.5022 14.3540 17.9357 -2.2576 5.7937 -1.7419 1.8302 -2.7424 0.3508
HG00100 -24.1005 -50.2796 16.1124 18.9870 -0.7569 3.0548 -1.2720 0.6963 -2.0359 1.7524
HG00101 -24.5031 -49.1951 14.4492 17.6531 -0.9851 6.3107 -3.9469 -0.2086 -0.4370 -0.5810
HG00102 -23.4615 -50.5164 13.0669 18.3179 -1.7605 4.8565 -1.5584 -0.7356 2.8225 -0.5260
HG00103 -23.0385 -49.4304 13.3134 18.9738 -0.2883 6.3473 -0.3034 1.1887 -4.3406 0.8370
HG00105 -25.3557 -49.6544 14.9909 17.6121 0.8649 3.7357 -1.1401 0.5968 -4.6389 -2.0594
HG00106 -24.4528 -50.5133 12.4388 16.0958 2.9962 4.7232 -2.8032 2.3974 -1.3294 0.3437
[muenzk01@regen2 ~]$ head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/target_pcs/001.pcs
IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
XXXXXX53 -7.2719 -36.7665 12.6320 10.1940 1.4798 -7.0528 3.4545 -0.6976 0.2670 1.5669
XXXXXX83 -22.8284 -48.1951 14.8247 17.0373 -0.9858 4.2221 -1.0541 -1.0974 -2.1622 2.0034
XXXXXX07 -15.9913 -16.1299 28.3114 -22.3567 2.8505 -5.6035 0.5499 -0.7286 -0.0712 -0.5644
XXXXXX65 -34.9420 51.6389 7.6477 11.2796 -13.9236 -1.9328 -0.8617 0.8240 -3.8092 -14.6289
XXXXXX82 -19.2125 -41.9998 6.1198 14.6670 3.7215 -17.5066 5.1994 2.5246 -2.6737 -0.6892
XXXXXX12 -18.8852 -43.9067 7.3964 15.7738 2.5338 -16.1719 4.5616 0.3198 -0.2936 -4.5557
XXXXXX59 -22.8107 -49.2593 12.0117 18.6088 0.1927 3.5462 -3.9146 0.1334 1.2415 0.6785
XXXXXX62 -19.9087 -46.6020 7.5153 17.0778 1.6574 -15.3447 7.0460 1.9859 -2.8799 -0.6778
XXXXXX17 -37.1313 53.2683 9.7514 8.8948 -18.3655 -1.8795 1.1195 -0.6213 -2.2004 -9.5400
[muenzk01@regen2 ~]$ head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/target_pcs/002.pcs
IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
93 69.6788 6.5157 1.0354 0.7411 -0.4039 1.0143 0.5329 2.0902 -0.8908 1.0005
X19 62.9088 4.9745 1.5477 -0.1809 -1.5857 1.3580 -1.3621 4.0277 -0.4403 2.2139
X27 -4.8831 -27.7794 17.4015 -2.6830 3.7731 -12.7259 1.5784 -1.1199 -1.4060 -2.7215
X15 16.1620 -16.3076 13.3903 -2.2331 2.9216 -5.5014 0.3389 1.6810 1.9261 1.2277
X54 12.8512 -17.1292 16.6862 -5.0830 3.2757 -5.4289 -1.9472 1.2895 -0.3060 3.2039
X80 -9.1626 -26.6734 20.0412 -4.5341 2.8686 -10.4721 1.3198 -1.4688 2.3520 1.7657
X91 -13.0506 -31.2799 18.3390 -2.4302 3.9486 -10.5128 4.4644 0.3817 4.5359 -1.1057
X07 46.3433 -3.3335 6.7906 -1.0008 0.9525 -2.5622 -0.4529 0.0768 -0.8838 -0.0298
X52 17.6063 -17.0715 13.7085 -3.1210 1.6424 -4.3012 1.5122 1.3648 -2.0695 0.2564
[muenzk01@regen2 1697304ec1a7ea0faa9f7eab4fc27d]$ zcat aggregated_scores.txt.gz | head
sampleset IID DENOM PGS003197_hmPOS_GRCh38_SUM PGS003197_hmPOS_GRCh38_AVG
biome-test 93 16229154.0 -0.0451909 -2.7845505686864514e-09
biome-test X19 16229154.0 -0.103141 -6.355291224668889e-09
biome-test X27 16229154.0 -0.37267 -2.2962996099488613e-08
biome-test X15 16229154.0 -0.394929 -2.4334540173813124e-08
biome-test X54 16229154.0 -0.502914 -3.098830659934584e-08
biome-test X80 16229154.0 -0.281765 -1.7361656682782108e-08
biome-test X91 16229154.0 -0.27768 -1.710994916925429e-08
biome-test X07 16229154.0 -0.246382 -1.5181444454837262e-08
biome-test X52 16229154.0 -0.61067 -3.762796261591948e-08
@kmuenzen - do the sample IDs in the target PCs look right to you? Thinking of `XXXXXX53` vs. `93` vs. `X19`.
I finally was able to reproduce this bug - it happens when all the IDs in the psam are numeric!
$ cat numeric_OCE.psam | head
#IID SEX population latitude longitude region
655 1 Bougainville -6 155 OCEANIA
$ cat target_pcs/001.pcs | head
IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
655 -20.6509 28.8798 -18.9365 -0.6973 -1.0790 0.0627 -0.8486 2.1669 -14.1303 -8.9729
$ gzcat aggregated_scores.txt.gz | head
sampleset IID DENOM PGS000004_hmPOS_GRCh38_SUM PGS000018_hmPOS_GRCh38_SUM PGS000027_hmPOS_GRCh38_SUM PGS000036_hmPOS_GRCh38_SUM PGS000065_hmPOS_GRCh38_SUM PGS000889_hmPOS_GRCh38_SUM PGS003436_hmPOS_GRCh38_SUM PGS000004_hmPOS_GRCh38_AVG PGS000018_hmPOS_GRCh38_AVG PGS000027_hmPOS_GRCh38_AVG PGS000036_hmPOS_GRCh38_AVG PGS000065_hmPOS_GRCh38_AVG PGS000889_hmPOS_GRCh38_AVG PGS003436_hmPOS_GRCh38_AVG
HGDP 655 7300910.0 -0.93377 0.41891999999999996 38.78698 -2359.2295599999998 -0.11405929999999999 41.225443 4.27355 -1.2789775521133666e-07 5.7379148626678035e-08 5.312622673064043e-06 -0.0003231418494406861 -1.5622614167275037e-08 5.646617065543884e-06 5.853448405746681e-07
reference HG00096 7300910.0 -0.47219999999999995 -0.3971499999999999 38.28005 -2272.516 -0.3333011 39.4819 4.93062 -6.467686904783102e-08 -5.4397328552194165e-08 5.243188862758204e-06 -0.0003112647601463379 -4.565199406649308e-08 5.407805328376874e-06 6.753432106408653e-07
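One plausible mechanism for this, sketched under assumptions: if one table is read with pandas type inference, an all-numeric `IID` column becomes an integer column, and it then fails to match the same IDs read as text elsewhere. The toy file contents and the helper `matched_rows` below are hypothetical, and this is not taken from the `pgscatalog_utils` source; it only illustrates how such a mismatch could leave an empty table.

```python
import io

import pandas as pd

# Toy stand-ins for a target PCA projection and a psam file (tab-separated).
PCS_TEXT = "IID\tPC1\tPC2\n655\t-20.6509\t28.8798\n"
PSAM_TEXT = "#IID\tSEX\n655\t1\n"


def matched_rows(pcs_dtype=None):
    """Count PCA rows whose IID also appears in the psam (psam read as text)."""
    pcs = pd.read_csv(io.StringIO(PCS_TEXT), sep="\t", dtype=pcs_dtype)
    psam = pd.read_csv(io.StringIO(PSAM_TEXT), sep="\t", dtype=str)
    return int(pcs["IID"].isin(psam["#IID"]).sum())


inferred = matched_rows()             # IID inferred as an integer column
as_text = matched_rows({"IID": str})  # IID explicitly read as text
print(inferred, as_text)
```

With inference, the integer `655` is not equal to the string `"655"`, so no rows match; reading `IID` as text in both tables restores the match.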
Fix handling of numeric-only IIDs in:
pgscatalog_utils/combine_scorefiles
pgscatalog_utils/ancestry_analysis
pgscatalog_utils/aggregate_scores
fraposa_pgsc
If people would like to use the pipeline in the meantime I suggest adding a leading or trailing text character to your sample IDs.
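A minimal sketch of that workaround, assuming a psam file whose first column is the sample ID (as in the `#IID` example above) and whose header and comment lines start with `#`. The helper name `prefix_iids` and the default prefix `S` are made up for illustration; other genotype formats would need their IDs changed in a different way.

```python
import re


def prefix_iids(psam_text, prefix="S"):
    """Add a text prefix to the first column of every data row in a psam."""
    out = []
    for line in psam_text.splitlines():
        if line.startswith("#") or not line.strip():
            out.append(line)  # keep header, comment, and blank lines unchanged
        else:
            # Prefix only the first whitespace-delimited field (the sample ID).
            out.append(re.sub(r"^(\S+)", lambda m: prefix + m.group(1), line))
    return "\n".join(out) + "\n"


example = "#IID\tSEX\tpopulation\n655\t1\tBougainville\n"
print(prefix_iids(example), end="")
```

Every ID is prefixed, not only the numeric ones, so the result stays consistent within the file.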
@smlmbrt I masked the IDs greater than 2 digits long, so that makes a lot of sense! Thank you so much for looking into this!
@smlmbrt Thank you so much. It worked successfully for me when I changed the numeric-only IDs :)
@kmuenzen - thanks for the clarification, still it helped debug the problem so really useful! If you change the IIDs to have a character at the start it should fix it in the interim (we will make a patch soon).
@gayuk14 thanks for testing/clarifying that fixes the problem on your side as well.
.nextflow.log:
This was in pfile format on build hg19. It seems like the Mann–Whitney U test function within the pgscatalog_utils.ancestry.tools.compare_ancestry function is being called with empty datasets for some reason (target_df[col_pc] is empty / ref_train_df or target_df exist but are empty).
Adding some ancestry option might fix this. I am okay with using the closest ancestry group as opposed to the PC regression method, or whatever other methods would fix this / skip the test. Changing normalization_method to either mean or empirical results in the same error with mannwhitneyu.
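As an illustration of the kind of check that would surface this earlier, here is a hypothetical guard that could run before any two-sample test. The name `require_nonempty` and the error wording are assumptions for this sketch and are not part of `pgscatalog_utils` or SciPy.

```python
def require_nonempty(**groups):
    """Raise a descriptive error if any named group of values is empty."""
    empty = [name for name, values in groups.items() if len(values) == 0]
    if empty:
        raise ValueError(
            "No samples left in: " + ", ".join(empty)
            + ". Check that sample IDs match across the input files."
        )


# Passes silently when both groups contain values.
require_nonempty(reference=[-22.9793, -23.5658], target=[-7.2719])

# An empty target group is reported by name instead of failing deep in a library.
try:
    require_nonempty(reference=[-22.9793, -23.5658], target=[])
except ValueError as err:
    print(err)
```

Naming the empty group in the message points straight at the table that lost its rows, which the generic `nonzero size` error does not.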
_Originally posted by @AWS-crafter in https://github.com/PGScatalog/pgsc_calc/issues/175#issuecomment-1736455168_