PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a nextflow pipeline for polygenic score calculation
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

Ancestry analysis error (when only numeric IIDs are present) #177

Closed nebfield closed 10 months ago

nebfield commented 1 year ago
I got the same error, so I added -c custom.config with process_low set to use 16 GB of RAM. After doing that, there was a different error:
[everything going as expected...]
[90/3b9c76] process > PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)                                               [  0%] 0 of 1
[-        ] process > PGSCATALOG_PGSCALC:PGSCALC:REPORT:SCORE_REPORT                                                        -
[4b/ab791b] process > PGSCATALOG_PGSCALC:PGSCALC:DUMPSOFTWAREVERSIONS (1)                                                   [  0%] 0 of 1
ERROR ~ Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)'

Caused by:
  Process `PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)` terminated with an error exit status (1)

Command executed:

  # TODO: --ref_pcs is a horrible hack to select the first duplicate
  ancestry_analysis -d dante         -r reference         --psam GRCh37_1000G_ALL.psam         --ref_pcs ref_pcs/1.pcs         --target_pcs target_pcs/*.pcs         -x deg2_phase3.king.cutoff.out.id         -p SuperPop         -s aggregated_scores.txt.gz         -a RandomForest         --n_popcomp 5         -n empirical mean mean+var         --n_normalization 4         --outdir .         -v

  cat <<-END_VERSIONS > versions.yml
  ANCESTRY_ANALYSIS:
      pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)
executor >  local (18)
[c4/156ceb] process > PGSCATALOG_PGSCALC:PGSCALC:DOWNLOAD_SCOREFILES ([pgs_id:PGS003725, pgp_id:, trait_efo:])              [100%] 1 of 1 ✔
[49/3e1a68] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:SAMPLESHEET_JSON (dante_samplesheet_pfile.csv)                 [100%] 1 of 1 ✔
[db/c88021] process > PGSCATALOG_PGSCALC:PGSCALC:INPUT_CHECK:COMBINE_SCOREFILES (1)                                         [100%] 1 of 1 ✔
[-        ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELBIM                                          -
[skipped  ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_RELABELPVAR (dante chromosome ALL)                  [100%] 1 of 1, stored: 1 ✔
[-        ] process > PGSCATALOG_PGSCALC:PGSCALC:MAKE_COMPATIBLE:PLINK2_VCF                                                 -
[skipped  ] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:EXTRACT_DATABASE (1)                                      [100%] 1 of 1, stored: 1 ✔
[skipped  ] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (dante chromosome ALL)                 [100%] 1 of 1, stored: 1 ✔
[d4/cece30] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:FILTER_VARIANTS (dante GRCh37)                            [100%] 1 of 1 ✔
[56/61d6f1] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:PLINK2_MAKEBED_REF (reference chromosome)                 [100%] 1 of 1 ✔
[skipped  ] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:INTERSECT_THINNED (dante)                                 [100%] 1 of 1, stored: 1 ✔
[b0/970c2d] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:RELABEL_IDS (dante pvar)                                  [100%] 1 of 1 ✔
[8d/f0516c] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:PLINK2_MAKEBED_TARGET (dante chromosome)                  [100%] 1 of 1 ✔
[19/5711ab] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:PLINK2_ORIENT (dante)                                     [100%] 1 of 1 ✔
[a4/64eb37] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:FRAPOSA_PCA (reference)                                   [100%] 1 of 1 ✔
[skipped  ] process > PGSCATALOG_PGSCALC:PGSCALC:ANCESTRY_PROJECT:FRAPOSA_PROJECT (dante)                                   [100%] 1 of 1, stored: 1 ✔
[5f/cc49c9] process > PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_VARIANTS (dante chromosome ALL)                                [100%] 1 of 1 ✔
[b7/749c29] process > PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (dante)                                                [100%] 1 of 1 ✔
[e2/98946a] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:RELABEL_SCOREFILE_IDS (dante scorefile)                        [100%] 1 of 1 ✔
[3e/764947] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:RELABEL_AFREQ_IDS (dante afreq)                                [100%] 1 of 1 ✔
[3a/58afd3] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:PLINK2_SCORE (reference chromosome ALL effect type additive 0) [100%] 2 of 2 ✔
[f9/4ee72c] process > PGSCATALOG_PGSCALC:PGSCALC:APPLY_SCORE:SCORE_AGGREGATE (dante)                                        [100%] 1 of 1 ✔
[90/3b9c76] process > PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)                                               [100%] 1 of 1, failed: 1 ✘
[-        ] process > PGSCATALOG_PGSCALC:PGSCALC:REPORT:SCORE_REPORT                                                        -
[4b/ab791b] process > PGSCATALOG_PGSCALC:PGSCALC:DUMPSOFTWAREVERSIONS (1)                                                   [  0%] 0 of 1
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)'

Caused by:
  Process `PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)` terminated with an error exit status (1)

Command executed:

  # TODO: --ref_pcs is a horrible hack to select the first duplicate
  ancestry_analysis -d dante         -r reference         --psam GRCh37_1000G_ALL.psam         --ref_pcs ref_pcs/1.pcs         --target_pcs target_pcs/*.pcs         -x deg2_phase3.king.cutoff.out.id         -p SuperPop         -s aggregated_scores.txt.gz         -a RandomForest         --n_popcomp 5         -n empirical mean mean+var         --n_normalization 4         --outdir .         -v

  cat <<-END_VERSIONS > versions.yml
  ANCESTRY_ANALYSIS:
      pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  root: 2023-09-26 23:33:09 DEBUG    Verbose logging enabled
  pgscatalog_utils.ancestry.read: 2023-09-26 23:33:09 DEBUG    Reading PCA projection: ref_pcs/1.pcs
  pgscatalog_utils.ancestry.read: 2023-09-26 23:33:09 DEBUG    Initialising combined DF
  pgscatalog_utils.ancestry.read: 2023-09-26 23:33:09 DEBUG    Filtering to relevant PCs
  pgscatalog_utils.ancestry.read: 2023-09-26 23:33:09 DEBUG    Flagging related samples with: deg2_phase3.king.cutoff.out.id
  pgscatalog_utils.ancestry.read: 2023-09-26 23:33:09 DEBUG    Reading PCA projection: target_pcs/001.pcs
  pgscatalog_utils.ancestry.read: 2023-09-26 23:33:09 DEBUG    Initialising combined DF
  pgscatalog_utils.ancestry.read: 2023-09-26 23:33:09 DEBUG    Filtering to relevant PCs
  pgscatalog_utils.ancestry.read: 2023-09-26 23:33:09 DEBUG    Reading aggregated score data: aggregated_scores.txt.gz
  Traceback (most recent call last):
    File "/venv/bin/ancestry_analysis", line 8, in <module>
      sys.exit(ancestry_analysis())
    File "/venv/lib/python3.10/site-packages/pgscatalog_utils/ancestry/ancestry_analysis.py", line 42, in ancestry_analysis
      ancestry_ref, ancestry_target, compare_info = compare_ancestry(ref_df=reference_df,
    File "/venv/lib/python3.10/site-packages/pgscatalog_utils/ancestry/tools.py", line 79, in compare_ancestry
      mwu_pc = mannwhitneyu(ref_train_df[col_pc], target_df[col_pc])
    File "/venv/lib/python3.10/site-packages/scipy/stats/_axis_nan_policy.py", line 503, in axis_nan_policy_wrapper
      res = hypotest_fun_out(*samples, **kwds)
    File "/venv/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py", line 460, in mannwhitneyu
      _mwu_input_validation(x, y, use_continuity, alternative, axis, method))
    File "/venv/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py", line 203, in _mwu_input_validation
      raise ValueError('`x` and `y` must be of nonzero size.')
  ValueError: `x` and `y` must be of nonzero size.

Work dir:
  /home/ubuntu/pgsc_calc/work/90/3b9c761ac0ba7ebb4b64aa470c7685

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: No results report written!

 -- Check '.nextflow.log' file for details

.nextflow.log:

[...]
Sep-26 23:33:09.777 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 23; name: PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1); status: COMPLETED; exit: 1; error: -; workDir: /home/ubuntu/pgsc_calc/work/90/3b9c761ac0ba7ebb4b64aa470c7685]
Sep-26 23:33:09.780 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Sep-26 23:33:09.780 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1); work-dir=/home/ubuntu/pgsc_calc/work/90/3b9c761ac0ba7ebb4b64aa470c7685
  error [nextflow.exception.ProcessFailedException]: Process `PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)` terminated with an error exit status (1)
Sep-26 23:33:09.780 [Task submitter] INFO  nextflow.Session - [4b/ab791b] Submitted process > PGSCATALOG_PGSCALC:PGSCALC:DUMPSOFTWAREVERSIONS (1)
Sep-26 23:33:09.791 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)'

Caused by:
  Process `PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)` terminated with an error exit status (1)
[...]

This was in pfile format on build hg19. It seems like the Mann–Whitney U test function within the pgscatalog_utils.ancestry.tools.compare_ancestry function is being called with empty datasets for some reason (target_df[col_pc] is empty / ref_train_df or target_df exist but are empty).

Adding some ancestry option might fix this. I am okay with using the closest ancestry group as opposed to the PC regression method, or whatever other methods would fix this / skip the test. Changing normalization_method to either mean or empirical results in the same error with mannwhitneyu.

_Originally posted by @AWS-crafter in https://github.com/PGScatalog/pgsc_calc/issues/175#issuecomment-1736455168_

nebfield commented 1 year ago

@AWS-crafter I created a new issue from your comment to help investigate this specific problem you're experiencing

When you run the workflow without the --run_ancestry parameter, how well do your genomes match the input scoring files? Very low match rates could cause an error like this.

@smlmbrt will be able to help more than me for this specific issue because he wrote the ancestry analysis code 🧙

smlmbrt commented 1 year ago

@AWS-crafter are you by any chance running the pipeline on a single sample?

AWS-crafter commented 1 year ago

@smlmbrt Yes, I'm running it on a single sample. I'm just testing for now, so I'm using non-imputed single-sample WGS files. In the future I will only be using imputed data (probably from BEAGLE). I will run without ancestry to determine the match rate and update this comment. A similar non-imputed WGS file, which completed successfully, had this match rate:

"Reference matching summary:" % matched: 6.04

Then, under "Summary" and the sampleset for the WGS file: Match %: 46.9

Could the following be happening? I have seen something similar in other tools (e.g., the Michigan Imputation Server).

  1. The pipeline looks at the genotype files indicated by the sampleset
  2. For that sampleset, for each position, it checks to see if all files have the same variant at that position (e.g., if everyone out of 150 people has genotype 0/0 at the position). If so (if site is monomorphic within sampleset), it drops this position or otherwise does not use it
  3. For a single-sample genotype file, every homozygous site is of course invariant, so it gets dropped

For running a single sample, an ideal process might be dropping if the site is monomorphic in the sample (i.e. sampleset) and the reference panel.
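
The rule proposed above can be sketched in a few lines. This only illustrates the commenter's suggestion; it is not how pgsc_calc is known to filter variants. The names `sites_to_keep` and `is_monomorphic` are hypothetical, genotypes are assumed to be plain strings such as `0/0` or `0|1`, and "monomorphic in the sample and the reference panel" is read here as "only one allele observed across both combined".

```python
from typing import Dict, List, Set

# Hypothetical in-memory layout: variant ID -> one genotype call per sample.
Genotypes = Dict[str, List[str]]


def observed_alleles(calls: List[str]) -> Set[str]:
    """Collect the distinct alleles seen in a list of genotype calls."""
    seen: Set[str] = set()
    for call in calls:
        # Treat phased (0|1) and unphased (0/1) calls the same way.
        seen.update(call.replace("|", "/").split("/"))
    seen.discard(".")  # ignore missing alleles
    return seen


def is_monomorphic(calls: List[str]) -> bool:
    """True when at most one allele is observed across all calls."""
    return len(observed_alleles(calls)) <= 1


def sites_to_keep(target: Genotypes, reference: Genotypes) -> List[str]:
    """Keep shared sites unless they are monomorphic in target + reference."""
    kept = []
    for variant_id, target_calls in target.items():
        if variant_id not in reference:
            continue  # only sites present in both sets are considered
        if not is_monomorphic(target_calls + reference[variant_id]):
            kept.append(variant_id)
    return kept


# A single homozygous sample is monomorphic on its own, but the site is kept
# when the reference panel shows variation there.
single_sample = {"rs1": ["0/0"], "rs2": ["1/1"]}
panel = {"rs1": ["0/0", "0/0"], "rs2": ["0/0", "0/1"]}
print(sites_to_keep(single_sample, panel))
```

Under this reading, a one-sample target would lose only the sites that are also invariant in the reference panel, rather than every homozygous site.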

kmuenzen commented 11 months ago

Hi there, any updates on fixes for this issue?

smlmbrt commented 11 months ago

@kmuenzen are you referring to it working on a single sample or the low-match % when using non-imputed genotypes? I will look into the first one soon.

kmuenzen commented 11 months ago

@smlmbrt the first one. Thanks so much!

smlmbrt commented 11 months ago

@kmuenzen could you share the error you get when ancestry_analysis fails? I just tried running the pipeline with a single sample and it doesn't seem to fail.

kmuenzen commented 11 months ago

@smlmbrt

Execution cancelled -- Finishing pending tasks before exit
Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)'

Caused by:
  Process `PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)` terminated with an error exit status (1)

Command executed:

  # TODO: --ref_pcs is a horrible hack to select the first duplicate
  ancestry_analysis -d biome-test         -r reference         --psam GRCh38_1000G_ALL.psam         --ref_pcs ref_pcs/1.pcs         --target_pcs target_pcs/*.pcs         -x

  cat <<-END_VERSIONS > versions.yml
  ANCESTRY_ANALYSIS:
      pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  root: 2023-10-20 23:55:42 DEBUG    Verbose logging enabled
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG    Reading PCA projection: ref_pcs/1.pcs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG    Initialising combined DF
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG    Filtering to relevant PCs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG    Flagging related samples with: GRCh38_1000G.king.cutoff.out.id
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Reading PCA projection: target_pcs/001.pcs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Initialising combined DF
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Reading PCA projection: target_pcs/002.pcs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Appending to combined DF
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Filtering to relevant PCs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Reading aggregated score data: aggregated_scores.txt.gz
  Traceback (most recent call last):
    File "/venv/bin/ancestry_analysis", line 8, in <module>
      sys.exit(ancestry_analysis())
    File "/venv/lib/python3.10/site-packages/pgscatalog_utils/ancestry/ancestry_analysis.py", line 42, in ancestry_analysis
      ancestry_ref, ancestry_target, compare_info = compare_ancestry(ref_df=reference_df,
    File "/venv/lib/python3.10/site-packages/pgscatalog_utils/ancestry/tools.py", line 79, in compare_ancestry
      mwu_pc = mannwhitneyu(ref_train_df[col_pc], target_df[col_pc])
    File "/venv/lib/python3.10/site-packages/scipy/stats/_axis_nan_policy.py", line 503, in axis_nan_policy_wrapper
      res = hypotest_fun_out(*samples, **kwds)
    File "/venv/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py", line 460, in mannwhitneyu
      _mwu_input_validation(x, y, use_continuity, alternative, axis, method))
    File "/venv/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py", line 203, in _mwu_input_validation
      raise ValueError('`x` and `y` must be of nonzero size.')
  ValueError: `x` and `y` must be of nonzero size.

Work dir:
  /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

ERROR: No results report written!

gayuk14 commented 11 months ago

@smlmbrt @kmuenzen

Execution cancelled -- Finishing pending tasks before exit
Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)'

Caused by:
  Process `PGSCATALOG_PGSCALC:PGSCALC:REPORT:ANCESTRY_ANALYSIS (1)` terminated with an error exit status (1)

Command executed:

  # TODO: --ref_pcs is a horrible hack to select the first duplicate
  ancestry_analysis -d biome-test         -r reference         --psam GRCh38_1000G_ALL.psam         --ref_pcs ref_pcs/1.pcs         --target_pcs target_pcs/*.pcs         -x

  cat <<-END_VERSIONS > versions.yml
  ANCESTRY_ANALYSIS:
      pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  root: 2023-10-20 23:55:42 DEBUG    Verbose logging enabled
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG    Reading PCA projection: ref_pcs/1.pcs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG    Initialising combined DF
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG    Filtering to relevant PCs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:42 DEBUG    Flagging related samples with: GRCh38_1000G.king.cutoff.out.id
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Reading PCA projection: target_pcs/001.pcs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Initialising combined DF
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Reading PCA projection: target_pcs/002.pcs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Appending to combined DF
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Filtering to relevant PCs
  pgscatalog_utils.ancestry.read: 2023-10-20 23:55:43 DEBUG    Reading aggregated score data: aggregated_scores.txt.gz
  Traceback (most recent call last):
    File "/venv/bin/ancestry_analysis", line 8, in <module>
      sys.exit(ancestry_analysis())
    File "/venv/lib/python3.10/site-packages/pgscatalog_utils/ancestry/ancestry_analysis.py", line 42, in ancestry_analysis
      ancestry_ref, ancestry_target, compare_info = compare_ancestry(ref_df=reference_df,
    File "/venv/lib/python3.10/site-packages/pgscatalog_utils/ancestry/tools.py", line 79, in compare_ancestry
      mwu_pc = mannwhitneyu(ref_train_df[col_pc], target_df[col_pc])
    File "/venv/lib/python3.10/site-packages/scipy/stats/_axis_nan_policy.py", line 503, in axis_nan_policy_wrapper
      res = hypotest_fun_out(*samples, **kwds)
    File "/venv/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py", line 460, in mannwhitneyu
      _mwu_input_validation(x, y, use_continuity, alternative, axis, method))
    File "/venv/lib/python3.10/site-packages/scipy/stats/_mannwhitneyu.py", line 203, in _mwu_input_validation
      raise ValueError('`x` and `y` must be of nonzero size.')
  ValueError: `x` and `y` must be of nonzero size.

Work dir:
  /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

ERROR: No results report written!

I am getting the same error. Could you let me know if you resolved this issue?

smlmbrt commented 11 months ago

Could you run:

head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/ref_pcs/1.pcs
head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/target_pcs/001.pcs
head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/target_pcs/002.pcs
gzcat aggregated_scores.txt.gz | head

It would be helpful to see what the files look like. It seems like your files have more than 1 sample, so it may be that the PCA calculation is going wrong and returning some empty dfs.
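
Beyond eyeballing the `head` output, the sample IDs in the two files can be compared directly. The following is a minimal sketch, assuming both files are tab-separated with an `IID` column and that the scores file has a `sampleset` column, as in the outputs shown in this thread. `read_iids` and `report_overlap` are hypothetical helper names, not part of pgscatalog_utils.

```python
import csv
from typing import Dict, Iterable, List, Set


def read_iids(lines: Iterable[str]) -> Set[str]:
    """Read the IID column of a tab-separated table, keeping IDs as text."""
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["IID"].strip() for row in reader}


def report_overlap(
    pcs_lines: Iterable[str], score_lines: Iterable[str], sampleset: str
) -> Dict[str, List[str]]:
    """Compare IIDs in a .pcs file with those of one sampleset in the scores."""
    pcs_ids = read_iids(pcs_lines)
    reader = csv.DictReader(score_lines, delimiter="\t")
    score_ids = {
        row["IID"].strip() for row in reader if row["sampleset"] == sampleset
    }
    return {
        "shared": sorted(pcs_ids & score_ids),
        "pcs_only": sorted(pcs_ids - score_ids),
        "scores_only": sorted(score_ids - pcs_ids),
    }


# Usage on real files (run from the process work dir):
#   import gzip
#   with open("target_pcs/001.pcs", newline="") as pcs, \
#        gzip.open("aggregated_scores.txt.gz", "rt", newline="") as scores:
#       print(report_overlap(pcs, scores, "biome-test"))
```

An empty `shared` list would point to an ID mismatch between the projection and the scores. Because every ID is kept as text here, this check will not reveal a mismatch that only arises when a tool parses numeric IDs as integers.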

kmuenzen commented 11 months ago

Sure thing--here you go! Thank you!

[muenzk01@regen2 ~]$ head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/ref_pcs/1.pcs
IID     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10
HG00096 -22.9793        -50.2136        13.6757 18.6205 -1.0675 3.5750  -1.7383 0.5516  0.6388  -0.9387
HG00097 -23.5658        -49.7249        13.1469 17.2915 -0.3716 5.0207  -1.2759 1.1439  1.1458  -5.5456
HG00099 -23.9904        -50.5022        14.3540 17.9357 -2.2576 5.7937  -1.7419 1.8302  -2.7424 0.3508
HG00100 -24.1005        -50.2796        16.1124 18.9870 -0.7569 3.0548  -1.2720 0.6963  -2.0359 1.7524
HG00101 -24.5031        -49.1951        14.4492 17.6531 -0.9851 6.3107  -3.9469 -0.2086 -0.4370 -0.5810
HG00102 -23.4615        -50.5164        13.0669 18.3179 -1.7605 4.8565  -1.5584 -0.7356 2.8225  -0.5260
HG00103 -23.0385        -49.4304        13.3134 18.9738 -0.2883 6.3473  -0.3034 1.1887  -4.3406 0.8370
HG00105 -25.3557        -49.6544        14.9909 17.6121 0.8649  3.7357  -1.1401 0.5968  -4.6389 -2.0594
HG00106 -24.4528        -50.5133        12.4388 16.0958 2.9962  4.7232  -2.8032 2.3974  -1.3294 0.3437
[muenzk01@regen2 ~]$ head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/target_pcs/001.pcs
IID     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10
XXXXXX53        -7.2719 -36.7665        12.6320 10.1940 1.4798  -7.0528 3.4545  -0.6976 0.2670  1.5669
XXXXXX83        -22.8284        -48.1951        14.8247 17.0373 -0.9858 4.2221  -1.0541 -1.0974 -2.1622 2.0034
XXXXXX07        -15.9913        -16.1299        28.3114 -22.3567        2.8505  -5.6035 0.5499  -0.7286 -0.0712 -0.5644
XXXXXX65        -34.9420        51.6389 7.6477  11.2796 -13.9236        -1.9328 -0.8617 0.8240  -3.8092 -14.6289
XXXXXX82        -19.2125        -41.9998        6.1198  14.6670 3.7215  -17.5066        5.1994  2.5246  -2.6737 -0.6892
XXXXXX12        -18.8852        -43.9067        7.3964  15.7738 2.5338  -16.1719        4.5616  0.3198  -0.2936 -4.5557
XXXXXX59        -22.8107        -49.2593        12.0117 18.6088 0.1927  3.5462  -3.9146 0.1334  1.2415  0.6785
XXXXXX62        -19.9087        -46.6020        7.5153  17.0778 1.6574  -15.3447        7.0460  1.9859  -2.8799 -0.6778
XXXXXX17        -37.1313        53.2683 9.7514  8.8948  -18.3655        -1.8795 1.1195  -0.6213 -2.2004 -9.5400
[muenzk01@regen2 ~]$ head /sc/arion/projects/kennylab/travis/kenny/ctg/pgsc/work/ad/1697304ec1a7ea0faa9f7eab4fc27d/target_pcs/002.pcs
IID     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10
93      69.6788 6.5157  1.0354  0.7411  -0.4039 1.0143  0.5329  2.0902  -0.8908 1.0005
X19     62.9088 4.9745  1.5477  -0.1809 -1.5857 1.3580  -1.3621 4.0277  -0.4403 2.2139
X27     -4.8831 -27.7794        17.4015 -2.6830 3.7731  -12.7259        1.5784  -1.1199 -1.4060 -2.7215
X15     16.1620 -16.3076        13.3903 -2.2331 2.9216  -5.5014 0.3389  1.6810  1.9261  1.2277
X54     12.8512 -17.1292        16.6862 -5.0830 3.2757  -5.4289 -1.9472 1.2895  -0.3060 3.2039
X80     -9.1626 -26.6734        20.0412 -4.5341 2.8686  -10.4721        1.3198  -1.4688 2.3520  1.7657
X91     -13.0506        -31.2799        18.3390 -2.4302 3.9486  -10.5128        4.4644  0.3817  4.5359  -1.1057
X07     46.3433 -3.3335 6.7906  -1.0008 0.9525  -2.5622 -0.4529 0.0768  -0.8838 -0.0298
X52     17.6063 -17.0715        13.7085 -3.1210 1.6424  -4.3012 1.5122  1.3648  -2.0695 0.2564
[muenzk01@regen2 1697304ec1a7ea0faa9f7eab4fc27d]$ zcat aggregated_scores.txt.gz | head
sampleset       IID     DENOM   PGS003197_hmPOS_GRCh38_SUM      PGS003197_hmPOS_GRCh38_AVG
biome-test      93      16229154.0      -0.0451909      -2.7845505686864514e-09
biome-test      X19     16229154.0      -0.103141       -6.355291224668889e-09
biome-test      X27     16229154.0      -0.37267        -2.2962996099488613e-08
biome-test      X15     16229154.0      -0.394929       -2.4334540173813124e-08
biome-test      X54     16229154.0      -0.502914       -3.098830659934584e-08
biome-test      X80     16229154.0      -0.281765       -1.7361656682782108e-08
biome-test      X91     16229154.0      -0.27768        -1.710994916925429e-08
biome-test      X07     16229154.0      -0.246382       -1.5181444454837262e-08
biome-test      X52     16229154.0      -0.61067        -3.762796261591948e-08

smlmbrt commented 11 months ago

@kmuenzen - do the sample IDs in the target PCs look right to you? Thinking of XXXXXX53 vs 93 vs. X19.

smlmbrt commented 11 months ago

I finally was able to reproduce this bug - it happens when all the IDs in the psam are numeric!

$ cat numeric_OCE.psam | head
#IID    SEX     population      latitude        longitude       region
655     1       Bougainville    -6      155     OCEANIA
$ cat target_pcs/001.pcs | head
IID     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10
655     -20.6509        28.8798 -18.9365        -0.6973 -1.0790 0.0627  -0.8486 2.1669  -14.1303        -8.9729
$ gzcat aggregated_scores.txt.gz | head
sampleset       IID     DENOM   PGS000004_hmPOS_GRCh38_SUM      PGS000018_hmPOS_GRCh38_SUM      PGS000027_hmPOS_GRCh38_SUM      PGS000036_hmPOS_GRCh38_SUM      PGS000065_hmPOS_GRCh38_SUM   PGS000889_hmPOS_GRCh38_SUM      PGS003436_hmPOS_GRCh38_SUM      PGS000004_hmPOS_GRCh38_AVG      PGS000018_hmPOS_GRCh38_AVG      PGS000027_hmPOS_GRCh38_AVG      PGS000036_hmPOS_GRCh38_AVG   PGS000065_hmPOS_GRCh38_AVG      PGS000889_hmPOS_GRCh38_AVG      PGS003436_hmPOS_GRCh38_AVG
HGDP    655     7300910.0       -0.93377        0.41891999999999996     38.78698        -2359.2295599999998     -0.11405929999999999    41.225443       4.27355 -1.2789775521133666e-07      5.7379148626678035e-08  5.312622673064043e-06   -0.0003231418494406861  -1.5622614167275037e-08 5.646617065543884e-06   5.853448405746681e-07
reference       HG00096 7300910.0       -0.47219999999999995    -0.3971499999999999     38.28005        -2272.516       -0.3333011      39.4819 4.93062 -6.467686904783102e-08  -5.4397328552194165e-08      5.243188862758204e-06   -0.0003112647601463379  -4.565199406649308e-08  5.407805328376874e-06   6.753432106408653e-07
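
One plausible mechanism, which this thread does not confirm, is type inference while reading the tables: a column holding only numeric IDs may be parsed as integers, while the same ID in a file that also contains IDs like `HG00096` stays a string. The two no longer match, so a join on IID comes back empty. The sketch below imitates that behavior with a hypothetical `infer_column` helper and makes no claim about the actual pgscatalog_utils code.

```python
from typing import List, Union


def infer_column(values: List[str]) -> List[Union[int, str]]:
    """Imitate a reader that converts a column to int when all values are digits."""
    if values and all(v.isdigit() for v in values):
        return [int(v) for v in values]
    return list(values)


# target_pcs/001.pcs holds only the numeric ID, so it is inferred as int.
pcs_ids = infer_column(["655"])
# aggregated_scores.txt.gz also holds reference IDs, so it stays text.
score_ids = infer_column(["655", "HG00096"])

# 655 (int) never equals "655" (str), so the intersection is empty.
shared = set(pcs_ids) & set(score_ids)

# Forcing both sides to text restores the match.
shared_as_text = {str(i) for i in pcs_ids} & {str(i) for i in score_ids}

print(shared, shared_as_text)
```

If this is what happens, an empty intersection would leave the target data frame with no rows, which is consistent with the `x and y must be of nonzero size` error raised by `mannwhitneyu`.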

Fix handling of numeric-only IIDs in:

If people would like to use the pipeline in the meantime I suggest adding a leading or trailing text character to your sample IDs.
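
The interim workaround can be scripted. This is a minimal sketch, assuming a tab-separated `.psam` with a header line that starts with `#` and names an `IID` column, as in the example above. `prefix_numeric_iids` and the default prefix `S` are hypothetical choices.

```python
from typing import Iterable, List


def prefix_numeric_iids(lines: Iterable[str], prefix: str = "S") -> List[str]:
    """Return psam lines with a text prefix added to purely numeric IIDs."""
    out: List[str] = []
    iid_col = 0  # assumed position until a header naming IID is seen
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith("#"):
            header = text.lstrip("#").split("\t")
            if "IID" in header:
                iid_col = header.index("IID")
            out.append(text)  # header lines are copied unchanged
            continue
        if not text:
            out.append(text)
            continue
        fields = text.split("\t")
        if fields[iid_col].isdigit():
            fields[iid_col] = prefix + fields[iid_col]
        out.append("\t".join(fields))
    return out


# Usage on a real file:
#   with open("numeric_OCE.psam") as src:
#       fixed = prefix_numeric_iids(src)
#   with open("relabelled.psam", "w") as dst:
#       dst.write("\n".join(fixed) + "\n")
```

Only the sample table is rewritten here. Any other input that refers to the same IDs would need the same prefix so the IDs stay consistent.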

kmuenzen commented 11 months ago

@smlmbrt I masked the IDs greater than 2 digits long, so that makes a lot of sense! Thank you so much for looking into this!

gayuk14 commented 11 months ago

@smlmbrt Thank you so much. It worked successfully for me when I changed the numeric-only IDs :)

smlmbrt commented 11 months ago

@kmuenzen - thanks for the clarification. It still helped debug the problem, so it was really useful! If you change the IIDs to have a character at the start, it should fix the problem in the interim (we will make a patch soon).

@gayuk14 thanks for testing/clarifying that fixes the problem on your side as well.