Ancestry clustering yielding incorrect result

Fiwx commented 2 months ago

Description of the bug

On the most recent dev branch, I have noticed VCF files having incorrect ancestry results. That is, the PC values that are assigned are unusual and aren't near any other value in 1000G, and the ancestry assignment (closest population) is incorrect: the ancestry does not (not even close) to matching the self-reported ancestry of the samples, and previous runs on these samples have resulted in correct ancestry assignments (and normal PC values). I am using the run ancestry flag.

These are imputed VCFs. The match rate to the reference panel is about 1/3, and the match rate to the scorefile is just under 100%. The genome build used is correct.

I'm wondering if there are any ideas that I can explore to fix this or if this behavior has been seen before. Along with the failed ancestry analysis, the Z_norm2 values (which seem normal) are discrepant from the percentile in the most similar population values (which are at 100.0%, abnormally high).

I was able to replicate this across multiple VCFs, though each VCF was submitted one at a time.

Command used and terminal output

No response

Relevant files

No response

System information

Linux. -dev branch of pgsc_calc. Singularity.

nebfield commented 2 months ago

Thanks for the bug report. We did recently change how variants are filtered for PCA. @smlmbrt will investigate after his holidays 🌴

Can I double check if you're deleting the cache each time you use a new VCF?

Fiwx commented 2 months ago

I deleted the work directory and the results directory before running. Does removing the work directory delete the cache? I did not use –genotypes_cache. I did not attempt to do anything with NXF_SINGULARITY_CACHEDIR.

I also tested this on a earlier, but still recent version (I believe v2.0.0-beta), and had the same issue. Congratulations on the preprint, by the way.

smlmbrt commented 1 month ago

@Fiwx, are you applying the pipeline to individual samples? In v2.0.0-beta I added an additional filter for PCA variant inclusion that the MAF > 10% and the Missingness < 10% for variants in the target sample https://github.com/PGScatalog/pgsc_calc/blob/0f33b4cff01036bb0b020ad535c4414f070a4a3c/modules/local/ancestry/intersect_variants.nf#L33-L39

For individual samples reducing the MAF threshold to 0 should revert to the old behaviour. I think it's still sensible to applying the missingness filter to these variants though. It would be great if you could let us know if that fixes it, we may consider some sort of --single_sample flag that would then change these filters depending on how many people the PGS is being calculated for.

Fiwx commented 1 month ago

Yes, I saw those changes, and I also thought they were sensible. I'm not sure why they would cause an issue. By the way, I think the documentation still mentions 5% threshold for MAF: "minor allele frequency [MAF > 5%]."

I am running the pipeline with single samples, yes. Is there a MAF threshold flag/config, or should I just edit --maf_target 0.1 in the source?

Fiwx commented 1 month ago

@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:

            "Z_norm2": 1.9385237149563703,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": 2.644385079008054,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -2.3964278919739876,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": 1.7975167781748878,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -3.0629493759113258,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": -2.5554729425082163,
            "percentile_MostSimilarPop": 0.0

It also shows up in an unusual location in the PCA chart, away from other samples in 1000G.

Run information:

    "run_ancestry": "/home/user/data/pgsc_1000G_v1.tar.zst",
    "geno_ref": 0.1,
    "mind_ref": 0.1,
    "maf_ref": 0.05,
    "hwe_ref": 0.0001,
    "indep_pairwise_ref": "1000 50 0.05",
    "projection_method": "oadp",
    "ancestry_method": "RandomForest",
    "ref_label": "SuperPop",
    "n_popcomp": 5,
    "normalization_method": "empirical mean mean+var",
    "n_normalization": 4,
    "liftover": false,
    "target_build": "GRCh37",
    "min_lift": 0.95,
    "min_overlap": 0.0,

smlmbrt commented 1 month ago

@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:
            "Z_norm2": 1.9385237149563703,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": 2.644385079008054,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -2.3964278919739876,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": 1.7975167781748878,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -3.0629493759113258,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": -2.5554729425082163,
            "percentile_MostSimilarPop": 0.0
Are these the original results or after editing the module? Could you edit to --maf_target 0 --geno_miss 1 or --maf_target 0 --geno_miss 0.1 and re-running without any cacheing to see if that fixes it?

Fiwx commented 1 month ago

This was without changing any parameters. I will try those parameters now.

For maf_target, this will return the minor allele frequency at a position. I see the usefulness (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7077175/). Ancestry adjustment is done on the PC level for the Z_norms, so perhaps a large difference in one PC value could cause a large difference in score. For example, this figure shows large eigenvalue change right above MAF 0.02: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4143691/figure/F1/, and shows a transient jump at around 0.1, where the maf_target is currently set. Others have various stances: https://onlinelibrary.wiley.com/doi/epdf/10.1111/1755-0998.12995.

When seeking to exclude only singletons in alignments with missing data (a ubiquitous problem for reduced‐representation library prepara‐ tion methods), it is preferable to filter by the count (rather than frequency) of the minor allele, because variation in the amount of missing data across an alignment will cause a static frequency cut‐ off to remove different SFS classes at different sites https://github.com/cjbattey/LinckBattey2017_MAF_clustering/tree/master

For geno_miss, the PLINK .vmiss documentation is a bit unclear on this. What does geno_miss do? In other words, what counts as a missing dosage? I can think of a few interpretations:

chrom:pos pair being present
position containing no missing genotype information
position containing the exact same genotype
position containing the exact same genotype with the same ref/alt assignment
how samples have missing dosage data for an allele, out of 2 * number of samples

Fiwx commented 1 month ago

I ran it with the line changed to --maf_target 0 in modules/local/ancestry/intersect_variants.nf and got:

Jul-23 16:53:32.703 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 16:58:32.757 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:03:32.814 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:04:16.189 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: COMPLETED; exit: 1; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:04:16.194 [TaskFinalizer-6] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); work-dir=/home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce
  error [nextflow.exception.ProcessFailedException]: Process `PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)` terminated with an error exit status (1)
Jul-23 17:04:16.223 [TaskFinalizer-6] ERROR nextflow.processor.TaskProcessor - Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)'

Caused by:
  Process `PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)` terminated with an error exit status (1)

Command executed:

  pgscatalog-intersect --ref GRCh37_1000G_ALL.pvar.zst         --target GRCh37_file_ALL.pvar.zst         --chrom ALL         --maf_target 0         --geno_miss 0.1         --outdir .         -v

  n_matched=$(sed -n '3p' intersect_counts_ALL.txt)

  if [ $n_matched == "0" ]
  then
      echo "ERROR: No variants in intersection"
      exit 1
  else
      mv matched_variants.txt.gz file_ALL_matched.txt.gz
  fi

  cat <<-END_VERSIONS > versions.yml
  INTERSECT_VARIANTS:
      pgscatalog.match: $(echo $(python -c 'import pgscatalog.match; print(pgscatalog.match.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:01:42 INFO     Processed 12000000 TARGET variants

[...]

  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:04:16 INFO     Processed 27904794 TARGET variants
  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:04:16 INFO     Outputting TARGET variants -> target_variants.txt.gz
  Traceback (most recent call last):
    File "/app/pgscatalog.utils/.venv/bin/pgscatalog-intersect", line 8, in <module>
      sys.exit(run_intersect())
               ^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/intersect_cli.py", line 172, in run_intersect
      for v in heapq.merge(
    File "/usr/local/lib/python3.11/heapq.py", line 376, in merge
      h_append([key(value), order * direction, value, next])
                ^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/intersect_cli.py", line 174, in <lambda>
      key=lambda v: (v["CHR:POS:A0:A1"], v["ID_TARGET"], v["REF_TARGET"]),
                     ~^^^^^^^^^^^^^^^^^
  KeyError: 'CHR:POS:A0:A1'
  cp: '.command.out' and '.command.out' are the same file
  cp: '.command.err' and '.command.err' are the same file
  cp: '.command.trace' and '.command.trace' are the same file
[...]
Jul-23 17:04:16.239 [TaskFinalizer-6] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Jul-23 17:04:16.240 [Actor Thread 38] ERROR nextflow.Nextflow - ERROR: Matching subworkflow failed
Jul-23 17:04:16.251 [main] DEBUG nextflow.Session - Session await > all processes finished
Jul-23 17:04:16.252 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: local) - terminating tasks monitor poll loop
Jul-23 17:04:16.252 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jul-23 17:04:16.260 [Actor Thread 35] ERROR nextflow.Nextflow - ERROR: No results report written!
Jul-23 17:04:16.268 [Actor Thread 42] ERROR nextflow.Nextflow - ERROR: No scores calculated!
Jul-23 17:04:16.275 [Actor Thread 45] ERROR nextflow.Nextflow - ERROR: Projection subworkflow failed

The logs show this command was ran: pgscatalog-intersect --ref GRCh37_1000G_ALL.pvar.zst --target GRCh37_file_ALL.pvar.zst --chrom ALL --maf_target 0 --geno_miss 0.1 --outdir . -v

The target_variants.txt.gz file only contains a header line with the expected 'CHR:POS:A0:A1' column, with no other lines.

Fiwx commented 1 month ago

I changed it to 0.0 instead of "0" and it ran through. Here are the results:

Before and after the change:	PC	DEFAULT
PC1	23.60	23.66
PC2	-30.01	-30.60
PC3	-0.00	0.22
PC4	8.52	8.74
PC5	0.37	-0.10
PC6	0.30	0.67
PC7	-1.00	0.19
PC8	-7.53	-7.64
PC9	1.29	0.87
PC10	-1.36	-1.48

The random forest ancestry assignment probabilities were a bit more accurate after the change. The main issue of extreme percentile values has remained the same: a large percentage of the scores are very high (100.0) or low (0.0).

Changing geno_miss to 1.0, I got the same PC results, and the same issue with the extreme percentile scores.

Example for various scores, with --maf_target 0.0 and --geno_miss 1.0 set:

Here is an example of a normal result:

PGS002228_hmPOS_GRCh37

SUM value from sample: 1.0793
Z-score for SUM value 1.0793: -0.1270 (compared to 1000G reference)
Percentile for SUM value 1.0793: 45.2610
Z_norm1: 0.1183
Z_norm2: 0.1164
percentile_MostSimilarPop: 52.3006

The below results seem to be off, and happen for most scores:

PGS000758_hmPOS_GRCh37

SUM value from sample: 0.4187
Z-score for SUM value 0.4187: 2.0613 (compared to 1000G reference)
Percentile for SUM value 0.4187: 97.0281
Z_norm1: 1.9390
Z_norm2: 1.7975
percentile_MostSimilarPop: 100.0000

PGS000892_hmPOS_GRCh37

SUM value from sample: -0.1183
Z-score for SUM value -0.1183: -0.5594 (compared to 1000G reference)
Percentile for SUM value -0.1183: 29.5984
Z_norm1: 0.0261
Z_norm2: 0.0244
percentile_MostSimilarPop: 80.2147

PGS000921_hmPOS_GRCh37

SUM value from sample: -5.0703
Z-score for SUM value -5.0703: -1.9863 (compared to 1000G reference)
Percentile for SUM value -5.0703: 2.8112
Z_norm1: -2.2731
Z_norm2: -2.1926
percentile_MostSimilarPop: 0.0000

PGS002308_hmPOS_GRCh37

SUM value from sample: 0.2243
Z-score for SUM value 0.2243: -1.2707 (compared to 1000G reference)
Percentile for SUM value 0.2243: 6.6265
Z_norm1: -3.0208
Z_norm2: -3.0629
percentile_MostSimilarPop: 0.0000

PGS002764_hmPOS_GRCh37

SUM value from sample: 0.1875
Z-score for SUM value 0.1875: 0.3131 (compared to 1000G reference)
Percentile for SUM value 0.1875: 57.9920
Z_norm1: 1.1029
Z_norm2: 1.0425
percentile_MostSimilarPop: 98.4663

PGS002771_hmPOS_GRCh37

SUM value from sample: -0.5765
Z-score for SUM value -0.5765: -1.5752 (compared to 1000G reference)
Percentile for SUM value -0.5765: 7.7510
Z_norm1: -2.5792
Z_norm2: -2.5555
percentile_MostSimilarPop: 0.0000

PGS003725_hmPOS_GRCh37

SUM value from sample: -7.5222
Z-score for SUM value -7.5222: -0.1440 (compared to 1000G reference)
Percentile for SUM value -7.5222: 45.4217
Z_norm1: -0.5261
Z_norm2: -0.5248
percentile_MostSimilarPop: 4.7546

smlmbrt commented 1 month ago

See #343 for parallel discussion. Have fixed the filters so that low MAF variants can still be included and the next update of the calculator will allow for these changes (and should revert back to the pre-beta behaviour if the filters are removed).

nebfield commented 1 month ago

The new parameters are available in v2.0.0-beta.2 🥳

PGScatalog / pgsc_calc