Closed Fiwx closed 1 month ago
Thanks for the bug report. We did recently change how variants are filtered for PCA. @smlmbrt will investigate after his holidays 🌴
Can I double check if you're deleting the cache each time you use a new VCF?
I deleted the work directory and the results directory before running. Does removing the work directory delete the cache? I did not use –genotypes_cache. I did not attempt to do anything with NXF_SINGULARITY_CACHEDIR.
I also tested this on a earlier, but still recent version (I believe v2.0.0-beta), and had the same issue. Congratulations on the preprint, by the way.
@Fiwx, are you applying the pipeline to individual samples? In v2.0.0-beta
I added an additional filter for PCA variant inclusion that the MAF > 10% and the Missingness < 10% for variants in the target sample https://github.com/PGScatalog/pgsc_calc/blob/0f33b4cff01036bb0b020ad535c4414f070a4a3c/modules/local/ancestry/intersect_variants.nf#L33-L39
For individual samples reducing the MAF threshold to 0 should revert to the old behaviour. I think it's still sensible to applying the missingness filter to these variants though. It would be great if you could let us know if that fixes it, we may consider some sort of --single_sample
flag that would then change these filters depending on how many people the PGS is being calculated for.
Yes, I saw those changes, and I also thought they were sensible. I'm not sure why they would cause an issue. By the way, I think the documentation still mentions 5% threshold for MAF: "minor allele frequency [MAF > 5%]."
I am running the pipeline with single samples, yes. Is there a MAF threshold flag/config, or should I just edit --maf_target 0.1 in the source?
@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:
"Z_norm2": 1.9385237149563703,
"percentile_MostSimilarPop": 100.0
"Z_norm2": 2.644385079008054,
"percentile_MostSimilarPop": 100.0
"Z_norm2": -2.3964278919739876,
"percentile_MostSimilarPop": 0.0
"Z_norm2": 1.7975167781748878,
"percentile_MostSimilarPop": 100.0
"Z_norm2": -3.0629493759113258,
"percentile_MostSimilarPop": 0.0
"Z_norm2": -2.5554729425082163,
"percentile_MostSimilarPop": 0.0
It also shows up in an unusual location in the PCA chart, away from other samples in 1000G.
Run information:
"run_ancestry": "/home/user/data/pgsc_1000G_v1.tar.zst",
"geno_ref": 0.1,
"mind_ref": 0.1,
"maf_ref": 0.05,
"hwe_ref": 0.0001,
"indep_pairwise_ref": "1000 50 0.05",
"projection_method": "oadp",
"ancestry_method": "RandomForest",
"ref_label": "SuperPop",
"n_popcomp": 5,
"normalization_method": "empirical mean mean+var",
"n_normalization": 4,
"liftover": false,
"target_build": "GRCh37",
"min_lift": 0.95,
"min_overlap": 0.0,
@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:
"Z_norm2": 1.9385237149563703, "percentile_MostSimilarPop": 100.0 "Z_norm2": 2.644385079008054, "percentile_MostSimilarPop": 100.0 "Z_norm2": -2.3964278919739876, "percentile_MostSimilarPop": 0.0 "Z_norm2": 1.7975167781748878, "percentile_MostSimilarPop": 100.0 "Z_norm2": -3.0629493759113258, "percentile_MostSimilarPop": 0.0 "Z_norm2": -2.5554729425082163, "percentile_MostSimilarPop": 0.0
Are these the original results or after editing the module? Could you edit to
--maf_target 0 --geno_miss 1
or--maf_target 0 --geno_miss 0.1
and re-running without any cacheing to see if that fixes it?
This was without changing any parameters. I will try those parameters now.
For maf_target, this will return the minor allele frequency at a position. I see the usefulness (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7077175/). Ancestry adjustment is done on the PC level for the Z_norms, so perhaps a large difference in one PC value could cause a large difference in score. For example, this figure shows large eigenvalue change right above MAF 0.02: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4143691/figure/F1/, and shows a transient jump at around 0.1, where the maf_target is currently set. Others have various stances: https://onlinelibrary.wiley.com/doi/epdf/10.1111/1755-0998.12995.
When seeking to exclude only singletons in alignments with missing data (a ubiquitous problem for reduced‐representation library prepara‐ tion methods), it is preferable to filter by the count (rather than frequency) of the minor allele, because variation in the amount of missing data across an alignment will cause a static frequency cut‐ off to remove different SFS classes at different sites https://github.com/cjbattey/LinckBattey2017_MAF_clustering/tree/master
For geno_miss, the PLINK .vmiss documentation is a bit unclear on this. What does geno_miss do? In other words, what counts as a missing dosage? I can think of a few interpretations:
I ran it with the line changed to --maf_target 0 in modules/local/ancestry/intersect_variants.nf and got:
Jul-23 16:53:32.703 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 16:58:32.757 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:03:32.814 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:04:16.189 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: COMPLETED; exit: 1; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:04:16.194 [TaskFinalizer-6] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); work-dir=/home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce
error [nextflow.exception.ProcessFailedException]: Process `PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)` terminated with an error exit status (1)
Jul-23 17:04:16.223 [TaskFinalizer-6] ERROR nextflow.processor.TaskProcessor - Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)'
Caused by:
Process `PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)` terminated with an error exit status (1)
Command executed:
pgscatalog-intersect --ref GRCh37_1000G_ALL.pvar.zst --target GRCh37_file_ALL.pvar.zst --chrom ALL --maf_target 0 --geno_miss 0.1 --outdir . -v
n_matched=$(sed -n '3p' intersect_counts_ALL.txt)
if [ $n_matched == "0" ]
then
echo "ERROR: No variants in intersection"
exit 1
else
mv matched_variants.txt.gz file_ALL_matched.txt.gz
fi
cat <<-END_VERSIONS > versions.yml
INTERSECT_VARIANTS:
pgscatalog.match: $(echo $(python -c 'import pgscatalog.match; print(pgscatalog.match.__version__)'))
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
pgscatalog.match.cli.intersect_cli: 2024-07-23 17:01:42 INFO Processed 12000000 TARGET variants
[...]
pgscatalog.match.cli.intersect_cli: 2024-07-23 17:04:16 INFO Processed 27904794 TARGET variants
pgscatalog.match.cli.intersect_cli: 2024-07-23 17:04:16 INFO Outputting TARGET variants -> target_variants.txt.gz
Traceback (most recent call last):
File "/app/pgscatalog.utils/.venv/bin/pgscatalog-intersect", line 8, in <module>
sys.exit(run_intersect())
^^^^^^^^^^^^^^^
File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/intersect_cli.py", line 172, in run_intersect
for v in heapq.merge(
File "/usr/local/lib/python3.11/heapq.py", line 376, in merge
h_append([key(value), order * direction, value, next])
^^^^^^^^^^
File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/intersect_cli.py", line 174, in <lambda>
key=lambda v: (v["CHR:POS:A0:A1"], v["ID_TARGET"], v["REF_TARGET"]),
~^^^^^^^^^^^^^^^^^
KeyError: 'CHR:POS:A0:A1'
cp: '.command.out' and '.command.out' are the same file
cp: '.command.err' and '.command.err' are the same file
cp: '.command.trace' and '.command.trace' are the same file
[...]
Jul-23 17:04:16.239 [TaskFinalizer-6] INFO nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Jul-23 17:04:16.240 [Actor Thread 38] ERROR nextflow.Nextflow - ERROR: Matching subworkflow failed
Jul-23 17:04:16.251 [main] DEBUG nextflow.Session - Session await > all processes finished
Jul-23 17:04:16.252 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: local) - terminating tasks monitor poll loop
Jul-23 17:04:16.252 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jul-23 17:04:16.260 [Actor Thread 35] ERROR nextflow.Nextflow - ERROR: No results report written!
Jul-23 17:04:16.268 [Actor Thread 42] ERROR nextflow.Nextflow - ERROR: No scores calculated!
Jul-23 17:04:16.275 [Actor Thread 45] ERROR nextflow.Nextflow - ERROR: Projection subworkflow failed
The logs show this command was ran: pgscatalog-intersect --ref GRCh37_1000G_ALL.pvar.zst --target GRCh37_file_ALL.pvar.zst --chrom ALL --maf_target 0 --geno_miss 0.1 --outdir . -v
The target_variants.txt.gz file only contains a header line with the expected 'CHR:POS:A0:A1' column, with no other lines.
I changed it to 0.0 instead of "0" and it ran through. Here are the results:
Before and after the change: | PC | DEFAULT | WITH CHANGE |
---|---|---|---|
PC1 | 23.60 | 23.66 | |
PC2 | -30.01 | -30.60 | |
PC3 | -0.00 | 0.22 | |
PC4 | 8.52 | 8.74 | |
PC5 | 0.37 | -0.10 | |
PC6 | 0.30 | 0.67 | |
PC7 | -1.00 | 0.19 | |
PC8 | -7.53 | -7.64 | |
PC9 | 1.29 | 0.87 | |
PC10 | -1.36 | -1.48 |
The random forest ancestry assignment probabilities were a bit more accurate after the change. The main issue of extreme percentile values has remained the same: a large percentage of the scores are very high (100.0) or low (0.0).
Changing geno_miss to 1.0, I got the same PC results, and the same issue with the extreme percentile scores.
Example for various scores, with --maf_target 0.0 and --geno_miss 1.0 set:
Here is an example of a normal result:
PGS002228_hmPOS_GRCh37
The below results seem to be off, and happen for most scores:
PGS000758_hmPOS_GRCh37
PGS000892_hmPOS_GRCh37
PGS000921_hmPOS_GRCh37
PGS002308_hmPOS_GRCh37
PGS002764_hmPOS_GRCh37
PGS002771_hmPOS_GRCh37
PGS003725_hmPOS_GRCh37
See #343 for parallel discussion. Have fixed the filters so that low MAF variants can still be included and the next update of the calculator will allow for these changes (and should revert back to the pre-beta behaviour if the filters are removed).
The new parameters are available in v2.0.0-beta.2 🥳
Description of the bug
On the most recent dev branch, I have noticed VCF files having incorrect ancestry results. That is, the PC values that are assigned are unusual and aren't near any other value in 1000G, and the ancestry assignment (closest population) is incorrect: the ancestry does not (not even close) to matching the self-reported ancestry of the samples, and previous runs on these samples have resulted in correct ancestry assignments (and normal PC values). I am using the run ancestry flag.
These are imputed VCFs. The match rate to the reference panel is about 1/3, and the match rate to the scorefile is just under 100%. The genome build used is correct.
I'm wondering if there are any ideas that I can explore to fix this or if this behavior has been seen before. Along with the failed ancestry analysis, the Z_norm2 values (which seem normal) are discrepant from the percentile in the most similar population values (which are at 100.0%, abnormally high).
I was able to replicate this across multiple VCFs, though each VCF was submitted one at a time.
Command used and terminal output
No response
Relevant files
No response
System information
Linux. -dev branch of pgsc_calc. Singularity.