Closed bioinformaticaomicalabs closed 1 month ago
The score didn't meet the default variant overlap threshold:
pgscatalog.match.lib.matchresult: 2024-10-03 12:51:43 WARNING Score PGS000119_hmPOS_GRCh38 matching failed with match rate 0.40625
pgscatalog.core.lib.pgsexceptions.ZeroMatchesError: All scores fail to meet match threshold 0.75
Since this is WGS data you either need to add homozygous REF sites from the scoring file back into the VCF, or using --min_overlap 0
. Note those two sets of results will be different and each setting has implications.
Hello, thanks for the help. I have reviewed my VCF, and it contains the homozygous positions for the reference. On the other hand, I have been checking all the processing outputs and noticed that in match/_log.csv.gz, the table is missing the IDs for the variants that do not match for my samples. Could it be possible that variants are not being parsed correctly? log.csv
The variant IDs are missing from the log because no matching variant is found in your target genomes.
For example, the unmatched variant in PGS000119 at chromosome 16 / position 89919709 has effect allele T
and other allele C
.
If you want this variant to match your target genomes, then the corresponding variant in your VCF should have REF T
& ALT C
or REF C
& ALT T
. If your data aren't formatted this way then it's less likely that a matching variant will be found.
There are some other matching strategies that automatically happen which we describe in a supplement but the approach described above is the best and simplest method.
Thank you for the information, with that I managed to write a R code for completing the information of the vcf based on the model. Thank you!
[!CAUTION] Note from PGS Catalog team: we have not tested this method.
I share the code in case it can help someone:
library(dplyr)
library(data.table)
# Load the VCF file and the model.txt file
vcf_file = "UDBsencillo.vcf"
model_file = "PGS000119_hmPOS_GRCh38.txt"
# Read the VCF file
vcf = fread(vcf_file, header = TRUE, skip = "#CHROM")
# Read the model.txt file and remove lines starting with #
model = fread(model_file, header = TRUE) %>%
filter(substr(rsID, 1, 1) != "#")
# Function to find and replace information in the VCF based on model.txt
process_vcf_model = function(vcf, model) {
for (i in 1:nrow(model)) {
# Get information from model.txt
hm_pos = model$hm_pos[i]
effect_allele = model$effect_allele[i]
hm_inferOtherAllele = strsplit(model$hm_inferOtherAllele[i], "/")[[1]]
# Search for the position in the VCF
vcf_row = which(vcf$POS == hm_pos)
if (length(vcf_row) > 0) {
# If ALT is ".", replace using REF and effect_allele from model.txt
ref_allele = vcf$REF[vcf_row]
alt_allele = vcf$ALT[vcf_row]
if (alt_allele == ".") {
# Get the first allele that is not the effect_allele
other_allele = hm_inferOtherAllele[1]
if (ref_allele == effect_allele) {
vcf$ALT[vcf_row] = other_allele
} else {
vcf$ALT[vcf_row] = effect_allele
}
}
}
}
return(vcf)
}
# Process the VCF file
vcf_modified = process_vcf_model(vcf, model)
# Save the modified VCF file
fwrite(vcf_modified, "complete.vcf", sep = "\t")
Description of the bug
Hello, first of all, congratulations on creating such a comprehensive tool like pgsc_calc. I have been trying to set up the analysis and encountered an issue that I don't understand how to solve or where it might come from. I am setting up the polygenic risk score analysis for individual samples with WGS data, and I started with the PGS000119 model. I have followed all the steps, including those reported here https://github.com/PGScatalog/pgsc_calc/discussions/123 to prepare the WGS data, but the process fails at the match_combine step. These are the steps I followed for preparing the data:
gVCF generation for PGS000119 specific positions (included in ejemplo_coordenadas.bed file)
/home/jc-server/BIOINFORMATICA/SOFTWARE/gatk/gatk --java-options "-Xmx4g" HaplotypeCaller -R /home/jc-server/BIOINFORMATICA/reference_genome/reference_benchmarking_genome_in_a_bottle/human_GRCh38_no_alt_analysis_set.fasta -L ejemplo_coordenadas.bed -I /home/jc-server/BIOINFORMATICA/AUTOMATIZACION/GENOMA/GERMINAL/OUTPUT/TEST/UDB/mapping/UDB_r_groups.bam -O UDBejemplo.gvcf -ERC BP_RESOLUTION --dbsnp ../00-All-chr.vcf.gz
VCF generation
/home/jc-server/BIOINFORMATICA/SOFTWARE/gatk/gatk --java-options "-Xmx4g" GenotypeGVCFs -R /home/jc-server/BIOINFORMATICA/reference_genome/reference_benchmarking_genome_in_a_bottle/human_GRCh38_no_alt_analysis_set.fasta -V UDBejemplo.gvcf --dbsnp ../00-All-chr.vcf.gz --include-non-variant-sites true -O UDBejemplo.vcf
chr removal from vcf
sed 's/chr\([0-9XYM]\)/\1/g' UDBejemplo.vcf > vcf/UDBejemplo-nochr.vcf
Command used and terminal output
Relevant files
I attach the generated vcf, as I understand its format is correct for the analysis UDBejemplo-nochr.zip
System information
Nextflow version: 24.04.4.5917 Hardware: desktop Executor: local Container: docker OS: Linux pgsc_calc: 9bd9c431e7 (main)