PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a nextflow pipeline for polygenic score calculation
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

Error - "Duplicated chromosome entries detected in samplesheet. Check your samplesheet." #338

Open gurpreet-bioinfo opened 1 month ago

gurpreet-bioinfo commented 1 month ago

Description of the bug

Hi, I have added the paths to multiple VCFs in samplesheet.csv as input to the pipeline and left the chrom column empty as recommended ("If the target genomic data file contains multiple chromosomes, leave empty.").

However, I keep getting this error: Duplicated chromosome entries detected in samplesheet. Check your samplesheet.

Below is a section from my samplesheet.csv :

sampleset,path_prefix,chrom,format
proj1,/analysis/L12/vcf/L12, ,vcf
proj1,/analysis/L13/vcf/L13, ,vcf
proj1,/analysis/L14/vcf/L14, ,vcf
proj1,/analysis/L15/vcf/L15, ,vcf

Thanks.

Command used and terminal output

nextflow run pgscatalog/pgsc_calc -profile singularity \
    --input samplesheet.csv \
    --target_build GRCh38 \
    --pgs_id PGS001013,PGS001015 \
    --run_ancestry pgsc_HGDP+1kGP_v1.tar.zst \
    --outdir $PWD/results

Relevant files

No response

System information

Nextflow version: 23.10.1
Hardware: HPC
Executor: Slurm
Container Engine: Singularity
OS: Linux
pgsc_calc v2.0.0-beta-gccfd636

nebfield commented 1 month ago

The calculator works best with cohort data that have been imputed.

If you have one sample per VCF then you should merge your target genomes before using the calculator. Multiple rows in a samplesheet are for target genomes that have been split per chromosome.
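To illustrate why the samplesheet above triggers the error, here is a minimal Python sketch of a duplicate check. It is not the calculator's actual validation code: the function name `duplicated_chroms` and the per-chromosome example paths are hypothetical, and it assumes the check counts repeated (sampleset, chrom) pairs after stripping whitespace.

```python
import csv
import io
from collections import Counter


def duplicated_chroms(samplesheet_text):
    """Return {(sampleset, chrom): count} for pairs that appear more than once.

    Hypothetical helper: an assumption about how such a check could work,
    not the pipeline's own implementation.
    """
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    # Strip whitespace so " " and "" both count as "chromosome not set".
    counts = Counter(
        (row["sampleset"].strip(), row["chrom"].strip()) for row in reader
    )
    return {key: n for key, n in counts.items() if n > 1}


# One VCF per sample, chrom left blank: every row has the same (sampleset, chrom).
one_vcf_per_sample = """sampleset,path_prefix,chrom,format
proj1,/analysis/L12/vcf/L12, ,vcf
proj1,/analysis/L13/vcf/L13, ,vcf
"""

# One file per chromosome (paths are placeholders): each row is unique.
split_per_chromosome = """sampleset,path_prefix,chrom,format
proj1,path/to/chr1,1,vcf
proj1,path/to/chr2,2,vcf
"""

print(duplicated_chroms(one_vcf_per_sample))    # {('proj1', ''): 2}
print(duplicated_chroms(split_per_chromosome))  # {}
```

Under that assumption, several rows with the same sampleset and an empty chrom look like the same chromosome set listed more than once, which is why they are rejected.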

gurpreet-bioinfo commented 1 month ago

Thanks @nebfield! As per https://pgsc-calc.readthedocs.io/en/latest/how-to/prepare.html#vcf-from-wgs, does that mean I need to use plink2 to convert all my VCF files (each corresponding to WGS data from one patient)? That would be additional work that I did not expect from reading the documentation. In that case, what should my samplesheet.csv look like? I am sorry, but these points are not clear from the documentation.

nebfield commented 1 month ago

WGS data can cause variant matching problems with the current version of the calculator. The calculator works best with genotyping array data that have been imputed to increase variant density.

Some users have been able to create compatible VCFs from WGS data, but this requires some manual work to 1) create gVCFs from BAM files, 2) merge the gVCFs to create a multi-sample gVCF, and 3) include nonvariant sites in the gVCF.

If you're able to create a multi-sample VCF from the WGS data, your samplesheet would look like:

sampleset,path_prefix,chrom,format
merged,path/to/merged,,vcf
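
If you would rather generate the samplesheet than edit it by hand, a sketch like the following writes the same single row with a truly empty chrom field (no stray space). The helper name `samplesheet_text` is hypothetical, and `path/to/merged` is the placeholder prefix from the example above.

```python
import csv
import io

FIELDS = ["sampleset", "path_prefix", "chrom", "format"]


def samplesheet_text(rows):
    """Serialise a list of row dicts into samplesheet CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# A single merged multi-sample VCF: one row, chrom left empty.
merged = samplesheet_text(
    [{"sampleset": "merged", "path_prefix": "path/to/merged",
      "chrom": "", "format": "vcf"}]
)
print(merged, end="")
```

Write the returned text to samplesheet.csv and pass it with --input as in the original command.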