PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a nextflow pipeline for polygenic score calculation
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

ValueError: could not broadcast input array from shape (3065,) into shape (2777,) #325

Open cwarlysolsberg opened 3 weeks ago

cwarlysolsberg commented 3 weeks ago

Description of the bug

I have run pgsc_calc many times on different files, but this is the only time I have run it with both liftover and run_ancestry. Without run_ancestry it runs fine. With run_ancestry it fails at fraposa.py.

Join mismatch for the following entries: key=[chrom:ALL, n:0, effect_type:additive] values=[]

Loading study data...
Traceback (most recent call last):
  File "/venv/bin/fraposa", line 8, in <module>
    sys.exit(main())
  File "/venv/lib/python3.10/site-packages/fraposa_pgsc/fraposa_runner.py", line 56, in main
    fp.pca(ref_filepref=ref_filepref, stu_filepref=stu_filepref, stu_filt_iid=stu_filt_iid, out_filepref=out_filepref,
  File "/venv/lib/python3.10/site-packages/fraposa_pgsc/fraposa.py", line 520, in pca
    W, W_bim, W_fam = read_bed(stu_filepref, dtype=np.int8, filt_iid=stu_filt_iid)
  File "/venv/lib/python3.10/site-packages/fraposa_pgsc/fraposa.py", line 148, in read_bed
    bed[i,:] = genotypes[i_extract]
ValueError: could not broadcast input array from shape (3065,) into shape (2777,)

Command used and terminal output

sudo nextflow run pgscatalog/pgsc_calc -r ccfd6367d55eee6d81c36541248d757ebacf6c7e -profile docker \
    --input $path/samplesheet.csv \
    --scorefile $path/scorefile_reformatted.txt \
    --liftover \
    --target_build GRCh38 \
    --hg19_chain $path/hg19ToHg38.over.chain.gz \
    --hg38_chain $path//hg38ToHg19.over.chain.gz \
    --run_ancestry $path/pgsc_HGDP+1kGP_v1.tar.zst \
    --outdir $path/results

Relevant files

No response

System information

No response

nebfield commented 3 weeks ago

Thanks for the bug report 😄 This is a strange problem that I can't reproduce. It might be caused by the cache if you've successfully run pgsc_calc a lot before. Could you try rm -r work before retrying?

cwarlysolsberg commented 2 weeks ago

Yes, I have removed work and results before retrying. Not sure if this matters, but I also get a file named GRCh38_out_oriented_out_splitfamab.pcs, which is weird because in the past I've always seen GRCh38_out_oriented_out_splitfamaa.pcs

I have also tried running this on numerous releases just to make sure it's not an issue in the newest release (getting the same error)

nebfield commented 2 weeks ago

Thanks for the extra details - the different file name is interesting. Could you please attach the .nextflow.log file from a broken run? The log gets created in the same directory where you run the workflow - it just contains metadata about the state of the workflow.

cwarlysolsberg commented 2 weeks ago

Attached is the log file. nextflow (4).log

cwarlysolsberg commented 2 weeks ago

I figured it out. I had combined a bunch of cohorts, and some of them used the same IDs. For some reason this error happened because of the duplicate IDs. Maybe just put in a simple FAM check to confirm there are no duplicate IDs before moving forward, haha. Everything else ran fine because other packages consider both FID and IID, so I just got an ambiguous error. I was finally able to figure it out using the pygsc with some error handling in the .py code.
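For anyone hitting the same thing, here is a minimal sketch of the kind of FAM check I mean. It is not part of pgsc_calc or fraposa. The function name find_duplicate_iids is made up for illustration. It only assumes the standard whitespace-delimited PLINK .fam layout, where the first column is FID and the second is IID, and it counts IIDs on their own (ignoring FID), since that seems to be what tripped up the ancestry step here.

```python
from collections import Counter


def find_duplicate_iids(fam_lines):
    """Return {IID: count} for every IID seen more than once.

    fam_lines is any iterable of lines from a PLINK .fam file
    (hypothetical helper, not part of pgsc_calc or fraposa).
    """
    counts = Counter()
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1  # column 1 is FID, column 2 is IID
    return {iid: n for iid, n in counts.items() if n > 1}


# Toy example: two cohorts that both contain a sample called "s1".
example = [
    "cohortA s1 0 0 1 -9",
    "cohortA s2 0 0 2 -9",
    "cohortB s1 0 0 2 -9",
]
duplicates = find_duplicate_iids(example)
print(duplicates)
```

On a real dataset you would pass an open file handle instead of the toy list, e.g. with open("target.fam") as fam: find_duplicate_iids(fam), where target.fam is a placeholder file name. An empty result means every IID is unique.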

nebfield commented 2 weeks ago

Thanks for debugging! I was quite confused 😅