Closed csjohnson23 closed 3 months ago
Thanks for the bug report. Could you test with the development branch please?
```
$ nextflow run pgscatalog/pgsc_calc -r dev -latest \
    -profile singularity \
    --input samplesheet.csv \
    --pgs_id PGS004725 \
    --target_build GRCh37 \
    -c /PHShome/cj773/configs/pgs_calc_test.config
```
Some VCFs are causing unexpected columns to be written to the variant information files. I just updated this branch to write a consistent column set which should hopefully help.
You'll also need to delete the work directory to remove the cache.
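For example, from the launch directory (assuming the default work directory location):

```shell
# Remove the Nextflow work directory so cached (stale) task results are discarded.
# Assumes the pipeline was launched from the current directory.
rm -rf work
```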
Tried it a few times, but the dev branch consistently breaks on DOWNLOAD_SCOREFILES with a socket timeout. I tried submitting to a couple of different partitions to make sure it's not a cluster issue, and kept everything else the same. This step passes on the non-dev branch.
The error is as follows:

```
[b1/d738c6] Submitted process > PGSCATALOG_PGSCCALC:PGSCCALC:DOWNLOAD_SCOREFILES ([pgs_id:PGS004725, pgp_id:, trait_efo:])
WARN: [SLURM] queue (normal) status cannot be fetched
```
There are no changes to the download process on the dev branch, which is typically reliable. Socket timeouts are more likely to be a cluster problem, or perhaps a temporary network issue.
If the problem is consistent, perhaps you could use `pgscatalog-download` to preload scoring files? You could then use the `--scorefile` parameter to use the downloaded scoring files. You can install the `pgscatalog.core` package with pip or bioconda.
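As a sketch (option names are assumptions based on the pgscatalog.core CLI; check `pgscatalog-download --help` for your installed version):

```shell
# Install the CLI (pip shown here; bioconda also works)
pip install pgscatalog.core

# Preload the scoring file outside the pipeline, then point the pipeline at it
pgscatalog-download --pgs PGS004725 --build GRCh37 --outdir downloads

nextflow run pgscatalog/pgsc_calc -r dev -latest \
    -profile singularity \
    --input samplesheet.csv \
    --scorefile downloads/PGS004725_hmPOS_GRCh37.txt.gz \
    --target_build GRCh37
```

This avoids the download step entirely, so cluster-side network flakiness no longer blocks the run.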
I also got this error.

> Some VCFs are causing unexpected columns to be written to the variant information files. I just updated this branch to write a consistent column set which should hopefully help.

The VCF I am testing with has run successfully before. I tried the development version, which fixed the error, and the pipeline completed successfully.
```
pgscatalog.match.cli.match_cli: 2024-07-02 18:44:40 DEBUG Verbose logging enabled
pgscatalog.match.cli.match_cli: 2024-07-02 18:44:40 INFO --cleanup set (default), temporary files will be deleted
pgscatalog.match.lib.scoringfileframe: 2024-07-02 18:44:40 DEBUG Converting ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) to feather format
pgscatalog.match.lib.scoringfileframe: 2024-07-02 18:44:41 DEBUG ScoringFileFrame(NormalisedScoringFile('scorefiles.txt.gz')) feather conversion complete
pgscatalog.match.lib._match.preprocess: 2024-07-02 18:44:41 DEBUG Complementing column effect_allele
pgscatalog.match.lib._match.preprocess: 2024-07-02 18:44:41 DEBUG Complementing column other_allele
pgscatalog.match.lib.variantframe: 2024-07-02 18:44:41 DEBUG Converting VariantFrame(path='GRCh37_newautosomal_ALL.pvar.zst', dataset='newautosomal', chrom=None, cleanup=True, tmpdir=PosixPath('tmp')) to feather format
Traceback (most recent call last):
File "/app/pgscatalog.utils/.venv/bin/pgscatalog-match", line 8, in <module>
sys.exit(run_match())
^^^^^^^^^^^
File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/match_cli.py", line 87, in run_match
ipc_path = get_match_candidates(
^^^^^^^^^^^^^^^^^^^^^
File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/match_cli.py", line 124, in get_match_candidates
with variants as target_df:
File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/variantframe.py", line 54, in __enter__
self.arrowpaths = loose(self.variants, tmpdir=self._tmpdir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/functools.py", line 909, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/_arrow.py", line 94, in _
return batch_read(reader, tmpdir=tmpdir, cols_keep=cols_keep)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/lib/_arrow.py", line 102, in batch_read
batches = reader.next_batches(batch_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/polars/io/csv/batched_reader.py", line 134, in next_batches
batches = self._reader.next_batches(n)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: found more fields than defined in 'Schema'
Consider setting 'truncate_ragged_lines=True'.
```
I still haven't made it to MATCH_VARIANTS on the dev branch to test the solution; the pipeline gets stuck even when using downloaded files.
Also, when I try to run
```
nextflow run pgscatalog/pgsc_calc \
    -profile singularity \
    --input samplesheet.csv \
    --scorefile downloads/PGS004725_hmPOS_GRCh37.txt.gz \
    --target_build GRCh37 \
    -c ~/configs/pgs_calc_test_slurm.config
```
it no longer runs; I get this error:

```
Project pgscatalog/pgsc_calc is currently stuck on revision: dev -- you need to explicitly specify a revision with the option `-r` in order to use it
```
Is there a way I could 1) update my input VCFs to match the expected columns for the published v2 package, and 2) circumvent the error above so I can use the published version again?
@csjohnson23, you need to add `-r dev` to your command; that should fix the error and let it run. If you want to edit your VCFs, you can truncate them to only have standard pvar columns (https://www.cog-genomics.org/plink/2.0/formats#pvar).
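For example, a hypothetical awk sketch for a tab-separated plink2 .pvar file (placeholder filenames): keep only the five standard columns and pass `##` metadata lines through unchanged.

```shell
# Keep only #CHROM, POS, ID, REF, ALT; print ## metadata lines unchanged.
# input.pvar / truncated.pvar are placeholder filenames.
awk 'BEGIN { FS = OFS = "\t" } /^##/ { print; next } { print $1, $2, $3, $4, $5 }' \
    input.pvar > truncated.pvar
```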
After the latest updates of the pipeline, I started to have the exact same problem using docker and conda, without any change in my data (BED, PGEN, or VCF).
@SalemWerdyani, yes, it's because we slightly changed the way we read the variant files. The current dev branch (adding `-r dev`) should solve the problem for now if you don't want to truncate your VCF data columns.
I saw the earlier comments and tried adding `-r dev`, but it did not work.
Could you let us know what the error was? You may also have to add `-latest` if your copy of the repo is out of date.
Thanks, Sam, for the help and support. It works now.
I've merged the fix now:

```
$ nextflow run pgscatalog/pgsc_calc -latest ...
```

should work OK, but the specific release `-r v2.0.0-beta` will remain broken until we do a new patch.
Thanks everybody for reporting their experiences 😄
Description of the bug
I'm running the Singularity version of the tool on a CentOS 7 Linux cluster. Every time I attempt to run it, whether directly by calling "nextflow run..." or by submitting to the cluster, the pipeline fails in the MATCH_VARIANTS step. Specifically, it always fails at the point of "Converting VariantFrame(path='GRCh37_PROFILE2024_20.pvar.zst', dataset='PROFILE2024', chrom='20', cleanup=True, tmpdir=PosixPath('/tmp/nxf.jGbU3cW4rQ/tmp')) to feather format". I get the following error: "polars.exceptions.ComputeError: found more fields than defined in 'Schema'".
So far, I've tried switching the input type (pfile vs VCF), switching the job scheduler (SLURM vs LSF), and changing the sample sheet to be one combined file vs files split by chromosome, all with the same result. I always remove the work and results directories before re-testing. The tests included with the pipeline run perfectly. I have a feeling it may have something to do with temporary files not being handled as expected by my HPC.
I'd appreciate any help in debugging this error! Especially if there are any options I should set that might change how files in the matchtmp directory are handled.
The samplesheet, config script, and run_pgscalc script are attached.
Command used and terminal output
Relevant files
- Samplesheet: pfile_samplesheet.csv
- Config file:
- run_pgscalc_test.sh:
- .nextflow.log output: nextflow.log
System information
- Nextflow version: 24.04.2
- Hardware: HPC
- Executor: slurm, lsf
- Container engine: Singularity
- OS: CentOS 7 Linux
- Version of pgsc_calc: v2.0.0-beta-gccfd636