bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

bcbio-nextgen does not recognize bz2 compressed fastq files #2759

Closed by thommohr 5 years ago

thommohr commented 5 years ago

Dear developers,

Sorry for raising this again; this issue is related to issue #2755. I have checked that the changes described in the response to that issue (the patched utils.py) are indeed present. However, upon starting the pipeline we still get:

[2019-04-06T17:32Z] multiprocessing: organize_samples
[2019-04-06T17:32Z] Using input YAML configuration: /srv/workspace/rprojects/tmohr/MEDUNI/TUMOR/SKCM/SIBILIA-SKCM0010/project_hg38/config/project_hg38.yaml
[2019-04-06T17:32Z] Checking sample YAML configuration: /srv/workspace/rprojects/tmohr/MEDUNI/TUMOR/SKCM/SIBILIA-SKCM0010/project_hg38/config/project_hg38.yaml
Traceback (most recent call last):
  File "/opt/bcbio/bin/bcbio_nextgen.py", line 238, in <module>
    main(**kwargs)
  File "/opt/bcbio/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 45, in run_main
    fc_dir, run_info_yaml)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 126, in variant2pipeline
    [x[0]["description"] for x in samples]]])
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(x) for x in items):
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 921, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/utils.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multitasks.py", line 424, in organize_samples
    return run_info.organize(*args)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 61, in organize
    is_cwl=is_cwl, integrations=integrations)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 1025, in _run_info_from_yaml
    _check_sample_config(run_details, run_info_yaml, config)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 791, in _check_sample_config
    _check_quality_format(items)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 670, in _check_quality_format
    fastq_format = _detect_fastq_format(fastq_file)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 629, in _detect_fastq_format
    for line in four:
  File "/opt/bcbio/anaconda/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 12-13: invalid continuation

The previous version (1.1.1) worked like a charm. The fastq files are compressed with bzip2 and have the extension fastq.bz2. I guess that bcbio-nextgen, based on Python 3, has difficulty recognizing bz2-compressed files.

best, Thomas
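
Note for readers hitting the same error: the UnicodeDecodeError in the traceback above is consistent with compressed bytes being decoded as UTF-8 text. The following is a minimal Python 3 sketch of opening a possibly compressed FASTQ file in text mode and reading its first record. It is not bcbio's actual code or fix; the helper names `open_fastq_text` and `first_record` are hypothetical, and choosing the opener from the file's leading magic bytes is an assumption made for illustration.

```python
import bz2
import gzip
import itertools


def open_fastq_text(path):
    """Hypothetical helper (not part of bcbio): open a FASTQ file as text.

    The compression type is guessed from the file's first bytes, so the
    result does not depend on the file extension.
    """
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic == b"BZh":  # bzip2 stream header
        return bz2.open(path, "rt", encoding="utf-8")
    if magic[:2] == b"\x1f\x8b":  # gzip stream header
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def first_record(path):
    """Return the first four lines (one FASTQ record) without newlines."""
    with open_fastq_text(path) as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, 4)]
```

Calling `first_record("sample.fastq.bz2")` on a bzip2-compressed file (the filename here is only an example) returns decoded text lines, whereas a plain `open(path)` in text mode would try to decode the compressed stream itself.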

chapmanb commented 5 years ago

Thomas -- we just pushed an additional fix to #2755 that I hope will correctly resolve the issue. Are you still having problems after including that update (which is separate from the utils.py fix):

https://github.com/bcbio/bcbio-nextgen/commit/629d6fb0a6522d48a676c5d54bd636d6d7ffc2d5#diff-0cf3043ab7d23b5c690b55cd4e4bc6e7

Hope this one fixes it for you.

thommohr commented 5 years ago

Hi chapmanb, indeed, that has done the trick!

Very good work !!!!

thanks, Thomas

roryk commented 5 years ago

Thank you for following up!