bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 12-13: invalid continuation byte #2755

Closed thommohr closed 5 years ago

thommohr commented 5 years ago

Dear developers,

After an upgrade to 1.1.5a (the genome issue), running a pipeline that worked in 1.1.1 gave the following error:

[2019-04-06T17:32Z] multiprocessing: organize_samples
[2019-04-06T17:32Z] Using input YAML configuration: /srv/workspace/rprojects/tmohr/MEDUNI/TUMOR/SKCM/SIBILIA-SKCM0010/project_hg38/config/project_hg38.yaml
[2019-04-06T17:32Z] Checking sample YAML configuration: /srv/workspace/rprojects/tmohr/MEDUNI/TUMOR/SKCM/SIBILIA-SKCM0010/project_hg38/config/project_hg38.yaml
Traceback (most recent call last):
  File "/opt/bcbio/bin/bcbio_nextgen.py", line 238, in <module>
    main(**kwargs)
  File "/opt/bcbio/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 45, in run_main
    fc_dir, run_info_yaml)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 126, in variant2pipeline
    [x[0]["description"] for x in samples]]])
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(x) for x in items):
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 921, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/utils.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/distributed/multitasks.py", line 424, in organize_samples
    return run_info.organize(*args)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 61, in organize
    is_cwl=is_cwl, integrations=integrations)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 1025, in _run_info_from_yaml
    _check_sample_config(run_details, run_info_yaml, config)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 791, in _check_sample_config
    _check_quality_format(items)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 670, in _check_quality_format
    fastq_format = _detect_fastq_format(fastq_file)
  File "/opt/bcbio/anaconda/lib/python3.6/site-packages/bcbio/pipeline/run_info.py", line 629, in _detect_fastq_format
    for line in four:
  File "/opt/bcbio/anaconda/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 12-13: invalid continuation byte

Any idea what is happening? Best, and thanks for the help, Thomas

chapmanb commented 5 years ago

Thomas; Thank you for the report and apologies about the issues. The latest release uses Python 3, which is more careful about string encodings, and is complaining because your fastq file has some non-utf8 characters. I pushed a speculative fix which should resolve this if that's really the cause. The other potential issue might be that your input fastq is gzipped or otherwise compressed but the file extension does not match that and bcbio doesn't know. If that's the case, then adjusting the file names to match the compression will hopefully get things working cleanly. Hope one of these two gets your analysis finished.
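For anyone who wants to check the second possibility (compression not matching the file extension) before rerunning, here is a minimal sketch. It is not bcbio code: the helper names `sniff_compression` and `extension_matches` are hypothetical, and it only assumes the standard magic bytes for gzip (`1f 8b`) and bzip2 (`BZh`).

```python
import os

# Leading "magic" bytes that identify each compression format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(path):
    """Return 'gzip', 'bzip2', or None based on the file's first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(3)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return None


def extension_matches(path):
    """Return True if the file extension agrees with the sniffed compression.

    Hypothetical helper: any extension other than .gz or .bz2 is treated
    as a claim that the file is uncompressed.
    """
    expected = {".gz": "gzip", ".bz2": "bzip2"}.get(os.path.splitext(path)[1])
    return sniff_compression(path) == expected
```

A file for which `extension_matches` returns False (for example gzip data saved as `reads.fastq`) would be a candidate for renaming, as suggested above.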

thommohr commented 5 years ago

Thanks for your quick reply. I upgraded with the -u development option, but that did not resolve the issue. The files are bzip2 compressed, with the extension .fastq.bz2. The pipeline had no problems with version 1.1.1, so the compression should be OK. How does one force bcbio to recognize these files?

chapmanb commented 5 years ago

Thomas; Thanks much for following up with the additional details. This helped isolate the issue, which wasn't really a python3 problem but rather python3 exposing that we shouldn't have been trying to do the automated format detection with bzip2 input. I pushed a fix for this, so if you update one more time and retry I hope it will now work cleanly for you. Thank you again for the help debugging and please let us know if you run into any other issues.
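As a rough illustration of the idea behind the fix (skip automated detection for bzip2 input, and open other inputs in text mode with a matching opener), here is a hedged sketch. It is not the actual change pushed to bcbio: `sample_quality_lines` and its `max_records` parameter are hypothetical names, and the `.bz2` / `.gz` extension checks are assumptions about how input would be classified.

```python
import gzip
import itertools


def sample_quality_lines(path, max_records=100):
    """Return quality strings from the first FASTQ records.

    Returns None when detection is skipped (bzip2 input in this sketch).
    """
    if path.endswith(".bz2"):
        # Assumption: do not attempt format detection on bzip2 input.
        return None
    # Pick an opener that decompresses gzip input; plain files use open().
    opener = gzip.open if path.endswith(".gz") else open
    # errors="replace" avoids UnicodeDecodeError on stray non-UTF-8 bytes.
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        lines = itertools.islice(handle, 4 * max_records)
        # The fourth line of each four-line FASTQ record holds the qualities.
        return [line.rstrip("\n") for i, line in enumerate(lines) if i % 4 == 3]
```

The returned strings could then be inspected for their minimum character to guess the quality encoding; that step is left out here.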