I have constructed a conda environment following the README instructions, but I am running into a UnicodeDecodeError that I can't seem to get past. Researching the error and reviewing the full traceback (below) indicate that it is most likely due to a problem with trying to read the reference FASTA as a gzip file; however, I have double-checked all of my input files, and none of them are gzipped, including the reference FASTA. Additionally, I constructed a second conda environment based on Python 3.8 to determine whether it could be a versioning problem, but I get the same error. Deliberately inputting a newly gzipped reference FASTA also produces this error regardless of the environment used.
Any support you may have on how to remedy this issue is greatly appreciated!
Traceback (most recent call last):
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/bin/cnvkit.py", line 10, in <module>
    sys.exit(main())
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/site-packages/cnvlib/cnvkit.py", line 10, in main
    args.func(args)
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/site-packages/cnvlib/commands.py", line 150, in _cmd_batch
    args.reference, args.targets, args.antitargets = batch.batch_make_reference(
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/site-packages/cnvlib/batch.py", line 94, in batch_make_reference
    access_arr = access.do_access(fasta)
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/site-packages/cnvlib/access.py", line 19, in do_access
    access_regions = GA.from_rows(fa_regions)
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/site-packages/skgenome/gary.py", line 90, in from_rows
    table = pd.DataFrame.from_records(rows, columns=columns)
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/site-packages/pandas/core/frame.py", line 2225, in from_records
    first_row = next(data)
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/site-packages/cnvlib/access.py", line 36, in <genexpr>
    return (tup for tup in region_tups if is_canonical_contig_name(tup[0]))
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/site-packages/cnvlib/access.py", line 43, in get_regions
    for line in infile:
  File "/home/groups/Spellmandata/chiotti/tools/conda_envs/cnvkit/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
CNVkit's FASTA reader doesn't appear to handle gzipped input automatically, so the input should be uncompressed, as you had it.
The default text mode that it reads in expects UTF-8 encoding. If your FASTA file is encoded as, e.g., Latin-1 instead, that can cause the same kind of UnicodeDecodeError.
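To tell the two cases apart, it may help to look at the raw bytes of the exact file path passed to CNVkit. A gzip stream starts with the bytes 0x1f 0x8b, and since 0x1f is valid ASCII, a UTF-8 decoder would first fail on 0x8b at position 1, which matches the error in the traceback above. The following is only a minimal sketch using the Python standard library; `diagnose_fasta` is a hypothetical helper written for this thread, not part of CNVkit.

```python
import codecs

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def diagnose_fasta(path, chunk_size=1 << 20):
    """Classify a file as 'gzip', 'non-utf8', or 'ok'.

    Hypothetical helper for illustration; not a CNVkit function.
    """
    with open(path, "rb") as handle:
        # Check the magic number rather than trusting the file extension.
        if handle.read(2) == GZIP_MAGIC:
            return "gzip"
        handle.seek(0)
        # Decode incrementally so a large genome is never held in memory
        # and multi-byte characters split across chunks are handled.
        decoder = codecs.getincrementaldecoder("utf-8")()
        try:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                decoder.decode(chunk)
            decoder.decode(b"", final=True)
        except UnicodeDecodeError:
            return "non-utf8"
    return "ok"
```

Calling `diagnose_fasta` on the reference path would return "gzip" if the file is compressed despite its name (for example, a `.fa` file that was never actually decompressed, or a symlink pointing at a `.gz`), and "non-utf8" if the encoding is the issue instead.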