KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io

in check_samplesheet.py, incorporate more flexible encoding when reading #249

Closed shaneroesemann closed 2 years ago

shaneroesemann commented 2 years ago

name: in check_samplesheet.py, incorporate more flexible encoding when reading
about: prevent issues in reading .csv with non-utf-8 encoding


Current Behavior

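When preparing my sample_sheet.csv, I accidentally saved it with utf-8-sig encoding instead of plain utf-8. The utf-8-sig encoding prepends a byte order mark (BOM) to the file, and when the file is decoded as plain utf-8 the BOM is kept as the character '\ufeff' at the start of the first column header. The column headers therefore did not pass the check and my run was stopped. Some Windows text editors use utf-8-sig encoding by default.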

Steps to Reproduce


# create a file in utf-8-sig encoding (the BOM is written before the header)
file_in = "sample_sheet.csv"
with open(file_in, "w", encoding="utf-8-sig") as fh:
    fh.write("sample,assembly,fastq_1,fastq_2,coverage_tab,cov_from_assembly")

# then try to read it with plain utf-8 encoding
# (set explicitly here, because open()'s default encoding depends on the platform)
with open(file_in, "r", encoding="utf-8") as fh:
    header = fh.readline().strip()
    header_cols = [header_col.strip('"') for header_col in header.split(",")]

# output
['\ufeffsample',
 'assembly',
 'fastq_1',
 'fastq_2',
 'coverage_tab',
 'cov_from_assembly']
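
To illustrate why the header check fails, here is a minimal sketch that looks at the raw bytes. It does not use the actual code from check_samplesheet.py; `EXPECTED_FIRST_COLUMN` is a hypothetical name, and its value is taken from the header written above.

```python
import codecs

# Hypothetical constant for illustration; the value comes from the header above
EXPECTED_FIRST_COLUMN = "sample"

# Encoding with utf-8-sig prepends the UTF-8 byte order mark (bytes EF BB BF)
raw = "sample,assembly".encode("utf-8-sig")
has_bom = raw.startswith(codecs.BOM_UTF8)

# Decoding as plain utf-8 keeps the BOM as the character U+FEFF
first_col_plain = raw.decode("utf-8").split(",")[0]
# Decoding as utf-8-sig drops the BOM
first_col_sig = raw.decode("utf-8-sig").split(",")[0]

plain_matches = first_col_plain == EXPECTED_FIRST_COLUMN
sig_matches = first_col_sig == EXPECTED_FIRST_COLUMN
print(has_bom, repr(first_col_plain), plain_matches, sig_matches)
```

The first column only compares equal to the expected name when the BOM has been stripped during decoding.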

Expected Behavior

Reading the file with the utf-8-sig codec handles both cases: it strips a leading BOM when one is present and otherwise decodes the file as ordinary utf-8, so no functionality is lost.

Example:

with open(file_in, "r", encoding="utf-8-sig") as fh:
    header = fh.readline().strip()
    header_cols = [header_col.strip('"') for header_col in header.split(",")]
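
As a quick check of the proposal, the sketch below writes the same header once as utf-8 and once as utf-8-sig, then reads both back with the utf-8-sig reader. `read_header_cols` is a hypothetical helper written for this example, not a function that exists in check_samplesheet.py, and the temporary file names are made up.

```python
import os
import tempfile


def read_header_cols(path):
    """Hypothetical helper: read the first line of a CSV, tolerating a UTF-8 BOM."""
    with open(path, "r", encoding="utf-8-sig") as fh:
        header = fh.readline().strip()
    return [col.strip('"') for col in header.split(",")]


HEADER = "sample,assembly,fastq_1,fastq_2,coverage_tab,cov_from_assembly"

results = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for encoding in ("utf-8", "utf-8-sig"):
        # One sample sheet per encoding, with identical content
        path = os.path.join(tmpdir, f"sample_sheet_{encoding}.csv")
        with open(path, "w", encoding=encoding) as fh:
            fh.write(HEADER + "\n")
        results[encoding] = read_header_cols(path)

# Both encodings should yield the same, BOM-free column names
print(results["utf-8"] == results["utf-8-sig"])
```

Files in other encodings (for example utf-16) would still need separate handling; this change only covers utf-8 with or without a BOM.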