Add encoding option for CravatFile

KarchinLab / open-cravat

A modular annotation tool for genomic variants

MIT License

113 stars 27 forks source link

Add encoding option for CravatFile #141

Closed skanderson closed 1 year ago

skanderson commented 1 year ago

Optional encoding parameter to specify character encoding for files that we wrote to prevent detection errors with chardet

skanderson commented 1 year ago

Fix for #47

Temporary files had a chance of being read with the wrong character encoding if only a small number of unicode characters are present. chardet is being used to determine character encoding, but it can detect incorrectly if there is not enough information. Since we're writing the temp files ourselves with 'utf-8' encoding, we can know the encoding for certain when we read it back to complete the annotation.

kmoad commented 1 year ago

Since CravatWriter defaults to utf-8, I think we can default to reading utf-8 in CravatReader. Instead of defaulting to chardet, and setting utf-8 in aggregator.py. This would fix encoding issues for other uses of CravatReader as well.

The assumption that encoding was unknown probably dates back to a time when the .var files could stay on disk and be reused. Made sense to force writer to utf-8 but let reader use chardet for backwards compatability.

skanderson commented 1 year ago

@kmoad I have updated the branch to set the default encoding to 'utf-8' for CravatReader. With this change, any CravatReader that reads a file from an unknown source should be defined with encoding=None to allow chardet to detect the encoding. Defaulting to 'utf-8' will work smoothly with any with any file that we generate with CravatWriter.