Several GenBank entries raise RuntimeError due to Unicode characters

dlesl / gb-io

A Rust library for parsing, writing and manipulating Genbank sequence files

MIT License


Several GenBank entries raise RuntimeError due to Unicode characters #5

Closed xapple closed 1 year ago

xapple commented 1 year ago

So I went ahead and tested this library by parsing all GenBank entries from:

https://ftp.ncbi.nlm.nih.gov/genbank/

The parser fails and raises an exception on two entries out of more than 200,000,000 because of an "®" character.

Do you think one could fix this? Thanks!

dlesl commented 1 year ago

GenBank files are supposed to contain only ASCII, but gb-io also supports Unicode, provided that the input is UTF-8. The problem with these files is that they are not valid UTF-8; they are encoded as Latin-1. You can convert the file with iconv -f latin1 -t utf8, or in Rust as in this example.
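For readers without the linked example at hand, here is a minimal sketch of the Latin-1 to UTF-8 conversion in plain Rust, using only the standard library. It relies on the fact that every ISO-8859-1 byte value equals the Unicode code point of the character it encodes. The function name latin1_to_utf8 is made up for illustration and is not part of gb-io.

```rust
/// Decode ISO-8859-1 (Latin-1) bytes into a UTF-8 `String`.
///
/// Hypothetical helper, not part of gb-io. Each Latin-1 byte maps to the
/// Unicode code point with the same numeric value, so a `u8 -> char` cast
/// is a complete decoder for this encoding.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

The resulting String is valid UTF-8 (the byte 0xAE becomes "®", U+00AE), so its bytes can then be handed to the parser. Note that this is only correct if the input really is Latin-1; running it on a file that is already UTF-8 would garble any multi-byte characters.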

xapple commented 1 year ago

What about an option to simply substitute with the Unicode Replacement character � and not raise an exception? I understand there is a spec and that NCBI is not following its own spec here, but I also feel that a GB parser should not choke on files being published by GenBank itself.

dlesl commented 1 year ago

Yeah, in a way the files published by NCBI define the spec, so it does seem weird to choke on their files. However, I'm hesitant to work around it, since I think silently accepting these files would be the more surprising behaviour. Biopython made the same decision for their parser (discussion here).

So I think the cleanest solution here is for the library's users to preprocess the stream, removing/replacing the invalid characters as they desire.
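As a rough sketch of what such preprocessing could look like on the user's side, the two helpers below use only the Rust standard library. The names read_lossy and strip_non_ascii are hypothetical and not part of gb-io. The first replaces every invalid UTF-8 sequence with U+FFFD, the second drops every non-ASCII byte.

```rust
use std::io::{self, Read};

/// Read the whole input and replace each invalid UTF-8 sequence with
/// U+FFFD (the Unicode replacement character).
/// Hypothetical helper, not part of gb-io. It buffers the entire input in
/// memory, which is an assumption that may not suit very large files.
fn read_lossy<R: Read>(mut input: R) -> io::Result<String> {
    let mut raw = Vec::new();
    input.read_to_end(&mut raw)?;
    Ok(String::from_utf8_lossy(&raw).into_owned())
}

/// Alternative: remove every byte outside the ASCII range, leaving
/// pure-ASCII output. Hypothetical helper, not part of gb-io.
fn strip_non_ascii(bytes: &[u8]) -> Vec<u8> {
    bytes.iter().copied().filter(|b| b.is_ascii()).collect()
}
```

Either result is valid UTF-8 and can be passed to the parser. Which one to use is the caller's decision: the lossy variant keeps a visible marker where data was lost, while stripping shortens the affected line.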