Closed by xapple 1 year ago
GenBank files are supposed to contain only ASCII, but gb-io also supports Unicode, provided that the input is UTF-8. The problem with these files is that they are not valid UTF-8; they are Latin-1 encoded. You can convert the file with iconv -f latin1 -t utf8, or in Rust as in this example.
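The linked example is not reproduced in this thread, so here is a minimal sketch of the same idea, not the maintainer's code. It relies on the fact that every Latin-1 byte maps to the Unicode code point with the same value. The function name `latin1_to_utf8` is made up for illustration, and the sketch assumes the whole input really is Latin-1.

```rust
/// Decode Latin-1 (ISO-8859-1) bytes into a UTF-8 `String`.
/// Hypothetical helper, not part of gb-io.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    // Each Latin-1 byte is the Unicode code point of the same value,
    // so casting `u8` to `char` performs the decoding; collecting into
    // a `String` re-encodes the characters as UTF-8.
    bytes.iter().map(|&b| b as char).collect()
}
```

For instance, the Latin-1 byte 0xAE ("®") becomes the two UTF-8 bytes 0xC2 0xAE, while plain ASCII passes through unchanged. The resulting bytes can then be handed to the parser like any other UTF-8 input.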
What about an option to simply substitute with the Unicode Replacement character � and not raise an exception? I understand there is a spec and that NCBI is not following its own spec here, but I also feel that a GB parser should not choke on files being published by GenBank itself.
Yeah, in a way the files published by NCBI define the spec, so it does seem weird to choke on their files. However, I'm hesitant to work around it, since I think silently accepting these files would be the more surprising behaviour. BioPython made the same decision for their parser (discussion here).
So I think the cleanest solution here is for the library's users to preprocess the stream, removing/replacing the invalid characters as they desire.
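As a rough sketch of what that preprocessing could look like on the user's side, using only the Rust standard library: one helper replaces invalid UTF-8 with the Unicode Replacement character, the other drops every non-ASCII byte. The names `read_lossy` and `strip_non_ascii` are hypothetical, nothing here is gb-io API, and both helpers buffer the whole input in memory, which may not suit large GenBank files.

```rust
use std::io::{self, Read};

/// Read everything from `input` and replace each invalid UTF-8
/// sequence with U+FFFD (the Unicode Replacement character).
/// Hypothetical helper, not part of gb-io.
fn read_lossy<R: Read>(mut input: R) -> io::Result<String> {
    let mut raw = Vec::new();
    input.read_to_end(&mut raw)?;
    // `from_utf8_lossy` borrows when the input is already valid
    // UTF-8 and only allocates when a replacement is needed.
    Ok(String::from_utf8_lossy(&raw).into_owned())
}

/// Alternative: drop every non-ASCII byte, so the output is
/// guaranteed to be plain ASCII.
fn strip_non_ascii(raw: &[u8]) -> Vec<u8> {
    raw.iter().copied().filter(|b| b.is_ascii()).collect()
}
```

Which of the two fits depends on whether you would rather see where a character was lost (replacement) or keep the output strictly ASCII (removal). Either result can then be passed to the parser as an ordinary byte stream.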
So I went ahead and tested this library by parsing all GenBank entries from:
The parser fails and raises an Exception on two entries out of >200000000 because of an "®" character.
Do you think one could fix this? Thanks!
https://ftp.ncbi.nlm.nih.gov/genbank/gbenv56.seq.gz
https://ftp.ncbi.nlm.nih.gov/genbank/gbbct902.seq.gz