dlesl / gb-io

A Rust library for parsing, writing and manipulating Genbank sequence files
MIT License
14 stars 5 forks source link

Allow disabling locus line truncation when writing records #1

Closed althonos closed 2 years ago

althonos commented 2 years ago

Hi David!

First of all let me thank you for this library, it has turned my GenBank parsing experience from painfully slow (1 hour with Biopython) to really fast (8 minutes with gb-io) when processing very large sequence predictions. I'm using it very often in read-only mode to analyze large-scale predictions.

However, I've not been very succesful with the writer: because I'm working on metagenomic data, it's not unusual for locus tags to be very long, and normally with Biopython I'd be able to write these (with a warning), but the formatter in gb-io will forcefully truncate them. This has resulted in quite painful bugs (very_long_id_143 becoming very_long_id_14 and all of a sudden i have plenty of duplicates) before I realized what was going on... Since most parser are eventually able to process these "long locus lines", I think it should be better if the library user (e.g. me) could choose to disable this, if they know that the files can be processed later by whatever parser they will read the GenBank with.

This PR adds a configurable SeqWriter that wraps a Write implementor, which can be configured with a builder pattern. I added flags to disable locus tag escaping and truncation, which are enabled by default for backwards compatibility.

dlesl commented 2 years ago

Hi Martin,

I'm glad you find the library useful and thanks for the PR! It looks good to me, but as you point out, the semi-silent truncation is a bit of a footgun so maybe it would be best to default to not truncating? It would break backwards compatibility as you say but I think that's worth it here.

From the release notes here these long identifiers are now legal so we can expect them appear in "official" sequences too:

3.4.4.1 : Important notice about parsing the LOCUS line

Users who process the data elements of the LOCUS line should use a token-based parsing approach rather than parsing its content based on fixed column positions.

Historically, the LOCUS line has had a fixed length and its elements have been presented at specific column positions. A complete description of the data fields and their typical column positions is provided in Section 3.4.4.2 . But with the anticipated increases in the lengths of accession numbers, and the advent of sequences that are gigabases long, maintaining the column positions will not always be possible and the overall length of the LOCUS line could exceed 79 characters.

If we were to make this the default, then there is also the question of whether we should warn as Biopython does. I'm tending towards no, given that it's now officially ok. OTOH doing the same as Biopython is always a good option since that's the behaviour most users are likely to expect.

BTW since it's being used I guess this project needs some love, I will look into setting up CI and fixing all the warnings the latest rust compiler emits as a first step!

althonos commented 2 years ago

Happy to hear this! Indeed, if you're okay with it, I guess it would be even better to disable the truncation by default. I'm not sure about the warning as well: the advantage is that Biopython warnings can be silenced (in the code using a warnings.catch_warnings context, or at the CLI level with the -W flag), but that's not the case with Rust warnings. In any case I'll rebase and update the truncation default to false.

BTW since it's being used I guess this project needs some love, I will look into setting up CI and fixing all the warnings the latest rust compiler emits as a first step!

Happy to hear this! I have some additional patches I can think about, so now that I know you are active I'll try to find the time to get them working.

dlesl commented 2 years ago

Happy to hear this! Indeed, if you're okay with it, I guess it would be even better to disable the truncation by default. I'm not sure about the warning as well: the advantage is that Biopython warnings can be silenced (in the code using a warnings.catch_warnings context, or at the CLI level with the -W flag), but that's not the case with Rust warnings. In any case I'll rebase and update the truncation default to false.

Given that the the long locus is now allowed by the spec, I'm starting to think we don't need any warnings here, so I went ahead and merged.

Happy to hear this! I have some additional patches I can think about, so now that I know you are active I'll try to find the time to get them working.

Great!