dlesl / gb-io

A Rust library for parsing, writing and manipulating Genbank sequence files
MIT License
14 stars 5 forks source link

Panic when calling `gb_io::seq::Seq::set_origin`: `assertion failed: len == self.seq.len()` #9

Open J-Wall opened 1 month ago

J-Wall commented 1 month ago

Thanks for the overall great library for working with an... interesting... file format.

I ran into a failed assertion when I was trying to rotate records from refseq mitochondria. Turns out NCBI don't actually include the sequences in that file anymore. The CONTIG field now resembles something like a location, but with some external identifier as the sequence... As a result, the record's .len() is some positive number (i.e. 63999 for NW_017264706.1) but .seq.len() is 0.

The change in what's included in the refseq mitogenome genbank files notwithstanding, I think this behavior is a bug. At the very least an Err should be returned, especially since one might like to call .set_origin() on a Seq that's been modified after it's been parsed from a file (and therefore after the validation occurs, as I understand it). Or perhaps it should even succeed if the sequence data is empty (i.e. just rotate the features for an annotation-only genbank record).

Handling this type of CONTIG field is probably a separate issue...

dlesl commented 1 month ago

Thanks for the overall great library for working with an... interesting... file format.

Thanks!

Turns out NCBI don't actually include the sequences in that file anymore. The CONTIG field now resembles something like a location, but with some external identifier as the sequence.

I did a bit of digging and it looks like these records belong to the CON division as described in the Genbank README.

Sequence records in the files of the CON division (gbcon*.seq) are unusual in that they contain no sequence data at all. Rather, records in these files contain instructions for the construction of larger-scale objects from 'traditional' GenBank records.

My initial reaction here is that since the features are annotations on a sequence, they probably don't mean so much without that sequence? It sounds like you might have a use-case where that's not the case though? We could consider removing the validation in set_origin. As a temporary workaround you could just create an appropriately sized Vec of 0s?

I suppose the correct solution would be parsing the CONTIG field and manipulating it in the same way we do the features

As you mentioned, it seems like it works very similarly to feature locations except that it's always wrapped in an outer join, so we maybe we could reuse the existing parsers and logic for relocation/rotation.

J-Wall commented 1 month ago

Thanks for the quick reply.

It sounds like you might have a use-case where that's not the case though?

Actually in my use case I'm very much interested in the sequence, so it's mostly annoying that refseq doesn't include them in the records directly (interestingly they are happy to include the amino acid sequences on the gene features). If your interested, my use case calls for reference sequences which are all rotated to the same gene, hence my need for both the sequences and annotations.

We could consider removing the validation in set_origin.

I actually dont mind the validation here, the reason I raised the issue is mostly just that I don't think panicking is the right behaviour in this sort of case. Of course if you return a Result instead then it's an API change, although probably worth it imo. Maybe could remove the assert, then add a separate method like set_origin_checked which returns a Result to avoid backward incompatability?

As a temporary workaround you could just create an appropriately sized Vec of 0s?

Ultimately what I did was just download the .fna.gz file which is alongside the .gbff.gz file in refseq, and has the same sequences in the same order in fasta format (but includes all the actual sequences, unlike the .gbff.gz). So I can mutably replace Seq.seq with the sequence obtained from the fasta before calling set_origin. I think this is what I'm going to have to do regardless of what you decide to do with this issue.

dlesl commented 1 month ago

I actually dont mind the validation here, the reason I raised the issue is mostly just that I don't think panicking is the right behaviour in this sort of case. Of course if you return a Result instead then it's an API change, although probably worth it imo. Maybe could remove the assert, then add a separate method like set_origin_checked which returns a Result to avoid backward incompatability?

True, I need to think about this a bit. In a way, the Seq struct we're calling set_origin on is invalid. But since all its fields are public the library isn't designed to prevent the existence of an invalid Seq (not to mention that the parser will happily create one). So yea maybe returning Result makes sense.

As a temporary workaround you could just create an appropriately sized Vec of 0s?

Ultimately what I did was just download the .fna.gz file which is alongside the .gbff.gz file in refseq, and has the same sequences in the same order in fasta format (but includes all the actual sequences, unlike the .gbff.gz). So I can mutably replace Seq.seq with the sequence obtained from the fasta before calling set_origin. I think this is what I'm going to have to do regardless of what you decide to do with this issue.

If we parsed the CONTIG line, we could perhaps provide a function like Seq::hydrate(fetcher: impl FnMut(&str) -> Result<&[u8], Err>) (similar to Location::extract_location_with_fetcher). Then all you would need to do is provide a function that knows how to get the sequences (eg. from entrez efetch).

BTW in the case of your specific problem, efetch in gbwithparts mode might be a solution? eg. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NW_017264706.1&rettype=gbwithparts&retmode=text