Error in Dseqrecord.format

manulera commented 4 months ago

For long sequence names, the name "eats up" the rest of the LOCUS line. A minimal example is below.

I will fix this.

from pydna.parsers import parse as pydna_parse
from pydna.dseqrecord import Dseqrecord
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

for cls in [Dseqrecord, SeqRecord]:
    print('using', cls.__name__)
    print('================================')
    dseqr = cls(Seq('ACGTA'))
    for i in range(30, 35):
        dseqr.name = 'a' * i
        dseqr.annotations['molecule_type'] = ['DNA']
        str_seq = dseqr.format('genbank')
        print(str_seq.split('\n')[0])
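        try:
            # Round-trip check: the GenBank text should parse back
            dseqr2 = pydna_parse(str_seq)
        except Exception:
            # Stop at the first name length whose output no longer parses
            print(f'Error at {i}')
            break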

This prints the output below. Note how, in the LOCUS line, the molecule type (DNA) collides with the topology (linear) as the name gets longer, until the molecule type is gone entirely:

using Dseqrecord
================================
/Users/Manu/Documents/Projects/ShareYourCloning/ShareYourCloning_backend/.venv/lib/python3.11/site-packages/Bio/SeqIO/InsdcIO.py:727: BiopythonWarning: Increasing length of locus line to allow long name. This will result in fields that are not in usual positions.
  warnings.warn(
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    DNA linear       UNK 01-JAN-1980
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    DNAlinear        UNK 01-JAN-1980
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    DNlinear         UNK 01-JAN-1980
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    Dlinear          UNK 01-JAN-1980
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    linear           UNK 01-JAN-1980
Error at 34
using SeqRecord
================================
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    DNA              UNK 01-JAN-1980
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    DNA              UNK 01-JAN-1980
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    DNA              UNK 01-JAN-1980
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    DNA              UNK 01-JAN-1980
LOCUS       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 5 bp    DNA              UNK 01-JAN-1980
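A quick way to spot the symptom in the output above, without relying on column positions, is to split the LOCUS line on whitespace and check that the molecule type and the topology are still separate tokens. This is only a rough sketch: `locus_fields_intact` is a hypothetical helper, not part of pydna or Biopython, and the sample lines are rebuilt from the output printed above.

```python
def locus_fields_intact(line, molecule_type="DNA", topology="linear"):
    """Return True if both fields appear as separate tokens in a LOCUS line.

    Hypothetical helper for illustration only; it checks whitespace-separated
    tokens and does not validate GenBank column positions.
    """
    tokens = line.split()
    return molecule_type in tokens and topology in tokens


# Lines rebuilt from the output above (names of 30, 31 and 34 characters)
ok_line = f"LOCUS       {'a' * 30} 5 bp    DNA linear       UNK 01-JAN-1980"
fused_line = f"LOCUS       {'a' * 31} 5 bp    DNAlinear        UNK 01-JAN-1980"
lost_line = f"LOCUS       {'a' * 34} 5 bp    linear           UNK 01-JAN-1980"

print(locus_fields_intact(ok_line))     # True
print(locus_fields_intact(fused_line))  # False: "DNAlinear" is one token
print(locus_fields_intact(lost_line))   # False: molecule type is missing
```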

PS: @hiyama341 a heads-up in case this affects your libraries.

BjornFJohansson commented 4 months ago

The offending code is in the Dseqrecord.format method:

https://github.com/BjornFJohansson/pydna/blob/989e76eec0e46775aac7df1cab7c957834117608/src/pydna/dseqrecord.py#L509

manulera commented 4 months ago

Thanks @BjornFJohansson, fixed with #239. Before saving, I make a copy of the Dseqrecord, set the topology of the copy from the Dseqrecord.circular property, and use Biopython's format method from SeqRecord.
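For readers who want to see the idea, here is a minimal sketch of that approach using only Biopython. It is not the code from #239: `format_genbank_with_topology` is a hypothetical helper, and the `circular` flag is passed in as a plain argument to stand in for the Dseqrecord.circular property.

```python
import copy

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def format_genbank_with_topology(record, circular):
    """Format a record as GenBank, letting Biopython lay out the LOCUS line.

    Hypothetical helper for illustration only. The topology is stored as an
    annotation on a copy, so the caller's record is left unchanged.
    """
    rec = copy.deepcopy(record)
    rec.annotations["topology"] = "circular" if circular else "linear"
    # Delegate to Biopython's own GenBank writer
    return SeqRecord.format(rec, "genbank")


record = SeqRecord(Seq("ACGTA"), name="a" * 34)
record.annotations["molecule_type"] = "DNA"

# Biopython still warns about the long name, but the fields stay separate
locus_line = format_genbank_with_topology(record, circular=False).split("\n")[0]
print(locus_line)
```

Because the topology goes through the annotations, the writer places it as its own field instead of having it spliced into a fixed position on an already-lengthened line.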

BjornFJohansson commented 4 months ago

done!?

manulera commented 4 months ago

Yes!

BjornFJohansson / pydna

Error in Dseqrecord.format #238