Check behaviour of spaces when creating `Dseq` objects from a text representation

dgruano commented 3 weeks ago

Related to #321

While testing single-stranded restriction products, I also tried to create them from a text representation:

Dseq.from_representation("""\

    CCGAATTAAT
    """)

I find it funky that the overhang of the sequence depends on the number of spaces present on the watson line, and that the length does too when there are more (or fewer) spaces than the crick strand spans. This is a problem for testing. Here are some examples that may be important to consider:

No spaces in first line -> Sequence considered as `watson`

Dseq.from_representation("""\

    CCGAATTAAT
    """).__dict__

{'ovhg': 0,
 'watson': CCGAATTAAT,
 'crick': ,
 'circular': False,
 'length': 10,
 'pos': 0}

Only one space (indentation does not match) -> negative overhang and higher length

Dseq.from_representation("""\

    CCGAATTAAT
    """).__dict__

{'ovhg': -3,
 'watson': ,
 'crick': TAATTAAGCC,
 'circular': False,
 'length': 13,
 'pos': 0}

Four spaces (correct indentation) -> Seems the accurate way to type it, but ovhg = 0

Dseq.from_representation("""\

    CCGAATTAAT
    """).__dict__

{'ovhg': 0,
 'watson': ,
 'crick': TAATTAAGCC,
 'circular': False,
 'length': 10,
 'pos': 0}

Sequence full of spaces -> Accurate way to type it so it matches a 10-base-long single-stranded restriction product

Dseq.from_representation("""\

    CCGAATTAAT
    """).__dict__

{'ovhg': 10,
 'watson': ,
 'crick': TAATTAAGCC,
 'circular': False,
 'length': 10,
 'pos': 0}

More spaces than indent + crick length -> The length is higher than expected, overhang matches length

Dseq.from_representation("""\

    CCGAATTAAT
    """).__dict__

{'ovhg': 14,
 'watson': ,
 'crick': TAATTAAGCC,
 'circular': False,
 'length': 14,
 'pos': 0}
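
The values above are consistent with the overhang being the difference in leading whitespace between the watson line and the crick line, and the length being the larger of the two offset strand lengths. The following is a minimal sketch of that reading, not the pydna implementation: `model_representation` and `single_strand_text` are hypothetical helpers, and the space counts of 14 and 18 for the last two cases are inferred from the reported `ovhg` values.

```python
def model_representation(text):
    """Hypothetical model of how ovhg and length appear to be derived.

    Assumption: empty lines are dropped, the first remaining line is the
    watson strand and the second is the crick strand.
    """
    watson_line, crick_line, *_ = [ln for ln in text.splitlines() if ln]

    def leading(ln):
        # Number of whitespace characters before the first base.
        return len(ln) - len(ln.lstrip())

    ovhg = leading(watson_line) - leading(crick_line)
    watson = watson_line.strip()
    crick = crick_line.strip()
    length = max(len(watson) + max(0, ovhg), len(crick) + max(0, -ovhg))
    return ovhg, length


def single_strand_text(watson_spaces, crick="CCGAATTAAT", indent=4):
    """Build the text figure with an explicit number of spaces on line one."""
    return " " * watson_spaces + "\n" + " " * indent + crick + "\n" + " " * indent


for spaces in (0, 1, 4, 14, 18):
    # Prints the space count followed by (ovhg, length).
    print(spaces, model_representation(single_strand_text(spaces)))
```

Building the first line with `" " * n` makes the whitespace visible in a test, which is otherwise easy to get wrong by eye.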

How would you go about fixing this? I can give it a look but don't want to break anything!

BjornFJohansson commented 1 week ago

Hi, I am actually working on a related thing right now. I have some ideas for expanding the representations for dsDNA.

I wrote the `from_representation` method to go from a figure like the ones produced by `Dseq.__repr__()` back to a `Dseq` object.

This method leaves it up to the user to correctly format the sequence. This format is imho not very good for storage.

We could add errors and warnings to the method to prevent malformed input.
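
As one possible shape for such checks, here is a minimal sketch of a pre-check that could run before parsing. `check_representation` and its rules are hypothetical, not part of pydna: it assumes a strand line that contains only whitespace denotes an empty strand, and it asks that such a line be padded to exactly the width of the other strand line.

```python
import warnings


def check_representation(text):
    """Hypothetical pre-check for a two-line dsDNA text figure.

    Raises ValueError when there are not two usable strand lines; warns
    when a whitespace-only strand line is not padded to the same width
    as the other strand line.
    """
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected a watson line and a crick line")
    watson_line, crick_line = lines[0], lines[1]
    if not watson_line.strip() and not crick_line.strip():
        raise ValueError("both strand lines are empty")
    for name, line, other in (
        ("watson", watson_line, crick_line),
        ("crick", crick_line, watson_line),
    ):
        expected = len(other.rstrip())
        if not line.strip() and len(line) != expected:
            warnings.warn(
                f"empty {name} line has {len(line)} spaces, expected {expected}",
                stacklevel=2,
            )
    return watson_line, crick_line


# No warning: 14 spaces match the width of the indented crick line.
check_representation(" " * 14 + "\n" + "    CCGAATTAAT")
```

Warning instead of raising keeps existing callers working while still flagging input where the overhang would silently depend on stray spaces.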

I am curious what your use case might be?

dgruano commented 1 week ago

Both this issue and #321 happened when writing tests for the USER and Nickase enzymes. For visualization, I find it handy to create Dseq objects of the "restriction" products. However, some of these products end up being single-stranded, so I would need a way to create this "single-stranded product of a single-strand cut of a double-stranded Dseq".

I don't know if this would be a widespread use case, but it was intuitive for me. And the alternative I could think of (#321) also gave some errors.

BjornFJohansson / pydna