Edinburgh-Genome-Foundry / DnaChisel

:pencil2: A versatile DNA sequence optimizer
https://edinburgh-genome-foundry.github.io/DnaChisel/
MIT License
219 stars 40 forks source link

AvoidPattern with regex #68

Closed deto closed 1 year ago

deto commented 1 year ago

I ran into an issue recently where I was using a regex in my AvoidPattern and didn't realize that I needed to specify the size. What happened was that even though there plenty of available solutions avoiding the pattern, I was still getting an error that no solution could be found. This was because without a size, regexes dont' optimize on small windows and instead were introducing random perturbations on the entire sequence. Since it was a large sequence, the odds of any of these perturbations fixing the constraint was very low.

The fix was just to specify it as a SequencePattern instead (where I could specify the size) and feed that into AvoidPattern.

However, this really wasn't obvious to me - took quite a while of debugging and reading through the codebase to figure out.

I wanted to file this issue to suggest either a different approach to documenting or warning about this. Maybe either:

A) allow specifying 'size' in AvoidPattern (if providing a string for the pattern instead of SequencePattern object) and document the necessity of 'size' if it's a regex

or B) throw an error when a regex is specified without a size

or C) disallow specifying patterns with strings with AvoidPattern so that users have to go through SequencePattern and are more likely to see the note about how specifying size is needed (still relies on people reading all the docs, though).

Thanks for this tool - it has been incredibly useful and I'm just trying to think of ways to make it better.

Zulko commented 1 year ago

Thanks and sorry for your pain! That's one of the risks of user-friendly interfaces that try to do too much with the inputs.

My two cents:

Leaving the decision to @veghp, but I would suggest (D) no change in the code but have the docstring (and everywhere else relevant!) clearly specify that if using a regex to actively remove sites, then it should be provided as SequencePattern(my_regex, size=6) so as to inform DnaChisel during optimization.

veghp commented 1 year ago

This has already been pointed out in the SequencePattern docstring. I noted this in other places where relevant. I point to the docstring to single-source the information. But in most SequencePattern classes the size is fix and automatically calculated.

I deleted the SequencePattern docstring "(if none provided, the size of the pattern string is used)" as that's not true.

https://github.com/Edinburgh-Genome-Foundry/DnaChisel/commit/a060a9e25c7dbaa62f6a7b1b5581152eeb6b3edc