Open · de-code opened 7 years ago

Hi,

Thank you for providing the module. I was wondering whether you could provide information on the algorithms used, e.g. Smith-Waterman? (While your code is well structured, I am getting a bit lost with some of the variables.)

I am considering using / modifying it, but currently have the following blockers:

- I can't see how to get the original elements (or their indices) back from an aligned sequence.
- The backtracing is recursive and generates every alignment with the maximum score, while I only need one.
Sorry for the lack of documentation. I should improve it sometime.
Global sequence alignment uses the Needleman-Wunsch algorithm. Local sequence alignment uses the Smith-Waterman algorithm.
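For reference, the Needleman-Wunsch recurrence looks roughly like this. This is only a minimal, self-contained sketch with arbitrary example scores, not the library's actual implementation:

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    # H[i][j] is the best score for aligning a[:i] against b[:j].
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = H[i - 1][0] + gap  # a[i-1] aligned against a gap
    for j in range(1, m + 1):
        H[0][j] = H[0][j - 1] + gap  # b[j-1] aligned against a gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H  # H[n][m] is the global alignment score

H = needleman_wunsch('what a beautiful day'.split(), 'what a bad day'.split())
print(H[-1][-1])
```

Smith-Waterman uses the same recurrence but clamps every cell at zero and takes the maximum cell in the whole matrix as the score, which is what makes the alignment local.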
About your blockers:
It does give you the original elements if you decode the aligned sequences. Check out the final loop in the example. As for the indices, you're right; it doesn't give you the original indices. This can definitely be improved.
I was a little bit obsessed with generating all alignments with the max score when I wrote the backtracing algorithm, hence the recursion. It should be pretty easy to write an iterative, constant-space backtracer that stops when it finds one solution.
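Something along these lines, for example. It is only a sketch on top of the score matrix from the Needleman-Wunsch snippet above (same example scores), not code from the library, and it reports the alignment as index pairs into the inputs, which would incidentally address the missing-indices point as well:

```python
def backtrace_one(a, b, H, match=2, mismatch=-1, gap=-2):
    # Walk from the bottom-right corner back to the origin, keeping only one
    # optimal alignment. No recursion, no enumeration of co-optimal paths.
    # Each step yields an (i, j) pair of indices into a and b; None marks a gap.
    i, j = len(a), len(b)
    pairs = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return list(reversed(pairs))

a, b = 'what a beautiful day'.split(), 'what a bad day'.split()
print(backtrace_one(a, b, needleman_wunsch(a, b)))
```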
So there seem to be three issues that need fixing:

- the missing documentation of the algorithms,
- the alignments not giving you back the original indices, and
- the recursive backtracing.
I hope I can find time to address these but please feel free to contribute if you have the motivation.
Thank you for getting back to me and clarifying. Which algorithm is StrictGlobalSequenceAligner meant to implement?
Regarding getting the original elements back: I tried that, and it didn't work. I realised it is because of the Vocabulary, as it converts each element to a number. So I can get back some object that was put into a sequence, but not the one from the particular sequence I'm decoding (e.g. my object might contain the index, or a reference to where the element originally came from).
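To make the issue concrete, here is a toy version of the problem (not the library's Vocabulary, just the principle): two equal-but-distinct objects collapse onto one integer, so decoding can only hand back whichever instance was stored first.

```python
class Token:
    # Hypothetical element type: equality is by text only, but each token
    # also remembers which document it came from.
    def __init__(self, text, source):
        self.text, self.source = text, source
    def __eq__(self, other):
        return isinstance(other, Token) and self.text == other.text
    def __hash__(self):
        return hash(self.text)

class ToyVocabulary:
    # Minimal stand-in for a vocabulary: element -> int and int -> element.
    def __init__(self):
        self.to_int, self.to_elem = {}, []
    def encode(self, elem):
        if elem not in self.to_int:
            self.to_int[elem] = len(self.to_elem)
            self.to_elem.append(elem)
        return self.to_int[elem]
    def decode(self, code):
        return self.to_elem[code]

v = ToyVocabulary()
a = Token('day', source='doc1')
b = Token('day', source='doc2')
print(v.encode(a) == v.encode(b))    # True: both collapse to the same integer
print(v.decode(v.encode(b)).source)  # prints 'doc1': doc2's provenance is lost
```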
Just a few further questions before I embark on a PR:
How would you feel about making the vocabulary optional? I can see it being a good optimisation when matching, but it doesn't seem to be required for the algorithm itself (see the sketch after these questions).
How would you feel about making breaking changes? E.g. if I were to address this, I would probably change how SequenceAlignment stores sequences, and the backtracing (the latter is probably more internal).
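To illustrate what I mean by "not required for the algorithm": the dynamic programme only ever compares elements with ==, so a sketch like the Needleman-Wunsch one above runs directly on plain lists of arbitrary objects, even unhashable ones that a dict-based vocabulary could not index:

```python
# Reuses needleman_wunsch() from the sketch earlier in the thread.
left = [{'text': 'what'}, {'text': 'a'}, {'text': 'beautiful'}, {'text': 'day'}]
right = [{'text': 'what'}, {'text': 'a'}, {'text': 'day'}]
H = needleman_wunsch(left, right)
print(H[-1][-1])  # 3 matches and 1 gap: 3 * 2 - 2 = 4
```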
Both GlobalSequenceAligner and StrictGlobalSequenceAligner use Needleman-Wunsch. GlobalSequenceAligner is modified to not include the initial and terminal gaps in the result. This is a common practice in bioinformatics. It is sometimes called semiglobal or glocal alignment.
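In dynamic-programming terms, the usual way to get that behaviour is to make leading and trailing gaps free: the first row and column stay at zero, and the score is read off the maximum of the last row and column instead of the corner cell. A rough sketch of the idea (the library's code may differ in the details):

```python
def semiglobal_score(a, b, match=2, mismatch=-1, gap=-2):
    # Same recurrence as Needleman-Wunsch, but leading gaps cost nothing
    # (first row/column stay 0) and trailing gaps cost nothing (take the
    # best value in the last row or last column, not only the corner).
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return max(max(H[n]), max(row[m] for row in H))

# 'beautiful day' sits at the end of the longer sequence; the leading words
# are skipped for free instead of being charged as gaps.
print(semiglobal_score('beautiful day'.split(), 'what a beautiful day'.split()))
```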
Elements are assumed to have the equivalence relationship: they represent the same thing if and only if they are equal. If you have non-trivial objects as elements, you can implement the __eq__ and __hash__ methods in a way that satisfies this equivalence relationship.
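For example, something like this hypothetical element type (nothing from the library):

```python
class Word:
    # Two Word objects represent the same element iff their lower-cased text
    # matches; equal objects must also hash equally so vocabulary/dict
    # lookups behave consistently with ==.
    def __init__(self, text):
        self.text = text
    def __eq__(self, other):
        return isinstance(other, Word) and self.text.lower() == other.text.lower()
    def __hash__(self):
        return hash(self.text.lower())

print(Word('Day') == Word('day'))              # True
print(hash(Word('Day')) == hash(Word('day')))  # True
```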
I wrote this library out of necessity and one of my requirements was to compare all pairs of sequences in a set. That's why I needed the vocabulary: You can encode all sequences once, perform the alignments on fast integer arrays and decode only the alignments that you find useful. I think that's a very common use case for alignment tasks and I wouldn't want to lose the performance gain that comes with vocabularies. However, if you think you can generalize the existing algorithms without causing extra lag and without duplicating the code, you can go for it. I'm sure Python's duck typing can handle it :)
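Concretely, the workflow I had in mind looks roughly like this; I am writing it from memory, so treat the exact class and method names as approximate rather than as a reference:

```python
from alignment.sequence import Sequence
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, GlobalSequenceAligner

docs = ['what a beautiful day', 'what a disappointingly bad day',
        'such a beautiful day']
sequences = [Sequence(d.split()) for d in docs]

# Encode every sequence once; every alignment then runs on integer arrays.
v = Vocabulary()
encoded = [v.encodeSequence(s) for s in sequences]

aligner = GlobalSequenceAligner(SimpleScoring(2, -1), -2)

# Align all pairs on the cheap integer representation...
for i in range(len(encoded)):
    for j in range(i + 1, len(encoded)):
        score, alignments = aligner.align(encoded[i], encoded[j], backtrace=True)
        # ...and decode only the alignments that turn out to be useful.
        if score > 5:
            for a in alignments:
                print(v.decodeSequenceAlignment(a))
```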
I'm not sure how many people are actively using the library, but I'm aware that several people have used it in the past, and I wouldn't want to break their code (nor mine). An iterative, constant-space backtracer can definitely be implemented as an extension to the existing code. If you need to get rid of the vocabulary, though, I would like to see the end result before committing to it.
Okay, you have a PR.
I have to admit I don't know much about bioinformatics. I just want to align text from different sources.
I did use __eq__ and __hash__, but the issue with the vocabulary is that those elements are then equal (otherwise they wouldn't align). Since it then uses the integer representation, it isn't possible to translate the integer back to the original sequence object, unless I have the index, or the object is taken from the passed-in sequence rather than the vocabulary (both will work with the PR).
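For example, once the backtrace yields index pairs, mapping back to my own objects is trivial (hypothetical helper names, just to illustrate):

```python
def aligned_elements(seq_a, seq_b, index_pairs):
    # index_pairs is one alignment as (i, j) indices into the caller's own
    # sequences, with None marking a gap; no vocabulary round-trip involved.
    for i, j in index_pairs:
        yield (seq_a[i] if i is not None else None,
               seq_b[j] if j is not None else None)

pairs = [(0, 0), (1, None), (2, 1)]  # e.g. produced by an index-based backtracer
for left, right in aligned_elements(['a', 'b', 'c'], ['a', 'c'], pairs):
    print(left, right)
```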
I see your problem. I guess you're right. Vocabulary is not ideal in cases where you cannot simply enumerate all possible elements.
I'm pretty much convinced that an interface that supports arbitrary lists of objects would be more "pythonic" and useful for non-bioinformatics-related problems. So, I suggest we refactor the whole library without caring too much about backwards compatibility and release a version 2.0. I'll try to organize this through issues and labels.
Okay, sure. I think it would be possible to be largely backward compatible by keeping the Sequence wrappers as an optional extra. But I guess it would be good to allow a fresh start. I was also considering Cython.
If you prefer to allow more time for the changes, I could keep mine in a branch on my fork for now (that would let me carry on with what I actually wanted to use the library for).