Open trenslow opened 1 year ago
Why would you want to do this, rather than aligning "bananas" to "bahamas" ?
Depending on the language that bananas and bahamas are said in, determines how "close" the individual letters/sounds are. That's why I would like to have the flexibility to dynamically create the substitution matrix. It could be that the "distance" in one language is far enough that I would prefer to open a gap where the characters line up as opposed to aligning them.
I meant use
word_1= 'bananas'
word_2 = 'bahamas'
instead of
word_1= ['b', 'a', 'n', 'a', 'n', 'a', 's']
word_2 = ['b', 'a', 'h', 'a', 'm', 'a', 's']
In this particular use-case I don't know ahead of time how long each element in the list will be, so I essentially have logic to handle this special case where all elements of the list are of length 1, as you've suggested here.
I just thought it made sense to report this as an issue, as I wouldn't expect it to cause a failed alignment, especially since it wasn't failing to align back in #4290.
The problem happens here:
>>> for char1 in alph:
... for char2 in alph:
... distances[(char1, char2)] = substitution_quantifier(char1, char2)
>>> substitution_matrix = Array(data=distances)
There is no way for Array
to know that char1
and char2
are coming from a list of characters instead of from a string.
The latter is the most common use case, so it will default to that:
>>> substitution_matrix.alphabet
'abhmns'
and then aligner.align
will expect a string instead of a list of characters.
Is there a better way to create and instantiate the substitution matrix so that the aligner isn't surprised?
I don't think there is at the moment, but it may not be so difficult to add that capability. Would you be willing to create a PR for that?
I unfortunately don't have any knowledge of C to be able to create a PR.
@trenslow
I dont think C will be needed for a PR.
I have been trying today to update the code, but a couple things are still unclear to me.
What does the development flow look like? It's not really obvious to me after reading the contributing doc. Normally I code Python services, so I would change code, rerun the service and call the endpoint under development to see if my changes work. How do Biopython devs normally check that their changed code is working?
What are we looking to change here? Are we looking to allow the aligner to accept lists of single strings if the alphabet is just a string, or are we looking for the alphabet to be represented as a list of single strings, rather than representing it as a single string?
How do Biopython devs normally check that their changed code is working?
The tests run automatically when you submit a PR.
What are we looking to change here? Are we looking to allow the aligner to accept lists of single strings if the alphabet is just a string, or are we looking for the alphabet to be represented as a list of single strings, rather than representing it as a single string?
The latter, but only if the user explicitly requests it. By default, we want to use a single string.
my use case isn't covered by the tests so i would want to check the functionality locally before opening the PR
my use case isn't covered by the tests so i would want to check the functionality locally before opening the PR
Sure. Go for it. Or you can add your example to the tests.
See also https://github.com/biopython/biopython/blob/master/CONTRIBUTING.rst for background information on making a contribution.
@mdehoon I've been breaking my head on this for the last couple weeks and I don't really see how I can change the code while staying within the requirements.
One quite simple fix for my use-case would be to remove the logic that collapses the alphabet into a single string if all the keys in the data dict are single letters. It doesn't cause any alignment tests to fail as far as I can tell, but it doesn't follow the requirement that the alphabet is expressed as a single string by default.
My original use-case is creating a custom substitution matrix, with varying values depending on the substitution. To do that, I have to pass a data dict on Array creation. The current way to "inform" the aligner of what data types might be aligned depends on how the data dict keys are shaped. If I introduce logic to cover some sort of dict key shape that uniquely identifies that the alphabet should be list, the branching/nesting gets even more complex than it already is (pylint is complaining that the new method has 27 branches, where the max should be 12, as well as 59 statements, where the max should be 50).
Maybe the more robust solution would be to warn users when passing data types to the align function that will cause a failed alignment due to the different data type of the alphabet. Because as it stands now, my use-case ends up in the C code with no clear indication of what went wrong. When I initially found the bug, I spent a lot of time doing trial and error to figure out what the source of the bug was.
One quite simple fix for my use-case would be to remove the logic that collapses the alphabet into a single string if all the keys in the data dict are single letters. It doesn't cause any alignment tests to fail as far as I can tell, but it doesn't follow the requirement that the alphabet is expressed as a single string by default.
If the alphabet is a string, then the aligner knows that the sequences consist of single characters, and it can use the integer value of each character to find the index of the row and column in the scoring matrix.
If the alphabet is a list, then it can contain any kind of object. The aligner then needs to compare each item in the two sequences to each item in the alphabet to find the row and column index. This is much slower.
I think all that is needed is to modify Array.__new__
such that it does not require the alphabet
to be None
if data
is a dictionary. Then some code needs to be added to confirm that alphabet
and data
are consistent in that case. If pylint complains, just ignore it.
Setup
I am reporting a problem with Biopython version, Python version, and operating system as follows:
Expected behaviour
I would expect that I can use the PairwiseAligner to align two lists of single-character strings (doable in #4290 )
Actual behaviour
When I attempt to align two lists of single-character strings, I get the following error:
ValueError: sequence has unexpected format
Steps to reproduce