biopython / biopython

Official git repository for Biopython (originally converted from CVS)
http://biopython.org/
Other
4.33k stars 1.75k forks source link

Bug when trying to do pairwise alignment on two lists where the elements are strings of length 1 #4431

Open trenslow opened 1 year ago

trenslow commented 1 year ago

Setup

I am reporting a problem with Biopython version, Python version, and operating system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)

3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
CPython
Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
1.81

Expected behaviour

I would expect that I can use the PairwiseAligner to align two lists of single-character strings (doable in #4290 )

Actual behaviour

When I attempt to align two lists of single-character strings, I get the following error:

ValueError: sequence has unexpected format

Steps to reproduce

from Bio.Align import PairwiseAligner
from Bio.Align.substitution_matrices import Array
aligner = PairwiseAligner()

word_1= ['b', 'a', 'n', 'a', 'n', 'a', 's']
word_2 = ['b', 'a', 'h', 'a', 'm', 'a', 's']

alph = set(word_1 + word_2)
distances = {}
for char1 in alph:
    for char2 in alph:
        distances[(char1, char2)] = substitution_quantifier(char1, char2)
aligner.substitution_matrix = Array(data=distances)

alignments = aligner.align(word_1, word_2)
mdehoon commented 1 year ago

Why would you want to do this, rather than aligning "bananas" to "bahamas" ?

trenslow commented 1 year ago

Depending on the language that bananas and bahamas are said in, determines how "close" the individual letters/sounds are. That's why I would like to have the flexibility to dynamically create the substitution matrix. It could be that the "distance" in one language is far enough that I would prefer to open a gap where the characters line up as opposed to aligning them.

mdehoon commented 1 year ago

I meant use

word_1= 'bananas'
word_2 = 'bahamas'

instead of

word_1= ['b', 'a', 'n', 'a', 'n', 'a', 's']
word_2 = ['b', 'a', 'h', 'a', 'm', 'a', 's']
trenslow commented 1 year ago

In this particular use-case I don't know ahead of time how long each element in the list will be, so I essentially have logic to handle this special case where all elements of the list are of length 1, as you've suggested here.

I just thought it made sense to report this as an issue, as I wouldn't expect it to cause a failed alignment, especially since it wasn't failing to align back in #4290.

mdehoon commented 1 year ago

The problem happens here:

>>> for char1 in alph:
...     for char2 in alph:
...         distances[(char1, char2)] = substitution_quantifier(char1, char2)
>>> substitution_matrix = Array(data=distances)

There is no way for Array to know that char1 and char2 are coming from a list of characters instead of from a string. The latter is the most common use case, so it will default to that:

>>> substitution_matrix.alphabet
'abhmns'

and then aligner.align will expect a string instead of a list of characters.

trenslow commented 1 year ago

Is there a better way to create and instantiate the substitution matrix so that the aligner isn't surprised?

mdehoon commented 1 year ago

I don't think there is at the moment, but it may not be so difficult to add that capability. Would you be willing to create a PR for that?

trenslow commented 1 year ago

I unfortunately don't have any knowledge of C to be able to create a PR.

mdehoon commented 1 year ago

@trenslow

I dont think C will be needed for a PR.

trenslow commented 1 year ago

I have been trying today to update the code, but a couple things are still unclear to me.

  1. What does the development flow look like? It's not really obvious to me after reading the contributing doc. Normally I code Python services, so I would change code, rerun the service and call the endpoint under development to see if my changes work. How do Biopython devs normally check that their changed code is working?

  2. What are we looking to change here? Are we looking to allow the aligner to accept lists of single strings if the alphabet is just a string, or are we looking for the alphabet to be represented as a list of single strings, rather than representing it as a single string?

mdehoon commented 1 year ago

How do Biopython devs normally check that their changed code is working?

The tests run automatically when you submit a PR.

What are we looking to change here? Are we looking to allow the aligner to accept lists of single strings if the alphabet is just a string, or are we looking for the alphabet to be represented as a list of single strings, rather than representing it as a single string?

The latter, but only if the user explicitly requests it. By default, we want to use a single string.

trenslow commented 1 year ago

my use case isn't covered by the tests so i would want to check the functionality locally before opening the PR

mdehoon commented 1 year ago

my use case isn't covered by the tests so i would want to check the functionality locally before opening the PR

Sure. Go for it. Or you can add your example to the tests.

peterjc commented 1 year ago

See also https://github.com/biopython/biopython/blob/master/CONTRIBUTING.rst for background information on making a contribution.

trenslow commented 1 year ago

@mdehoon I've been breaking my head on this for the last couple weeks and I don't really see how I can change the code while staying within the requirements.

One quite simple fix for my use-case would be to remove the logic that collapses the alphabet into a single string if all the keys in the data dict are single letters. It doesn't cause any alignment tests to fail as far as I can tell, but it doesn't follow the requirement that the alphabet is expressed as a single string by default.

My original use-case is creating a custom substitution matrix, with varying values depending on the substitution. To do that, I have to pass a data dict on Array creation. The current way to "inform" the aligner of what data types might be aligned depends on how the data dict keys are shaped. If I introduce logic to cover some sort of dict key shape that uniquely identifies that the alphabet should be list, the branching/nesting gets even more complex than it already is (pylint is complaining that the new method has 27 branches, where the max should be 12, as well as 59 statements, where the max should be 50).

Maybe the more robust solution would be to warn users when passing data types to the align function that will cause a failed alignment due to the different data type of the alphabet. Because as it stands now, my use-case ends up in the C code with no clear indication of what went wrong. When I initially found the bug, I spent a lot of time doing trial and error to figure out what the source of the bug was.

mdehoon commented 1 year ago

One quite simple fix for my use-case would be to remove the logic that collapses the alphabet into a single string if all the keys in the data dict are single letters. It doesn't cause any alignment tests to fail as far as I can tell, but it doesn't follow the requirement that the alphabet is expressed as a single string by default.

If the alphabet is a string, then the aligner knows that the sequences consist of single characters, and it can use the integer value of each character to find the index of the row and column in the scoring matrix.

If the alphabet is a list, then it can contain any kind of object. The aligner then needs to compare each item in the two sequences to each item in the alphabet to find the row and column index. This is much slower.

mdehoon commented 1 year ago

I think all that is needed is to modify Array.__new__ such that it does not require the alphabet to be None if data is a dictionary. Then some code needs to be added to confirm that alphabet and data are consistent in that case. If pylint complains, just ignore it.