blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Calculating hamming distance requires uppercase fasta #20

Closed: cassiawag closed this issue 2 years ago

cassiawag commented 2 years ago

When using embed.py, if the FASTA is in lowercase ('actg') rather than uppercase ('ACTG'), the distance matrix is calculated as all zeros, which results in poor clustering. This was an opaque error to troubleshoot because the distance matrix is not saved. I had to modify the code to save the distance matrix before I realized there was a bug in the calculation.

I modified the get_hamming_distances function in Helpers.py to accept the lowercase letters as valid nucleotides.

    # Define an array of valid nucleotides to use in pairwise distance calculations.
    # Using a numpy array of byte strings allows us to apply numpy.isin later.
    nucleotides = np.array([b'A', b'T', b'C', b'G', b'a', b't', b'c', b'g'])

    # Convert genome strings into numpy arrays to enable vectorized comparisons.
    genome_arrays = [
        np.frombuffer(genome.encode(), dtype="S1")
        for genome in genomes
    ]

I edited the line beginning with "nucleotides" above.
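
For anyone trying to reproduce this, here is a minimal sketch of the failure mode. It assumes the distance counts mismatches only at positions where both sequences hold a character from the valid-nucleotide array; that is my reading of the behavior, not a copy of the package code. The function name `hamming_distance_sketch` and its `valid` parameter are hypothetical.

```python
import numpy as np


def hamming_distance_sketch(genome_a, genome_b, valid=b"ATCG"):
    """Count mismatches at positions where both genomes hold a valid nucleotide.

    Hypothetical stand-in for illustration only; assumes equal-length sequences.
    """
    # One-byte strings, so numpy.isin can compare against "S1" arrays.
    nucleotides = np.array([bytes([code]) for code in valid])

    a = np.frombuffer(genome_a.encode(), dtype="S1")
    b = np.frombuffer(genome_b.encode(), dtype="S1")

    # Positions with any character outside `valid` are ignored entirely.
    both_valid = np.isin(a, nucleotides) & np.isin(b, nucleotides)
    return int(((a != b) & both_valid).sum())


upper = hamming_distance_sketch("ACTG", "ACTA")
lower = hamming_distance_sketch("actg", "acta")
lower_allowed = hamming_distance_sketch("actg", "acta", valid=b"ATCGatcg")
```

With an uppercase-only array, every lowercase position is masked out, so the lowercase pair compares as identical even though the sequences differ at one site. Adding the lowercase letters to the array restores the expected count.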

Alternatively, throwing an error if the FASTA is lowercase would have been helpful for troubleshooting.
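
A rough sketch of what that check could look like, run on the sequences before any distances are computed. The helper name `normalize_genomes` and its `strict` flag are hypothetical and not part of the package; the sketch either raises on lowercase input or uppercases it.

```python
def normalize_genomes(genomes, strict=False):
    """Return uppercased genomes, or raise if strict and lowercase is found.

    Hypothetical helper for illustration; not taken from the package.
    """
    normalized = []
    for index, genome in enumerate(genomes):
        upper = genome.upper()
        if strict and upper != genome:
            # Fail loudly instead of silently producing an all-zero matrix.
            raise ValueError(
                f"Sequence {index} contains lowercase characters; "
                "convert the FASTA to uppercase before computing distances."
            )
        normalized.append(upper)
    return normalized


cleaned = normalize_genomes(["actg", "ACTA"])
```

Uppercasing silently is more convenient, while the strict mode makes the problem visible at load time; which one fits depends on how much the tool should alter its input.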

nandsra21 commented 2 years ago

Just pushed a new release to pip that should fix this issue. (https://pypi.org/project/pathogen-embed/0.0.2/)