Calculating Hamming distance requires uppercase FASTA

When using embed.py, if the FASTA sequences are in lowercase ('actg') rather than uppercase ('ACTG'), the distance matrix is calculated as all zeros, which results in poor clustering. This was an opaque error to troubleshoot because the distance matrix is not saved. I had to modify the code to save the distance matrix before I realized there was a bug in the calculation.

I modified the get_hamming_distances function in Helpers.py to accept the lowercase letters as valid nucleotides:

    # Define an array of valid nucleotides to use in pairwise distance calculations.
    # Using a numpy array of byte strings allows us to apply numpy.isin later.
    nucleotides = np.array([b'A', b'T', b'C', b'G', b'a', b't', b'c', b'g'])

    # Convert genome strings into numpy arrays to enable vectorized comparisons.
    genome_arrays = [
        np.frombuffer(genome.encode(), dtype="S1")
        for genome in genomes
    ]

I edited the line beginning with "nucleotides" above.
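
To illustrate why the lowercase input produced zeros, here is a minimal sketch. It assumes the distance only counts mismatches at positions where both sequences hold a character from the `nucleotides` array. The `hamming_distance` helper is a hypothetical stand-in written for this example, not the actual code in Helpers.py.

```python
import numpy as np


def hamming_distance(seq_a, seq_b, nucleotides):
    """Count mismatches at positions where both sequences hold a valid nucleotide.

    Hypothetical simplification for illustration; not the project's function.
    """
    a = np.frombuffer(seq_a.encode(), dtype="S1")
    b = np.frombuffer(seq_b.encode(), dtype="S1")

    # Positions are only compared when both characters are in the valid set.
    valid = np.isin(a, nucleotides) & np.isin(b, nucleotides)
    return int(np.sum((a != b) & valid))


upper_only = np.array([b"A", b"T", b"C", b"G"])
both_cases = np.array([b"A", b"T", b"C", b"G", b"a", b"t", b"c", b"g"])

print(hamming_distance("ACTG", "ACTA", upper_only))  # 1
print(hamming_distance("actg", "acta", upper_only))  # 0, every position is masked out
print(hamming_distance("actg", "acta", both_cases))  # 1
```

One caveat under the same assumption: with both cases in the valid set, a mixed-case pair such as "ACTG" and "actg" would count as four mismatches, because the byte comparison is case-sensitive.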

Alternatively, raising an error when the FASTA is lowercase would have been helpful for troubleshooting.
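
A rough sketch of what that could look like, assuming the sequences are available as a list of Python strings before the distance calculation. The `normalize_genomes` function and its `strict` parameter are hypothetical names for this example and are not part of the existing code.

```python
def normalize_genomes(genomes, strict=False):
    """Return uppercase copies of the sequences.

    With strict=True, raise instead of silently converting, so that
    lowercase input is reported rather than hidden.
    Hypothetical helper for illustration only.
    """
    normalized = []
    for index, genome in enumerate(genomes):
        if strict and genome != genome.upper():
            raise ValueError(
                f"Sequence {index} contains lowercase characters; "
                "expected uppercase nucleotides (ACTG)."
            )
        normalized.append(genome.upper())
    return normalized


print(normalize_genomes(["actg", "ACTA"]))  # ['ACTG', 'ACTA']
```

Converting to uppercase up front would also avoid treating 'A' and 'a' as different characters in a mixed-case alignment, which adding lowercase letters to the valid set alone does not address.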

blab / cartography, issue #20