When using embed.py, if the FASTA sequences are in lowercase ('actg') rather than uppercase ('ACTG'), the distance matrix is calculated as all zeros, which results in poor clustering. This was hard to troubleshoot because the distance matrix is not saved. I had to modify the code to save the distance matrix before I realized there was a bug in the calculation.
As a fix, I modified the get_hamming_distances function in Helpers.py to accept the lowercase letters as valid nucleotides:
# Define an array of valid nucleotides to use in pairwise distance calculations.
# Using a numpy array of byte strings allows us to apply numpy.isin later.
nucleotides = np.array([b'A', b'T', b'C', b'G', b'a', b't', b'c', b'g'])
# Convert genome strings into numpy arrays to enable vectorized comparisons.
genome_arrays = [
    np.frombuffer(genome.encode(), dtype="S1")
    for genome in genomes
]
I edited the line beginning with "nucleotides" above.
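To illustrate why lowercase input ends up as all zeros, here is a minimal sketch. It assumes the distance only counts positions where both bases are in the valid set, which is what the np.isin comment suggests. The masked_hamming helper is hypothetical and is not the actual code in Helpers.py.

```python
import numpy as np

# Hypothetical stand-in for the masked comparison; not the project's real function.
def masked_hamming(seq_a, seq_b, valid):
    a = np.frombuffer(seq_a.encode(), dtype="S1")
    b = np.frombuffer(seq_b.encode(), dtype="S1")
    # Count only positions where both bases are valid nucleotides and differ.
    both_valid = np.isin(a, valid) & np.isin(b, valid)
    return int(np.sum((a != b) & both_valid))

upper_only = np.array([b"A", b"T", b"C", b"G"])
with_lower = np.array([b"A", b"T", b"C", b"G", b"a", b"t", b"c", b"g"])

print(masked_hamming("ACTG", "ACTT", upper_only))  # 1
print(masked_hamming("actg", "actt", upper_only))  # 0, every position is masked out
print(masked_hamming("actg", "actt", with_lower))  # 1
```

One caveat with this sketch: with the extended set, a mixed-case pair such as 'A' and 'a' would still be counted as a mismatch, so uppercasing the sequences before comparison may be the safer option.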
Alternatively, throwing an error if fasta is lowercase would have been helpful for troubleshooting.
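A rough sketch of what such a check could look like, assuming sequences are available as plain Python strings before the distance calculation. The normalize_genome function and its strict flag are hypothetical names and do not exist in the project.

```python
def normalize_genome(name, sequence, strict=False):
    """Return the sequence in uppercase, or raise if strict and not already uppercase.

    Hypothetical helper for illustration only.
    """
    upper = sequence.upper()
    if strict and sequence != upper:
        raise ValueError(
            f"Sequence '{name}' contains lowercase characters; "
            "expected uppercase nucleotides (e.g. 'ACTG')."
        )
    return upper

# Default behavior: silently normalize to uppercase.
print(normalize_genome("sample1", "actg"))  # ACTG

# Strict behavior: fail loudly so the problem is visible immediately.
try:
    normalize_genome("sample1", "actg", strict=True)
except ValueError as error:
    print(error)
```

Either behavior would have surfaced the problem earlier than an all-zero distance matrix does.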