Koeng101 / dnadesign

A Go package for designing DNA.

Megamash cannot call complicated sequences #64

Open Koeng101 opened 8 months ago

Koeng101 commented 8 months ago

seq_recovery=0.4152_0, for example, cannot be processed into a megamash table because it doesn't have any unique 16mers. This is because of a flawed assumption: ESPECIALLY in protein variant libraries, 16mer windows won't necessarily be unique - and even if a sequence is unique overall, a 16mer sliding window may not be able to pick that up. This is a real issue.
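
For illustration, here is a standalone sketch (not the megamash code itself) that checks every template's 16mers against the rest of the library; any template whose set comes back empty is one megamash currently can't call:

def unique_kmers(templates, k=16):
    # templates: dict mapping template name -> sequence string
    kmer_sets = {
        name: {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for name, seq in templates.items()
    }
    unique = {}
    for name, kmers in kmer_sets.items():
        others = set()
        for other_name, other_kmers in kmer_sets.items():
            if other_name != name:
                others |= other_kmers
        unique[name] = kmers - others
    return unique

# Any template whose entry is an empty set has no unique 16mer and can't be
# distinguished by the current megamash table.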

I'm still trying to figure out how to fix this.

templateMap.csv

CamelCaseCam commented 8 months ago

I'm just throwing out ideas here, but what if you design a hash that takes into account the history of the sequence, but with less and less importance the further along you move? I'm thinking something like the following:

hash[n] = hash(chars[n]) + (hash[n-1] >> 1)

I think something like this would take much larger blocks of nucleotides into account, while making it so that nucleotides that are farther back don't have as much of an effect (so the core idea of the algorithm should still work)

If you decided to go with something like this, I think an even better solution would be combining k-mers and this rolling hash. You'd just have to check twice as many hashes, which may not be a problem with a bloom filter + CUDA

Koeng101 commented 8 months ago

> If you decided to go with something like this, I think an even better solution would be combining k-mers and this rolling hash. You'd just have to check twice as many hashes, which may not be a problem with a bloom filter + CUDA

The thing is that the importance does not diminish as you go along - if anything, it would increase. So I don't think a rolling hash is the answer.

I think the answer might be in suffix arrays - i.e., https://github.com/bebop/poly/issues/42 and https://github.com/bebop/poly/issues/15, but applied to this problem.
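
A suffix array (or generalized suffix tree) over all templates would make this efficient; as a rough illustration of the idea only, here is a brute-force sketch that finds, for each template, the shortest substring absent from every other template:

def shortest_unique_substring(target, others):
    # Brute-force stand-in for a suffix-array query: return the shortest
    # substring of `target` that appears in none of the other templates.
    for k in range(1, len(target) + 1):
        for i in range(len(target) - k + 1):
            sub = target[i:i + k]
            if not any(sub in other for other in others):
                return sub
    return None  # target is entirely contained in another template

# With a generalized suffix array over all templates, the same query can be
# answered without the nested scan, which is what matters at library scale.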

CamelCaseCam commented 8 months ago

That was my bad - I forgot to include the masking. You'd obviously need to mask it so that the least significant bit disappears. I'm actually testing this right now because I feel like it could still work

CamelCaseCam commented 8 months ago

Okay I tested it and it does converge back to the same hash after a certain number of characters. Here's my test script - converged after an average of 32 chars, maximum 47 chars out of 100.

from hash import hash_string
import random
import string

def test_hash():
    # Generate a random string of length 100
    s = ''.join(random.choices(string.ascii_letters, k=100))
    # Store the hashes of the original string
    orig_hashes = hash_string(s)
    # Mutate the first letter of the string
    s = (random.choice(string.ascii_letters)) + s[1:]
    # Store the hashes of the mutated string
    mut_hashes = hash_string(s)
    # Compare the hashes and count how long it takes for them to become equal
    equal_at = None
    for i, (o, m) in enumerate(zip(orig_hashes, mut_hashes)):
        if o == m:
            equal_at = i
            break

    return equal_at if equal_at is not None else len(s)

# Define n and results
n = 1000  # Number of tests to run
results = []  # List to store the results

# Run the test n times
for _ in range(n):
    equal_at = test_hash()
    results.append(equal_at)

# Calculate and print the average, maximum, and minimum number of characters until the hashes converge
times = [result for result in results if result is not None]
print(f"Average: {sum(times) / len(times)}")
print(f"Maximum: {max(times)}")
print(f"Minimum: {min(times)}")

CamelCaseCam commented 8 months ago

hash_string is in another file and is just a dummy hash function:

def hash_string(s):
    hashes = [hash(s[0])]
    for n in range(1, len(s)):
        # the shift drops the previous hash's low bit; the mask keeps the value within 31 bits
        hashes.append(hash(s[n]) + ((hashes[n - 1] >> 1) & 0x7FFFFFFF))
    return hashes

CamelCaseCam commented 8 months ago

After moving to a 64-bit hash, I get an average convergence in 64 chars, maximum in 90. It seems that for a hash of k bits, this approach converges back to the same value after an average of k iterations, maximum of about 1.4k - presumably because each step shifts the previous hash right by one bit, so a single-character difference is pushed out of the hash after roughly k steps.

Koeng101 commented 8 months ago

Hey, the other option I just thought of - don't force the kmer to only occur once. Just make it appear as few times as possible, and for any kmer that is chosen, have a system that penalizes kmers matching across identifiers too often.

This lets you do the classic bloom filters like you were talking about before, and doesn't require a rolling hash, just some additional upfront computation.
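
A rough sketch of that selection step (hypothetical helper, not the megamash implementation) - count how many templates each kmer appears in, then keep the least-shared kmers for each template:

from collections import Counter

def kmers(seq, k=16):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def least_shared_kmers(templates, k=16, per_template=3):
    # Count, for every kmer, how many distinct templates contain it.
    occurrence = Counter()
    for seq in templates.values():
        occurrence.update(kmers(seq, k))
    # For each template, keep the kmers shared with the fewest other
    # templates - unique kmers sort first, heavily shared ones are penalized.
    return {
        name: sorted(kmers(seq, k), key=lambda km: occurrence[km])[:per_template]
        for name, seq in templates.items()
    }

The selected kmers could then feed the bloom-filter approach mentioned above.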

CamelCaseCam commented 8 months ago

Would that solve the issue of distinguishing between sequences that don't have unique k-mers, though?

Koeng101 commented 8 months ago

It does not.

I think the best option might just be a two-pass approach. First, compile all potential templates for a given well (based on small megamashes from the oligos, which are sufficiently unique), then either do another megamash on top with just the potential templates or do multiple sequence alignments, and choose the best one.

In that case, you're looking for much smaller oligos, and can usually get a unique kmer. Even in the case that you cannot get a unique kmer, the composition that you're looking for (potential target templates) is sufficiently limited that it doesn't really matter.
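
Schematically, the two-pass idea might look something like this (all names here are hypothetical placeholders, not the dnadesign API):

def call_read(read, well_oligos, well_templates):
    # Pass 1: narrow the candidates using short, (near-)unique oligo sequences.
    candidates = {
        name: well_templates[name]
        for name, oligo in well_oligos.items()
        if oligo in read
    }
    if not candidates:
        return None
    # Pass 2: among the few remaining candidates, pick the best match. A real
    # implementation would re-run megamash or align; a crude shared-kmer
    # count stands in for that score here.
    def score(template, k=16):
        read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        return sum(template[i:i + k] in read_kmers
                   for i in range(len(template) - k + 1))
    return max(candidates, key=lambda name: score(candidates[name]))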

CamelCaseCam commented 7 months ago

Hey - sorry for the delay, I was off on reading week and focusing on seeing family. It sounds like a CUDA implementation of megamash would still be useful, so I'll do that. I'm going to include a flag in the program to use a k-mer hash vs a rolling hash (or both), since I still think the rolling hash could work.

Can you send me some test files + the output from megamash on those test files? That way, once I finish the algorithm, I'll know it's working.