bcgsc / btllib

Bioinformatics Technology Lab common code library
Other
23 stars 5 forks source link

ntHash with a Phred filter #107

Open jwcodee opened 1 year ago

jwcodee commented 1 year ago

Enhancement: I am thinking we could have a ntHash iterator with a phred filter. basically we could add a member which is a rolling sum of the Phred quality rather than the average. Every time we compute the k-mer hash, we check the quality score. If pass, we output, otherwise roll to the next hash.

parham-k commented 1 year ago

Good idea, I like it. Is it something we'd like to implement directly in ntHash though? I think it can be a separate class (as a template, e.g., PhredFilter\<NtHash>) that rolls ntHash and updates the Phred sum in each iteration.

jwcodee commented 1 year ago

Yeah that would be my approach too. The new class checks the phred sum / avg and rolls the nthash iterator as necessary. Are you thinking of template because of seednthash?

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

parham-k commented 1 year ago

Are you thinking of template because of seednthash?

Yeah basically I was thinking that since we have four ntHash classes, it'd be unusual to have a new feature implemented in only one of them. Although I'm not sure if Phread scores in spaced seeds make sense.

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

For that we can just keep a rolling minimum.

jwcodee commented 1 year ago

Are you thinking of template because of seednthash?

Yeah basically I was thinking that since we have four ntHash classes, it'd be unusual to have a new feature implemented in only one of them. Although I'm not sure if Phread scores in spaced seeds make sense.

Aren't they all the same nthash class under the hood with different constructor? We could just have four different constructor for phredntHash too

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

For that we can just keep a rolling minimum.

Hmm don't think that can work because what happens if you have multiple bases that violate your threshold. I think the best way is to skip k bases when you encounter low quality bases.

parham-k commented 1 year ago

Are you thinking of template because of seednthash?

Yeah basically I was thinking that since we have four ntHash classes, it'd be unusual to have a new feature implemented in only one of them. Although I'm not sure if Phread scores in spaced seeds make sense.

Aren't they all the same nthash class under the hood with different constructor? We could just have four different constructor for phredntHash too

They have the same methods, but there's no OO relationship between them. They're more like four separate classes, so we'd have to implement something like SeedPhreadNtHash and also two blind versions-BlindPhreadNtHash and BlindSeedPhreadNtHash :D

I'm more or less on the template class side, it'll be much cleaner. We can chat about this more, your call eventually.

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

For that we can just keep a rolling minimum.

Hmm don't think that can work because what happens if you have multiple bases that violate your threshold. I think the best way is to skip k bases when you encounter low quality bases.

I was thinking of the range minimum query (RMQ) problem, but since we're sliding a window, we should be able to use a deque even in a case of multiple low-quality bases (e.g., this solution?)

jwcodee commented 1 year ago

Are you thinking of template because of seednthash?

Yeah basically I was thinking that since we have four ntHash classes, it'd be unusual to have a new feature implemented in only one of them. Although I'm not sure if Phread scores in spaced seeds make sense.

Aren't they all the same nthash class under the hood with different constructor? We could just have four different constructor for phredntHash too

They have the same methods, but there's no OO relationship between them. They're more like four separate classes, so we'd have to implement something like SeedPhreadNtHash and also two blind versions-BlindPhreadNtHash and BlindSeedPhreadNtHash :D

I'm more or less on the template class side, it'll be much cleaner. We can chat about this more, your call eventually.

In that case, let's do template then.

The other thing I'm not 100% on is if we should do rolling sum or keep an array of quality score and make sure no single quality score is below the threshold.

For that we can just keep a rolling minimum.

Hmm don't think that can work because what happens if you have multiple bases that violate your threshold. I think the best way is to skip k bases when you encounter low quality bases.

I was thinking of the range minimum query (RMQ) problem, but since we're sliding a window, we should be able to use a deque even in a case of multiple low-quality bases (e.g., this solution?)

This will need some discussion too since we need to decide which one to go with. I do worry that on a base lvl it might be too stringent.

Of course we could implement both