ctSkennerton / crass

The CRISPR assembler
http://ctskennerton.github.io/crass
GNU General Public License v3.0

changing minDR and maxDR give unexpected output #81

Open michberr opened 8 years ago

michberr commented 8 years ago

Hi Connor, I have a subset of some data in the file test.txt. Each sequence in this fasta file contains one of two DRs, which differ by only a single bp: GTTCCAATTAATCTTAAACCCTACTAGGGATTGAAAC vs GTTCCAATTAATCTTAAACCCTATTAGGGATTGAAAC (they were grepped from a metagenome).

Since these two DRs are 37 bp long, I adjusted minDR and maxDR like so (crass will throw an error if they are equal):

crass -d 37 -D 38 test.txt

When I do this, CRASS finds 0 CRISPRs. However, if I leave minDR and maxDR at their defaults of {23,47}, CRASS finds 3 CRISPRs, two of which are identical. When minDR and maxDR are {27,40}, CRASS finds 1 CRISPR. Ironically, none of these runs finds the correct number of CRISPRs for this dataset.

There are two issues here: 1) Can you explain why adjusting minDR and maxDR affects CRASS's ability to find CRISPRs even when the DRs are within the range set by the arguments? Is there another parameter I need to think about adjusting if I am adjusting minDR and maxDR?

2) CRASS will sometimes create duplicate DR groups if the clusters converge on the same consensus DR. This is a fairly undesirable outcome.

test.txt

Thanks, Michelle

ctSkennerton commented 8 years ago

Hi Michelle,

This is very concerning and obviously not the correct behavior. I'll try to look into this quickly.

Connor

ctSkennerton commented 8 years ago

Hi Michelle,

I've looked into your observations a bit and I can at least answer part of your questions. First, I think part of the problem is that the -d and -D options are used for two different things in the code. The first thing Crass does is try to find short repeated subsequences in a read, separated by at least the value of the -s option (minimum spacer length). It then extends those short subsequences as far as it can, as long as they keep matching. After this stage, Crass checks whether the repeated subsequences are within the acceptable size range (defined by -d and -D) before continuing on to the other tests and the rest of the program.
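That two-stage scan can be sketched roughly as follows. This is a hypothetical Python illustration of the idea, not Crass's actual C++ code; the seed size `k` and the one-sided extension are assumptions made for brevity.

```python
# Rough sketch (not Crass's implementation): find a short k-mer that
# occurs twice in a read, at least min_spacer bases apart, extend the
# match rightwards while the two copies agree, then apply the size
# filter that -d/-D control.
def find_repeat(read, k=8, min_spacer=26, min_dr=23, max_dr=47):
    for i in range(len(read) - k):
        # look for a second copy of the seed, leaving room for a spacer
        j = read.find(read[i:i + k], i + k + min_spacer)
        if j == -1:
            continue
        # extend to the right while the two copies keep matching
        length = k
        while j + length < len(read) and read[i + length] == read[j + length]:
            length += 1
        # keep only repeats within the acceptable size range
        if min_dr <= length <= max_dr:
            return read[i:i + length], i, j
    return None

# A read built from two full copies of Michelle's first DR (37 bp)
# around a made-up 26 bp spacer -- exactly 100 bp in total.
dr = "GTTCCAATTAATCTTAAACCCTACTAGGGATTGAAAC"
read = dr + "ACGTACGTACGTACGTACGTACGTAC" + dr
print(find_repeat(read))  # recovers the 37 bp repeat at positions 0 and 63
```

The point of the sketch is that the -d/-D filter is applied to whatever repeated subsequence the within-read scan manages to find, which for short reads is usually only a partial copy of the true DR.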

Now, the repeated subsequences in an individual read are almost never (at least with 100 bp data) going to contain the full sequence of the direct repeat, so setting -d and -D to the exact size you want the final repeat to be is a bad idea in this case, since 37*2 + 26 = 100, which happens to be your read length. The chance that two full copies of the direct repeat will fall within a single read is very low, and that's assuming the spacers are 26 bp; if they are any longer it becomes impossible. This is why you get nothing with the parameters -d 37 -D 38.
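The length budget above can be written out explicitly (illustrative only):

```python
# Can a read of length read_len hold two full DR copies plus one spacer?
def two_full_copies_fit(read_len, dr_len, spacer_len):
    return 2 * dr_len + spacer_len <= read_len

print(two_full_copies_fit(100, 37, 26))  # True -- with zero bases to spare
print(two_full_copies_fit(100, 37, 27))  # False -- any longer spacer breaks it
```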

A lot of the logic in crass is dedicated to clustering reads that appear to contain various partial matches to the same direct repeat and trying to determine the proper boundaries of the repeat and spacer.

Now I realize this behavior isn't clear, because -d and -D are also used later on in Crass, after the full length of the direct repeat has been determined, to throw out any putative CRISPRs that fall outside the size range. Which, I'm guessing, is what you were expecting when you set these options.

I'm still looking into your other problems.

ctSkennerton commented 8 years ago

Hi Michelle,

I've just pushed some new code to the GitHub repo which fixes your second bug/question. Crass should no longer output duplicate DR groups.

I also think I understand why changing the DR sizes affects the number of CRISPRs. Basically, the many partial DR sequences that are initially found get grouped using a simple single-linkage clustering algorithm that is quite parameter sensitive. Changing the maxDR and minDR options affects which partial DRs are found, and therefore how those DRs get clustered together, which then flows into the rest of the program. I don't think this is an easy fix, and I simply don't have the time to work out a better way of doing it, so this bug will remain for the foreseeable future.
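To see why single linkage is so sensitive here, consider a toy sketch. This is hypothetical Python, not Crass's actual clustering code, and the substring-based link rule is an assumption made purely for illustration:

```python
# Toy single-linkage clustering: an item joins a cluster if it links to
# ANY member, and clusters sharing a linked item are merged transitively.
def single_linkage(items, linked):
    clusters = []
    for item in items:
        # every existing cluster this item links into
        hits = [c for c in clusters if any(linked(item, m) for m in c)]
        merged = {item}
        for c in hits:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# illustrative link rule: two partial DRs are linked if one contains the other
linked = lambda a, b: a in b or b in a

dr1 = "GTTCCAATTAATCTTAAACCCTACTAGGGATTGAAAC"
dr2 = "GTTCCAATTAATCTTAAACCCTATTAGGGATTGAAAC"
# alone, the two full DRs (differing by one base) stay in separate clusters
print(len(single_linkage([dr1, dr2], linked)))  # 2
# but one partial DR from their shared 23 bp prefix chains them together
print(len(single_linkage([dr1, dr2, "GTTCCAATTAATCTTAAACCCTA"], linked)))  # 1
```

This mirrors the failure mode in the thread: whether Michelle's two near-identical DRs come out as one group or two depends entirely on which partial sequences the first stage happens to produce, which in turn depends on minDR/maxDR.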

michberr commented 8 years ago

Hi Connor,

Thanks for looking into these issues. I'm glad that the duplicate DR groups bug has now been fixed.

It seems like there should be separate arguments for the initial DR size search (whose optimum would vary with read length) and the final DR filtering step. This doesn't seem like it would be too hard to implement. At the very least, the documentation should make clearer what the existing parameters do.

The harder issue is what to do about the DR clustering. At least for my dataset, separating out these two highly similar DR types is essential. It would be ideal to have a separate parameter controlling the sensitivity of the clustering. Not quite sure how to do this, but I'll think about it some more.

I understand this software isn't your priority right now. I can try to contribute, at least on the documentation end in the next couple of weeks.