bailey-lab / SeekDeep

Bioinformatic Tools for analyzing targeted amplicon sequencing developed by Nicholas Hathaway of Bailey Lab
http://seekdeep.brown.edu/
GNU Lesser General Public License v3.0

--caseInsensitive #10

Open nikosdarzentas opened 4 years ago

nikosdarzentas commented 4 years ago

Hi Nick,

--caseInsensitive seems to have no effect, if I understand it correctly. When I activate it (it's false by default) and cluster two sequences that differ by one mismatch (an 'a' to 't' near the end, below) PLUS one case-only difference (a 'G' vs 'g' at the 3rd position, below), the sequences are not clustered together.

stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
9999999:0:0:0:0:0:1:0

@1;size=1
acGtACCCCCGTacgattttt
+
****HHHHHHHH*********
@2;size=1
acgtACCCCCGTacgtttttt
+
****HHHHHHHH*********

SeekDeep qluster --fastq qlutest.fastq --out qlutest --par myPar --noMarkChimeras --lower keep --caseInsensitive --smallReadSize 0 --useAllInput --writeOutInitalSeqs --overWrite --verbose --fastClustering --nucCutOff 0.2 --runCutOff 0%,0 --adjustHomopolyerRuns false --qualThresWindow 0
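To make the expectation concrete, here is a minimal Python sketch of the comparison being described. It is not SeekDeep's code or algorithm: `count_mismatches` is a hypothetical helper, the reads are compared position by position without alignment, and quality values are ignored. It only assumes that the single allowed mismatch comes from the LQMismatches value of 1 in the parameter file above.

```python
def count_mismatches(a: str, b: str, case_insensitive: bool = False) -> int:
    """Count per-position differences between two equal-length sequences.

    Hypothetical helper for illustration only; no alignment is performed.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    if case_insensitive:
        # Fold case so that 'g' and 'G' compare as the same base.
        a, b = a.upper(), b.upper()
    return sum(1 for x, y in zip(a, b) if x != y)


# The two reads from the test FASTQ above.
read1 = "acGtACCCCCGTacgattttt"
read2 = "acgtACCCCCGTacgtttttt"

# Assumption: one mismatch is tolerated (LQMismatches = 1 in the par file).
allowed_mismatches = 1

sensitive = count_mismatches(read1, read2)
insensitive = count_mismatches(read1, read2, case_insensitive=True)

print("case-sensitive:", sensitive, "within limit:", sensitive <= allowed_mismatches)
print("case-insensitive:", insensitive, "within limit:", insensitive <= allowed_mismatches)
```

With case respected, the reads differ at two positions (the 'G'/'g' and the 'a'/'t'), which exceeds the limit; with case folded, only the 'a'/'t' difference remains, so under this simplified model the reads would be expected to collapse.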

I had a quick look at the code and it seems hardcoded. Can you confirm? Or am I confused?

Many thanks.

nickjhathaway commented 4 years ago

I think I changed this in the developmental version that I'm about to release, so that a case difference is no longer marked as an error. Honestly, though, I've never tested clustering with both cases still present. I do know that consensus building won't work properly, as it counts upper and lower case as different no matter what is set. Is there a reason you have to keep the lower case if you are going to allow the cases to match?
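The consensus concern can be illustrated with a small Python sketch. This is not SeekDeep's consensus code; `column_counts` is a hypothetical helper that simply tallies the characters seen in each column of equal-length reads, with and without case folding.

```python
from collections import Counter
from typing import List


def column_counts(seqs: List[str], fold_case: bool) -> List[Counter]:
    """Tally the characters observed in each column of equal-length reads.

    Hypothetical helper for illustration only.
    """
    if fold_case:
        seqs = [s.upper() for s in seqs]
    # zip(*seqs) yields one tuple of characters per column.
    return [Counter(column) for column in zip(*seqs)]


reads = ["acGtACCCCCGTacgattttt", "acgtACCCCCGTacgtttttt"]

# Third position (index 2): 'G' in the first read, 'g' in the second.
print(column_counts(reads, fold_case=False)[2])
print(column_counts(reads, fold_case=True)[2])
```

Without folding, the third column holds one 'G' and one 'g', so a per-column vote sees two competing characters even though the base is the same; with folding, both reads support a single 'G'.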

nikosdarzentas commented 4 years ago

Hi Nick,

As I explained in the other issue about homopolymers, I use case (and qualities) to encode antigen receptor rearrangement junction information - so, in fact, I was thinking of (maybe) eventually penalising the same nt coming from different junction regions by penalising different cases. And this was my test to understand if and how this SeekDeep option works.

In general, SeekDeep's powerful option set is what got me to try it, because it would allow me to 'hack' it to have full control over what happens with my complex input set. To be honest, I've been stumbling over different bits and pieces (another, which I haven't reported yet, has to do with the HQ/LQ mismatch and the --qualThres and --qualRep options...), which have put me off a bit - but trust me, I know how hard it is to keep all these options running smoothly, esp. for a new user and use case. I'll keep following SeekDeep for a while, and I'll try to test the new version when it comes out - hopefully I'll be able to use it in the end.