claczny / VizBin

Repository of our application for human-augmented binning

Why does the confirmation for number of kmers NOT always pop-up? #37

Closed ashishdamania closed 8 years ago

ashishdamania commented 8 years ago

Below is the attached screenshot. Also, it seems that the number of kmers shown is not equal to theoretical max: kmers=Total Length +1 -K (Kmer length). Can you please check it? I can upload my test file if required.

Thanks for making VizBin.

[Screenshot taken 2016-06-19 at 5:37:14 pm]

claczny commented 8 years ago

Hi,

this particular dialogue is meant to inform the user about the number of non-default-DNA-alphabet letters in the provided sequences, specifically how many kmers were affected by such letters, e.g., N in ACGNT. Accordingly, the dialogue will only appear if there is at least one non-default-DNA-alphabet letter in your retained sequences, i.e., sequences equal to or longer than the specified minimum length (default of 1,000 nt). More generally speaking, each such letter will affect k k-mers in the worst case, because a letter in the interior of a sequence lies in k overlapping windows of length k.
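To make the "k k-mers in the worst case" statement concrete, here is a minimal Python sketch that counts the windows of length k containing at least one letter outside A, C, G, T. The function name `count_affected_kmers` is hypothetical, and treating lowercase letters as valid is an assumption; this illustrates the counting idea and is not VizBin's actual code.

```python
def count_affected_kmers(seq: str, k: int, alphabet: str = "ACGT") -> tuple[int, int]:
    """Return (total_windows, affected_windows) for a single sequence.

    A window is 'affected' if it contains at least one letter that is
    not in `alphabet`. Hypothetical helper, not taken from VizBin.
    """
    seq = seq.upper()  # assumption: lowercase a/c/g/t count as valid letters
    total = max(len(seq) - k + 1, 0)  # number of length-k windows
    affected = sum(
        1
        for i in range(total)
        if any(c not in alphabet for c in seq[i:i + k])
    )
    return total, affected


# The example from the text: N in ACGNT, with k = 2.
print(count_affected_kmers("ACGNT", 2))      # (4, 2): GN and NT are affected
# An interior N touches exactly k windows (here k = 3).
print(count_affected_kmers("ACGTNACGT", 3))  # (7, 3)
```

A letter close to either end of a sequence lies in fewer than k windows, which is why k is only the worst case.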

It is not unexpected to have some kmers being ignored and in your particular case, the frequency is really low, so everything should be fine.

Regarding

the number of kmers shown is not equal to theoretical max: kmers=Total Length +1 -K (Kmer length)

this formula is correct. If I understand your question correctly, "Total Length" should then represent the cumulative length of the retained sequences in your provided FASTA file. Does it not match that number in your case?
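As a rough illustration of what "cumulative length of the retained sequences" means, the following Python sketch parses FASTA text, keeps only sequences at or above a minimum length, and returns their lengths. The function name `retained_lengths` and the tiny example input are made up for illustration; only the "equal to or longer than the minimum length" rule and the default of 1,000 come from the explanation above.

```python
def retained_lengths(fasta_text: str, min_length: int = 1000) -> list[int]:
    """Return the lengths of all FASTA records with length >= min_length.

    Hypothetical helper: a plain-Python FASTA reader, not VizBin's parser.
    """
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)  # sequence lines may be wrapped
    if current is not None:
        lengths.append(current)
    return [n for n in lengths if n >= min_length]


# Toy input with a low threshold so the filtering is visible.
example = ">a\nACGT\nAC\n>b\nAC\n"
kept = retained_lengths(example, min_length=3)
print(kept, sum(kept))  # [6] 6
```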

Hope that helps and thank you for your interest in VizBin.

Best,

Cedric

ashishdamania commented 8 years ago

Hi Cedric, Thanks for the detailed and prompt response. Now it makes sense why I do not get that pop up information. 1) Is it possible to add the information about kmers in the log or pop-up regardless if the sequences contain N or other unexpected characters? 2) Also, I see that the number of kmers that is reported in the pop-up is not consistent with the formula above. For example, I tried EssentialGenes.fa from the data directory and added two N in the sequence so that I could get a pop-up and I see that there are 571706 kmers with K=5 but the sequence length is 574002 which should give us 574002+1-5=573998.

I calculated the length of the EssentialGenes.fa as follows:

grep -v ">"  EssentialGenes.fa > EssentialGenes_reformatted.fa

bioawk -c fastx '{print $name,length($seq)}' < EssentialGenes_reformatted.fa 

Does it discount kmers based on some criteria? Sorry, I am not getting this total correctly. Again, thanks for the response and for making VizBin.

Ashish

[Screenshot taken 2016-06-20 at 9:18:58 am]

claczny commented 8 years ago

Hi Ashish,

1) Is it possible to add the information about kmers in the log or pop-up regardless if the sequences contain N or other unexpected characters?

frankly speaking, this feature is meant as a small reminder that something might need consideration within the data. It is not meant to serve as a proper/full-scale validity check, which, in any case, should occur prior to binning the data. If there is no pop-up, all the better ;)

2) Also, I see that the number of kmers that is reported in the pop-up is not consistent with the formula above. For example, I tried EssentialGenes.fa from the data directory and added two N in the sequence so that I could get a pop-up and I see that there are 571706 kmers with K=5 but the sequence length is 574002 which should give us 574002+1-5=573998.

This calculation would be correct if we had a single sequence of that length. However, we have 574 separate sequences in EssentialGenes.fa. Hence, the formula should be

(1000+1-5)*574

which equals 571704, since every sequence is 1,000 bp long. If one were to add 2 N's, each lengthening a sequence by one position and thus adding one k-mer window, the number would be 571706, i.e., as displayed by VizBin. So everything is in order there.
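The arithmetic above can be checked with a short Python sketch that applies the length + 1 - k formula to each sequence separately and sums the results. The function name `total_kmers` is hypothetical, and placing both N's in the same sequence is an assumption; the total is the same if they go into two different sequences.

```python
def total_kmers(lengths, k):
    """Sum of (length - k + 1) over all sequences, skipping sequences shorter than k."""
    return sum(max(n - k + 1, 0) for n in lengths)


k = 5

# 574 sequences of 1,000 bp each, as in EssentialGenes.fa.
original = [1000] * 574
print(total_kmers(original, k))  # 571704

# Assumption: both N's were inserted into the same sequence (1,002 bp).
with_two_n = [1002] + [1000] * 573
print(total_kmers(with_two_n, k))  # 571706

# Treating the file as one long sequence overcounts by (k - 1) * (574 - 1).
print(sum(with_two_n) + 1 - k)  # 573998
```

Equivalently, for n sequences with cumulative length T, the total is T - n * (k - 1), which gives 574002 - 574 * 4 = 571706 here.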

Consequently, I consider this issue closed. Feel free to continue posting questions/comments if they are related to this issue. Otherwise, please open a new ticket describing the situation at hand.

Let me know if there is anything else I can be of help with.

Best,

Cedric