Kingsford-Group / splitsbt

GNU General Public License v3.0

Toy example request #9

Open lenksix opened 6 years ago

lenksix commented 6 years ago

Hi, I am trying to use the SSBT for a project. I have set up all the configuration, but I am still stuck at the query phase. I have tried many different ways of constructing the tree by changing the input files, but the query results still do not match: they are either wrong or there are none at all, which is very strange. The cutoff parameter also does not seem very effective: except in case 3, I have to set it very low (around 0.002) to get any results. The experiments I tried are:

  1. input files of length 5M, queries of length 100;
  2. input files of length 5K, queries of length 100;
  3. input files of length 100, queries of length 100.

So if it is possible to upload a small toy example, it would be very useful. Thank you in advance.

Bradsol commented 6 years ago

I suspect the problem is that you have repeatedly reconstructed the tree on the same set of input files. Because the construction step overwrites the files themselves, the files you are operating on are likely no longer the same and may not contain any meaningful data.
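One way to avoid rebuilding on already-overwritten files is to keep an untouched copy of the inputs and start every construction run from a fresh copy. The following is only a minimal sketch of that idea; the function name `fresh_working_copy` and the directory layout are hypothetical and not part of SSBT.

```python
import shutil
from pathlib import Path


def fresh_working_copy(pristine_dir, work_dir):
    """Replace work_dir with an untouched copy of pristine_dir and return its path.

    Assumption: all input files for one build live in a single directory
    (pristine_dir) that is never passed to the construction step directly.
    """
    pristine = Path(pristine_dir)
    work = Path(work_dir)
    if work.exists():
        # Discard whatever a previous construction run left behind.
        shutil.rmtree(work)
    shutil.copytree(pristine, work)
    return work
```

Calling this before each build and pointing the construction step at the returned directory means the originals are never modified, so a later rebuild starts from the same data as the first one.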

The parameters you describe don’t make a lot of sense either. What is the input file length? The number of short reads? The length of a short read? The size of the bloom filter? Accuracy of the SSBT also depends on the length of the query to some degree; performance may suffer with short queries, though 100 should be sufficient for reasonable recovery.

Your cutoff parameter also makes no sense in context. The cutoff parameter is part of the count command and describes the minimum number of kmers necessary to keep in the bloom filter; it should be an integer. If you mean the threshold theta (the percentage of the query that needs to match), 0.002 is virtually meaningless: 0.002*100 is 0.2, so any query that finds ONE kmer [at a potential 50% false positive rate] will return a hit. I can’t imagine this is the desired behavior.
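The arithmetic behind the theta threshold can be sketched as follows. This is an illustration under stated assumptions, not SSBT's actual code: it assumes a query of length L contains L - k + 1 overlapping kmers, and that a hit requires at least theta times that many matches, rounded up, with a minimum of one match. The function names are hypothetical, and k = 20 is the value mentioned elsewhere in this thread.

```python
import math


def kmers_in_query(query_length: int, k: int) -> int:
    """Number of overlapping k-mers in a sequence of the given length."""
    return max(query_length - k + 1, 0)


def min_matching_kmers(query_length: int, k: int, theta: float) -> int:
    """Smallest number of matching k-mers that reaches a fraction theta.

    Assumption: a hit means matches >= theta * total k-mers, and at least
    one k-mer must match for anything to be reported.
    """
    total = kmers_in_query(query_length, k)
    return max(math.ceil(theta * total), 1)


if __name__ == "__main__":
    # A 100-character query with k = 20 contains 81 k-mers.
    for theta in (0.002, 0.5, 0.9):
        needed = min_matching_kmers(100, 20, theta)
        print(f"theta={theta}: at least {needed} of 81 k-mers must match")
```

Under these assumptions, theta = 0.002 requires only a single matching kmer, which is why such a low threshold reports a hit for almost any query.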

The SBT website has a fully functioning index; I will look into adding a small example in the next commit.

lenksix commented 6 years ago

Thank you very much for the reply. I understand that the description I gave was not accurate, and I apologize for this. By input file length I meant the size of the bloom filter (which I set to the number of unique kmers in each experiment, with k=20, as suggested in the guide). The queries and reads are 100 characters long.
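For reference, the number of unique kmers in a set of reads can be estimated with a short script like the one below. It is a minimal sketch and not the procedure from the guide: it assumes plain forward-strand kmers held in memory, whereas dedicated kmer counters may treat a kmer and its reverse complement as one (canonical) kmer and so report a different number. The function name `unique_kmers` is hypothetical.

```python
def unique_kmers(reads, k=20):
    """Return the set of distinct k-mers found across all reads.

    Assumption: forward-strand k-mers only, no canonicalisation and no
    filtering of ambiguous bases such as N.
    """
    seen = set()
    for read in reads:
        read = read.upper()
        # A read shorter than k contributes no k-mers (empty range).
        for i in range(len(read) - k + 1):
            seen.add(read[i:i + k])
    return seen


if __name__ == "__main__":
    toy_reads = ["ACGTACGTAC", "CGTACGTACG"]
    print(len(unique_kmers(toy_reads, k=4)))
```

The size of the returned set is the value that would be used as the bloom filter size under the approach described above.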

By cutoff I meant theta, the threshold value. Its behavior seemed very strange to me as well, since I had read the article, and when I could not reach the expected results I assumed I was doing something wrong.

I have already seen the example on the website, but it is quite large, which is why I asked whether a small toy example would be possible. Thank you for saying that you will add one; I will wait for it so that I can understand the errors I made.