IndexThePlanet / LoganSearch

A k-mer search engine for all Sequence Read Archive public accessions
https://logan-search.org
18 stars 0 forks source link

What caused these false positives in a search with a very short query? #4

Closed Laura-Alex closed 4 days ago

Laura-Alex commented 5 days ago

Hello! It's me again. Trying to understand what went wrong with my search (false positives?) and, in the process, figure out a bit more of the theory.

Search ID: https://logan-search.org/dashboard/kmviz-433129fb-c201-40a8-9a79-47696dececa7

This time I ran LOGANSearch with a 32bp query. It comes from a conserved area which seems to be specific to my gene of interest in a certain organism. (GGCGTCATTCCAGACGATGTGACCTTGGCAAC)

I figured that there would be 2 31-mers for this query, and 3 possible options of k-mer coverage (1 - both k-mers are present; 0.5 - on either end, there is a single mismatch; 0 - everything else that gets thrown away). That was not quite the case, because my main kmer coverages appear to be 1 and 0.667, with a minority going from 0.512 to 0.56, but I figured it was close enough and could start by focusing on the '1s' anyway.

I assumed that 'location_avg_coverage' would be the average of kmer coverages for hits from one location (location in term of SRR/ERR dataset). So I decided to focus on the results where kmer_coverage = 1 AND location_avg_coverage = 1. Within those, I started with the samples that had the highest location_count.

SRR6130088 - count 30 SRR26143929 - count 11 ERR3514425 - count 5 ...

Now my issue is, I cannot find my sequence in those datasets. (The Contigs/Unitigs search didn't return any results, but then it almost never worked for me anyway). I ran BLAST searches through the NCBI website (with parameters slightly adjusted for a short sequence), and I also used the 'Filter' option in the SRA Run Browser to look for the sequence of interest/subsets of the sequence of interest in those datasets.

I guess it is not impossible that unitigs were built which carry information found in no single read, but that appears unlikely as the first two datasets are 16S amplicon sequences (and the third is an amplicon dataset of an unspecified gene). So there should not be much variation to start with. Maybe I am missing something obvious, but I would appreciate your help in understanding what went wrong, so I can better grasp the limits of the software and best practices to follow.

pierrepeterlongo commented 4 days ago

Hello @Laura-Alex

It seems that the session you mentioned https://logan-search.org/dashboard/kmviz-433129fb-c201-40a8-9a79-47696dececa7 was not made with query GGCGTCATTCCAGACGATGTGACCTTGGCAAC.

This explains the discrepancy with the obtained results.

Also, we use bloom filters. So the false positive rate is non-nul. With the query you indicated we obtain 1405 results (https://logan-search.org/dashboard/kmviz-bd39515e-db32-4a7d-9a7f-df7b0b308487) This is roughly 0.005% of the indexed samples. This is expected and in line with the expected false-positive rate.

Your question highlights the fact that this search index faces limitations while querying small sequences.