Parsoa / SVDSS

Improved structural variant discovery in accurate long reads using sample-specific strings (SFS)
MIT License

detecting SFSs from short-read unitigs #4

Open dcopetti opened 3 years ago

dcopetti commented 3 years ago

Hello, I wonder whether the method for detecting SFSs could be applied to other highly accurate datasets, such as unitigs built from Illumina reads (e.g. PE, 2x150 bp, 500 bp insert size). In the preprint you mention using BCALM2 to produce unitigs (from k-mers, OK), but I wonder whether any kind of "long" (>500 bp?) sequence with very few errors would also yield SFSs, as the (corrected) HiFi reads do. Or does PingPong need read redundancy (i.e. HiFi read coverage) to detect SFSs? For example, could the reference be an assembly or a set of unitigs and the query be HiFi reads, or similar combinations? Thanks,

Dario

ldenti commented 3 years ago

Hi, thanks for your question.

We kept our implementation as general as possible: you can use it on any set of strings, as long as they are in FASTA/Q format (note that the argument is --fastq but you can also pass a FASTA). So for example you can index a set of unitigs and then look for SFSs in a read sample (or another set of unitigs).

A low error rate is not a must, but it helps in getting fewer specific strings (so results are more precise): if you are working on a read sample, any sequencing error may result in one or more SFSs.
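To see why a single sequencing error can yield more than one SFS, here is a naive sketch. It assumes an SFS is a substring of the read that is absent from the reference while all of its proper substrings occur in it. The function name `specific_strings` and the toy sequences are made up for illustration, reverse complements are ignored, and this is not how SVDSS computes them.

```python
def specific_strings(read, reference):
    """Naive search for substrings of `read` that are absent from `reference`
    and whose proper substrings all occur in `reference` (assumed definition)."""
    found = set()
    n = len(read)
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = read[i:j]
            if s not in reference:
                # s is the shortest absent substring starting at i, so s[:-1]
                # is present; s is minimal only if s[1:] is present as well.
                if s[1:] in reference:
                    found.add(s)
                break
    return found


reference = "ACGTACGGTCA"      # toy reference (hypothetical)
clean_read = "GTACGGT"         # error-free substring of the reference
noisy_read = "GTACTGT"         # same read with one substitution (G -> T)

print(specific_strings(clean_read, reference))  # empty set
print(specific_strings(noisy_read, reference))  # two strings: 'CT' and 'TG'
```

In this toy case the single substitution creates two specific strings, one ending at the erroneous base and one starting at it.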

Finally, you have to adjust the --cutoff parameter to your needs: if you search for SFSs in a sample against, for example, the index of the reference (or of a set of unitigs), you expect to see the same SFS multiple times in the sample (thanks to coverage). If you are instead looking for SFSs in a set of unitigs, you would expect to see each SFS fewer times (so you can set --cutoff 1).
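The idea behind a cutoff can be illustrated with a small sketch. It assumes the cutoff is the minimum number of exact occurrences a string needs in order to be kept; `filter_by_cutoff` and the example strings are hypothetical and are not part of SVDSS.

```python
from collections import Counter


def filter_by_cutoff(strings, cutoff):
    """Keep the strings seen at least `cutoff` times (exact matches only)."""
    counts = Counter(strings)
    return {s: n for s, n in counts.items() if n >= cutoff}


# Hypothetical strings reported from reads: coverage repeats the first one.
from_reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "GGATCCTA"]
print(filter_by_cutoff(from_reads, cutoff=2))  # {'ACGTTGCA': 3}

# Hypothetical strings reported from unitigs: each one is seen once,
# so only a cutoff of 1 keeps them.
from_unitigs = ["ACGTTGCA", "GGATCCTA"]
print(filter_by_cutoff(from_unitigs, cutoff=1))
```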

I hope this helps you.

Best, Luca

ldenti commented 2 years ago

Closing.

LYC-vio commented 1 year ago

Hi, @ldenti ,

Is the --cutoff parameter still in SVDSS? I cannot find it in the help output of the SVDSS v1.0.5 binary.

Thank you

ldenti commented 1 year ago

Hi @LYC-vio, no, the cutoff parameter is no longer there. We decided to remove it since it wasn't that useful and its behaviour was unpredictable. In v1.0.5 the search mode computes and outputs all the specific strings it can find, even those occurring only once.

The --cutoff parameter wasn't reliable because, as we realized, you cannot expect specific strings coming from the same locus to have the exact same sequence. For example, two specific strings (from two reads covering the same locus) may be shifted by just 1 base, so they are not counted as the same string. I believe that if you/we need some sort of counting of specific strings, we need to add a way to cluster the specific strings by similarity and then report only the clusters whose size is greater than the threshold. But this may be too application-dependent.
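A minimal sketch of what such clustering could look like, under several assumptions: similarity is measured as the longest common substring divided by the shorter string's length, clustering is greedy against the first member of each cluster, and a cluster is reported when it has at least `cutoff` members. The names `similarity`, `cluster` and `supported_clusters`, the 0.8 threshold and the example strings are all made up for illustration; none of this exists in SVDSS.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Longest common substring length over the shorter string's length."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


def cluster(strings, min_similarity=0.8):
    """Greedy clustering: join the first cluster whose first member is similar."""
    clusters = []
    for s in strings:
        for c in clusters:
            if similarity(s, c[0]) >= min_similarity:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def supported_clusters(strings, cutoff, min_similarity=0.8):
    """Report only the clusters with at least `cutoff` members."""
    return [c for c in cluster(strings, min_similarity) if len(c) >= cutoff]


# The second string is the first one shifted by 1 base, so exact counting
# would treat them as different strings.
strings = ["ACGTTGCAAT", "CGTTGCAATG", "ACGTTGCAAT", "TTTTGGGGCCCC"]
print(supported_clusters(strings, cutoff=2))
```

With these inputs the shifted string lands in the same cluster as the two identical ones, while the unrelated string forms its own cluster and is dropped at `cutoff=2`. How similarity should be defined is exactly the application-dependent part.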

What do you need the --cutoff parameter for? Maybe we can come up with a solution.