Open RobertBaird opened 4 months ago
Hmm, this is a tricky question. Frankly, I don't think SNPsplit is suited for this kind of scenario (long, noisy reads) and there is little if any functionality currently to allow fine tuning in it's current form.
The majority of reads seem to get assigned to the 'conflicting' category. Maybe it is worth visualising this by using the --conflicting
option, and then looking at the reads in IGV? Maybe this can yield some clues if some additional post-processing could be applied?
Apologies if isn't very helpful, but the intended use-case really is short reads in combination with fully characterised mouse strains. It feels like the long read field should have some better solutions for allele-specific assignment?
Hey Felix, thanks for the response. It seems like I'm losing most of my reads to deletions of the N-masked sites, which arise from sequencing errors. I agree that it's definitely not suited to long noisy reads - maybe HiFi reads would work though. Unfortunately there doesn't seem to be much out there for allele-specific sorting of reads.
Cheers!
Hi, I have a question about how reads are assigned as conflicting, unassignable etc.
I've been trying to use SNPsplit on noisy PacBio long reads to help with haplotype-resolved assembly (which I know is a little optimistic), and I'm losing a lot of reads. Here's the allele-tagging report:
I'm expecting around 5% of reads to be assigned to each genome, so it looks like I'm losing a lot due to reads either not containing an expected base at known SNP positions (presumably because of single-base or insertion sequencing errors), or reads containing conflicting SNPs (presumably mainly due to sequencing error deletions of one of the N-masked sites).
Is there a way to tune these features so that e.g. a read can still be assigned to genome 1 if it contains a genome 1-specific SNP, but has something other than the genome 2 SNP at another N-masked position (like an obvious sequencing error)?
Thanks!