FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
51 stars 19 forks source link

Question about assigning 'unassignable' reads #79

Open RobertBaird opened 4 months ago

RobertBaird commented 4 months ago

Hi, I have a question about how reads are assigned as conflicting, unassignable etc.

I've been trying to use SNPsplit on noisy PacBio long reads to help with haplotype-resolved assembly (which I know is a little optimistic), and I'm losing a lot of reads. Here's the allele-tagging report:

Processed 7590179 read alignments in total
Reads were unaligned and hence skipped: 264205 (3.48%)
5789031 reads were unassignable (76.27%)
111956 reads were specific for genome 1 (1.48%)
69824 reads were specific for genome 2 (0.92%)
155723 reads did not contain one of the expected bases at known SNP positions (2.05%)
611385 contained conflicting allele-specific SNPs (8.05%)

I'm expecting around 5% of reads to be assigned to each genome, so it looks like I'm losing a lot due to reads either not containing an expected base at known SNP positions (presumably because of single-base or insertion sequencing errors), or reads containing conflicting SNPs (presumably mainly due to sequencing error deletions of one of the N-masked sites).

Is there a way to tune these features so that e.g. a read can still be assigned to genome 1 if it contains a genome 1-specific SNP, but has something other than the genome 2 SNP at another N-masked position (like an obvious sequencing error)?

Thanks!

FelixKrueger commented 4 months ago

Hmm, this is a tricky question. Frankly, I don't think SNPsplit is suited for this kind of scenario (long, noisy reads) and there is little if any functionality currently to allow fine tuning in it's current form.

The majority of reads seem to get assigned to the 'conflicting' category. Maybe it is worth visualising this by using the --conflicting option, and then looking at the reads in IGV? Maybe this can yield some clues if some additional post-processing could be applied? Apologies if isn't very helpful, but the intended use-case really is short reads in combination with fully characterised mouse strains. It feels like the long read field should have some better solutions for allele-specific assignment?

RobertBaird commented 3 months ago

Hey Felix, thanks for the response. It seems like I'm losing most of my reads to deletions of the N-masked sites, which arise from sequencing errors. I agree that it's definitely not suited to long noisy reads - maybe HiFi reads would work though. Unfortunately there doesn't seem to be much out there for allele-specific sorting of reads.

Cheers!