SamStudio8 / gretel

An algorithm for recovering haplotypes from metagenomes
https://gretel.readthedocs.io/en/latest/
MIT License

[tracking] Ideas for future work #27

Open SamStudio8 opened 5 years ago

alexcritschristoph commented 5 years ago

:+1: to paired-ends... I'm currently engaged in a one-man boycott, on principle, against all programs that don't use all of the paired read information, but I really do want to try this out. pysam reads in paired reads with the same read_name, I believe, so what we do is track read-pair-to-genotype relationships that way.
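A minimal sketch of that pairing trick, assuming an indexed BAM at a placeholder path `aln.bam` and a placeholder contig name: mates share a `query_name` in pysam, so a dict keyed on it collects each pair.

```python
# Hypothetical sketch (not Gretel's code): group mates by shared query_name.
from collections import defaultdict

import pysam

pairs = defaultdict(list)  # query_name -> [AlignedSegment, ...]
with pysam.AlignmentFile("aln.bam", "rb") as bam:
    for read in bam.fetch("contig_1"):
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        pairs[read.query_name].append(read)

# Each value with two entries is a mate pair whose SNP observations
# can be merged into a single read-pair-to-genotype record.
for name, mates in pairs.items():
    if len(mates) == 2:
        pass  # merge the variant observations from both mates here
```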

SamStudio8 commented 5 years ago

@alexcritschristoph Yes, we really ought to be using paired-end information better, and yup, pysam does give information on pairs. Not using it was not a decision taken lightly, and the reasoning is mostly technical: firstly, we parse the BAM with pileup and not fetch, which means we're orientated by position, not by the actual reads; we found this was faster for our use case. Additionally, we divide the chosen region into windows, which are read in parallel, and pairs are not guaranteed to fall within the same window. The Hansel matrix is large, and only stores information about the reads in aggregate, rather than about any specific read. This means that with the current design it's hard to associate the reads in a pair without an expensive thread-safe structure that pairs them before they can be inserted into the matrix.
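To illustrate the constraint (a sketch, not Gretel's internals): a `pileup` walk visits columns in positional order, and once an observation has been folded into an aggregate co-occurrence count, the read identity needed to link it with a mate in another window is gone. The filename and SNP positions below are assumptions.

```python
# Illustrative sketch: position-oriented parsing into aggregate counts.
from collections import Counter

import pysam

snp_positions = {100, 150, 210}  # assumed pre-called SNP sites (0-based)
cooccur = Counter()              # ((pos_i, base_i), (pos_j, base_j)) -> count
seen_by_read = {}                # read name -> [(pos, base)] in this window

with pysam.AlignmentFile("aln.bam", "rb") as bam:
    for col in bam.pileup("contig_1", 0, 500):
        if col.reference_pos not in snp_positions:
            continue
        for pr in col.pileups:
            if pr.is_del or pr.is_refskip:
                continue
            base = pr.alignment.query_sequence[pr.query_position]
            seen = seen_by_read.setdefault(pr.alignment.query_name, [])
            for prev in seen:  # pairwise SNP co-occurrence on this read
                cooccur[(prev, (col.reference_pos, base))] += 1
            seen.append((col.reference_pos, base))

# Once counts land in `cooccur`, read identity is discarded: a mate aligned
# in a different window can no longer be associated with these observations.
```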

In addition to this, it's unclear (i.e. I don't know) whether the information will be useful, due to the way the algorithm works. When predicting the next SNP to choose, Gretel considers some L previous positions, where L is decided primarily by the average number of SNPs that a parsed read covers. If the pairs are sufficiently separated, it's likely that the automatically chosen value for L will be smaller than the distance between the end of the first read and the start of its mate anyway.
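A back-of-envelope illustration of that point, with made-up numbers (this is not Gretel's actual heuristic for choosing L):

```python
# Toy numbers: if L tracks SNPs-per-read, a distant mate falls outside
# the lookback window and contributes no context at prediction time.
snps_per_read = [4, 5, 3, 5, 4, 6]   # SNPs covered by each parsed read
L = round(sum(snps_per_read) / len(snps_per_read))  # 27/6 -> L == 4

snp_gap_between_mates = 12  # SNPs between end of read 1 and start of mate
if snp_gap_between_mates > L:
    print("mate lies outside the lookback window; pairing adds no context")
```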

Nevertheless, I was thinking about this recently and thought that perhaps the easiest way around the issue would be to pre-process the pairs and join them together (with a gap: Ns, or whatever), then align them to the pseudo-reference. They'd be parsed as single reads, but the paired information would find its way into the matrix (assuming the alignment step can cope with the large gap, which is another problem in itself, of course).
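The pre-processing could be as simple as the sketch below; the gap length and filenames are assumptions, and a real tool would presumably estimate the gap from the insert-size distribution.

```python
# Rough sketch of the join-the-pairs idea: glue each mate pair into one
# pseudo-read separated by a run of Ns, then realign as single reads.
GAP = "N" * 300  # assumed insert-size gap

def join_pair(name, r1_seq, r2_seq):
    """Return one FASTA record joining mate 1 and mate 2."""
    return f">{name}_joined\n{r1_seq}{GAP}{r2_seq}\n"

with open("joined_reads.fasta", "w") as out:
    out.write(join_pair("read0001", "ACGTACGT", "TTGACCA"))
```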

But to be clear; it's not that we haven't thought about it!

alexcritschristoph commented 5 years ago

Cool, we read in with pileup too and just create a lookup table for each read name; because pairs share names, pair information gets combined efficiently. We also work at a one-thread-per-contig level (it gets slow on big contigs, but still runs in pretty reasonable times at <1000x coverage) to get around the issue you describe. We construct a big SNV-graph for each contig (where nodes are SNVs like 131:A, connected to other nodes by edges weighted by the number of read pairs that contain both). I think it contains all of the information you'd want to do the fancy math on. We can generate that object very quickly - within a few minutes for most contigs.
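A minimal version of that SNV-graph construction might look like the following, with toy data standing in for the per-read-name lookup table (illustrative only):

```python
# Sketch of the SNV-graph: nodes are "pos:base" variants; an edge weight
# counts the read pairs that observe both endpoints.
from collections import Counter
from itertools import combinations

# Assumed input: read name -> variant observations merged across both mates.
lookup = {
    "read0001": [(131, "A"), (167, "T"), (402, "G")],
    "read0002": [(131, "A"), (402, "G")],
}

edges = Counter()
for variants in lookup.values():
    nodes = sorted(f"{pos}:{base}" for pos, base in variants)
    for a, b in combinations(nodes, 2):
        edges[(a, b)] += 1

print(edges[("131:A", "402:G")])  # 2 read pairs connect these two SNVs
```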

I am worried about paragraph 2 - shouldn't it be straightforward that, when reconstructing haplotypes, more than doubling your ground-truth haplotype lengths would be extremely useful (the result should be >2x as confident)? E.g. if you had reads connecting A -> B and B -> F, you can use the probabilities to calculate how often A goes with F, but it's just a probability. If instead you have pairs that span A to F, you simply know the true answer! I'd be really worried if a program didn't guarantee output of observed A-F haplotypes simply because there were B, C, D, E variants between them...
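A toy example of the difference, with made-up numbers: chaining per-step linkage probabilities compounds uncertainty (and assumes the steps are independent), whereas a pair spanning both sites observes the phase directly.

```python
# Chained inference: per-step probabilities estimated from short spans.
p_AB = 0.9   # P(B follows A), from reads spanning A..B
p_BF = 0.9   # P(F follows B), from reads spanning B..F
print(p_AB * p_BF)  # 0.81 -- the chained A->F estimate

# Direct evidence: read pairs covering both A and F observe the phase.
direct_AF = 40   # pairs carrying both the A and F alleles
total_AF = 42    # all pairs covering both sites
print(direct_AF / total_AF)  # ~0.95, grounded in direct observation
```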