bluenote-1577 / flopp

flopp is a software package for single individual haplotype phasing of polyploid organisms from long read sequencing.

Phasing of a diploid ONT assembly #1

Closed fergsc closed 2 years ago

fergsc commented 2 years ago

Hi, I am interested in using flopp to phase some diploid ONT plant genomes I have, or to identify and remove the least contiguous haplotype if I don't have two complete haplotypes. I am unsure how the output would be interpreted to perform this task.

To do this, would I:

  1. Align my assembly reads to my genome (bam)
  2. Call variants (vcf) and filter out variants that are sequencing errors (based on allele depth).
  3. Run flopp on the genome with the produced bam and vcf files.
  4. Separate the two haplotypes / remove the haplotype less represented within the contig set.
  5. thanks.
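
The allele-depth filter in step 2 could be sketched as below. This is a minimal sketch, assuming a single-sample VCF whose FORMAT column carries an `AD` field (reference and alternate read depths); some callers report depths elsewhere. The function names and both thresholds are illustrative assumptions, not recommendations from flopp.

```python
def passes_allele_depth(vcf_line, min_alt_reads=3, min_alt_fraction=0.2):
    """Return True if a VCF data line has enough alternate-allele support.

    Assumes one sample column and an AD entry in FORMAT
    (comma-separated depths: ref, alt1, ...). Thresholds are illustrative.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # no sample column, so depth cannot be evaluated
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    if "AD" not in format_keys:
        return False
    ad_index = format_keys.index("AD")
    if ad_index >= len(sample_values):
        return False
    try:
        depths = [int(x) for x in sample_values[ad_index].split(",")]
    except ValueError:
        return False  # missing values such as "."
    if len(depths) < 2 or sum(depths) == 0:
        return False
    alt_depth = max(depths[1:])
    return (alt_depth >= min_alt_reads
            and alt_depth / sum(depths) >= min_alt_fraction)


def filter_vcf_lines(lines, **thresholds):
    """Keep header lines plus data lines that pass the depth filter."""
    return [ln for ln in lines
            if ln.startswith("#") or passes_allele_depth(ln, **thresholds)]
```

Variants supported by only one or two reads are dropped as likely sequencing errors; the cutoffs would need tuning to the actual read depth.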

bluenote-1577 commented 2 years ago

Hi Scott,

FYI: while flopp can definitely be used for phasing diploid genomes and should work, we have not tested flopp extensively on diploid genomes.

Your general process looks fine. To separate haplotypes or remove the haplotype less represented, what I would do is to use the -P (directory) option for flopp, so something like flopp -p 2 -b yourbam.bam -v yourvcf.vcf -o out_file.txt -P directory_name.

The -P option writes to the specified directory a set of text files, corresponding to each contig, that tell you which read is in which haplotype. The output from the -o option may give you some coverage information about how reasonable the phasing is. After that, I would reassemble the contigs using the reads output for each haplotype separately.
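
For scripting, the invocation above could be assembled from Python as in this sketch. It uses only the flags mentioned in this thread (-p, -b, -v, -o, -P); the helper names and file names are placeholders, and `run_flopp` assumes a `flopp` binary is on the PATH.

```python
import subprocess


def build_flopp_command(bam, vcf, out_file, partition_dir, ploidy=2):
    """Build the flopp argument list using the flags quoted in this thread."""
    return ["flopp", "-p", str(ploidy), "-b", bam, "-v", vcf,
            "-o", out_file, "-P", partition_dir]


def run_flopp(bam, vcf, out_file, partition_dir, ploidy=2):
    """Run flopp; raises CalledProcessError on a non-zero exit status."""
    cmd = build_flopp_command(bam, vcf, out_file, partition_dir, ploidy)
    subprocess.run(cmd, check=True)


# Print the command without executing it.
print(" ".join(build_flopp_command(
    "yourbam.bam", "yourvcf.vcf", "out_file.txt", "directory_name")))
```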

Here are some pointers to make sure your phasing looks reasonable:

1) flopp is tested with long reads (PacBio/Oxford Nanopore) only.

2) Since you are operating on a diploid genome, if you have long reads, I would recommend longshot for calling variants: https://github.com/pjedge/longshot. See the "VCF requires contig headers" section if using longshot, as the vcf output needs additional formatting.

3) If you have issues with flopp, I would suggest either WhatsHap or HapCUT2, which are specialized for diploid phasing.
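
For pointer 2, the extra formatting amounts to adding `##contig` header lines to the VCF. The section named there is the authoritative reference; what follows is only a sketch, assuming contig names and lengths are available from a `samtools faidx` index (`.fai`), whose first two tab-separated columns are name and length. All function names are hypothetical.

```python
def read_fai(fai_lines):
    """Parse (name, length) pairs from the lines of a .fai index."""
    contigs = []
    for line in fai_lines:
        if not line.strip():
            continue
        name, length = line.split("\t")[:2]
        contigs.append((name, int(length)))
    return contigs


def existing_contig_ids(vcf_lines):
    """Collect contig IDs already declared in the VCF header."""
    prefix = "##contig=<ID="
    ids = set()
    for line in vcf_lines:
        if line.startswith(prefix):
            rest = line[len(prefix):].rstrip("\n").rstrip(">")
            ids.add(rest.split(",")[0])
    return ids


def add_contig_headers(vcf_lines, contigs):
    """Insert missing ##contig lines just before the #CHROM header line.

    vcf_lines must be a list (it is scanned twice).
    """
    declared = existing_contig_ids(vcf_lines)
    out = []
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            for name, length in contigs:
                if name not in declared:
                    out.append(f"##contig=<ID={name},length={length}>\n")
        out.append(line)
    return out
```

Contigs that already have a header line are skipped, so running the function twice leaves the file unchanged.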

Let me know if you have any issues.

Jim

fergsc commented 2 years ago

Thanks for the very quick response Jim.

Sounds like I may have misinterpreted the purpose of flopp. flopp is made to pool reads, using a reference, into separate haplotypes? Using it in the manner I was suggesting may involve a little too much work when compared to the existing tools (WhatsHap, HapCUT2).

This raises another possibility: using flopp and a closely related genome to separate out reads pre-assembly. I have come across other tools that do this using k-mer distributions. But if a close assembly exists, flopp sounds to me like it would work better.

bluenote-1577 commented 2 years ago

Hi Scott,

Maybe I misunderstood your intentions and phrased something weirdly; flopp should output pretty much the same information as WhatsHap or HapCUT2, in that the primary output is a sequence of alleles (i.e. SNPs for flopp) for each of the chromosomes that represents the haplotypes. This is the output of the -o option.

In addition to the above output, many phasers (flopp, WhatsHap) allow the user to partition reads into separate haplotypes. The reason this is useful is that the sequence of alleles may be missing some variants e.g. structural variants, so some people may choose to do reassembly or use the separated reads for visualizing data (using IGV or something like that).

Assuming you're assembling a diploid genome using ONT data, for many contigs, barring large structural variations, you will probably get some collapsed/mashed-up version of the two haplotypes. For these contigs, which are a mix of the two haplotypes, I initially took "separating the haplotypes" to mean obtaining two assembled versions of the contig, one for each haplotype. If instead you just want the sequence of SNPs on the contig corresponding to certain haplotypes, the -o output is what you want.

Could you maybe explain a little more what you meant by "separating the two haplotypes"?

The use-case you mentioned is absolutely possible and a very nice one.

fergsc commented 2 years ago

We are talking about the same thing I think, just going about it differently. I shall try to explain what I am hoping to do.

In older assemblies (last year) we generally assembled 1.5x genomes with a lot of phase switching and small contigs. Small contigs generally represented bubbles (indels, translocations, etc.) and other features of the assembly graph, and were often duplicated genomic regions or different SVs on the parental chromosomes. The most common way of dealing with this was to align contigs against each other to identify duplicated regions, keep the longest contig and remove the shortest.
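
The clean-up strategy described above (align contigs against each other, keep the longer one, drop the shorter one) could look roughly like this. A sketch only: the overlap records, the 0.8 coverage cutoff and all names are assumptions made for illustration, not the output format of any particular aligner.

```python
def contigs_to_drop(lengths, overlaps, min_covered=0.8):
    """Pick the shorter contig of each largely duplicated pair.

    lengths:  dict mapping contig name -> length in bp
    overlaps: iterable of (contig_a, contig_b, aligned_bp), where
              aligned_bp is the number of aligned bases between the pair
    A pair counts as duplicated when aligned_bp covers at least
    min_covered of the shorter contig (an illustrative cutoff).
    """
    drop = set()
    for a, b, aligned_bp in overlaps:
        if a in drop or b in drop:
            continue  # one partner is already removed; skip this pair
        # Sort by (length, name) so that ties are broken deterministically.
        shorter, _longer = sorted((a, b), key=lambda c: (lengths[c], c))
        if aligned_bp / lengths[shorter] >= min_covered:
            drop.add(shorter)
    return drop
```

As the rest of this comment explains, this length-based rule stops being appropriate once both haplotypes assemble into long contigs.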

With newer and better ONT assemblies I am getting contigs with a minimum size of 30-40 Kbp and an assembled genome size of ~2x. This is due to more accurate basecalling and much longer and deeper reads going into assembly. My belief/hope is that these long contigs represent two fully assembled parental haplotypes. Now I am hoping to bin these contigs into parental genome bins, not simply long and short.

I came across flopp and was hoping I could use it to bin my contigs based on haplotype information. Binning based on alignment and keeping the longest and more "contiguous" was a simple and efficient fix to the 1.5x genome problem. But it seems inappropriate when I am assembling genomes with long contigs and getting 2x assembled.

Hope this makes sense, and thanks for the discussion.

bluenote-1577 commented 2 years ago

I see, thanks for explaining!

Software like flopp is meant for deducing haplotypes for a single collapsed contig rather than binning contigs, so it is probably not what you want here.

It sounds like what you want is something like genome scaffolding, which from my outdated knowledge of de novo assembly seems to be mostly done by Hi-C nowadays. Here is a Hi-C polyploid scaffolder that I've seen mentioned: https://github.com/tangerzhang/ALLHiC/wiki.

There are long-read scaffolders out there, but it's not obvious if they would be of any help since I am not sure how haplotype-sensitive these are.

If there are collapsed regions in your assembly, then in theory a phaser like flopp can bin phased (i.e. haplotype-representative) contigs by looking at alignments of long reads from collapsed regions to the phased regions and deducing which haplotype each long read came from (since flopp outputs which haplotype the long reads are from). So you can link phased contigs across collapsed regions by looking at these read alignments. However, this is not implemented. Perhaps someone will come up with software which can do phasing, scaffolding, and binning all at the same time in the future :).
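
Since this linking step is not implemented in flopp, the following is purely a hypothetical sketch of the voting idea described above. It assumes two inputs that would have to be prepared separately: a read-to-haplotype assignment for a collapsed region (such as the -P output provides, though its file format is not reproduced here) and, for each of those reads, the phased contig it also aligns to. All names and the vote threshold are illustrative.

```python
from collections import Counter, defaultdict


def assign_contigs_to_haplotypes(read_haplotype, read_contig, min_votes=2):
    """Assign each phased contig to the haplotype most of its reads carry.

    read_haplotype: dict read name -> haplotype label in the collapsed region
    read_contig:    dict read name -> phased contig the read also aligns to
    Contigs with fewer than min_votes supporting reads, or with a tie
    between the top two haplotypes, are left unassigned.
    """
    votes = defaultdict(Counter)
    for read, haplotype in read_haplotype.items():
        contig = read_contig.get(read)
        if contig is not None:
            votes[contig][haplotype] += 1

    assignment = {}
    for contig, counts in votes.items():
        ranked = counts.most_common(2)
        best_haplotype, best_votes = ranked[0]
        runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
        if best_votes >= min_votes and best_votes > runner_up_votes:
            assignment[contig] = best_haplotype
    return assignment
```

Leaving tied or weakly supported contigs unassigned is a deliberate choice here, since a wrong link would propagate a phase switch across the collapsed region.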

fergsc commented 2 years ago

Thanks for the discussion, it was helpful.

I can see uses for a tool such as flopp: assessing the level and location of haplotype switching within contigs, and helping to better understand how this affects SV and SNP calling.
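
Locating switches along a contig could be sketched as follows. This is a hypothetical illustration, assuming each variant position on the contig has already been labelled with the haplotype it matches (how those labels are derived from a phaser's output is not shown); `None` marks an unphased position.

```python
def find_switch_points(positions, labels):
    """Return (left_pos, right_pos) intervals where the haplotype label flips.

    positions: variant positions along one contig, in ascending order
    labels:    haplotype label per position; None means unphased and is skipped
    """
    switches = []
    previous_label = None
    previous_position = None
    for position, label in zip(positions, labels):
        if label is None:
            continue
        if previous_label is not None and label != previous_label:
            switches.append((previous_position, position))
        previous_label, previous_position = label, position
    return switches
```

The number of intervals gives the level of switching, and the intervals themselves bracket where each switch happened.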

Hi-C is another ball of issues that I am trying to work through.

Thanks.