HingeAssembler / HINGE

Software accompanying "HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution"
http://genome.cshlp.org/content/27/5/747.full.pdf+html?sid=39918b0d-7a7d-4a12-b720-9238834902fd
Other
64 stars 9 forks source link

How do I deal with genome assemblies that are 2x the correct size? #148

Closed SchwarzEM closed 6 years ago

SchwarzEM commented 6 years ago

I've run HINGE on two data sets so far: the example data set for E. coli, and a set of PacBio reads for C. elegans. In both cases, I've gotten an assembly FASTA file that was almost exactly 2x the size of the actual haploid genome of the organism being assembled.

For E. coli this was not a big deal, because the total assembly sequence (ecoli.consensus.fasta) had two contigs of 4,650,727 nt and 4,649,782 nt, and it could be easily discerned that these were likely to be one-contig-per-genome alternative assembly products.

For C. elegans, however, the 207.5 Mb assembly (2x the correct size) has 446 contigs! How, exactly, am I supposed to be able to sort out which 50% of these contigs are the optimal selection for a 1x-genome sequence set?

govinda-kamath commented 6 years ago

We maintain the invariant that the even contigs and the odd contigs form the two (reverse complemented) assemblies.

This script should do the job for you.

SchwarzEM commented 6 years ago

For whatever reason, get_single_strand.py would only give me Consensus0. However, I coded a Perl script that actually does give me all of the even-numbered contigs. Usage:

cat XXX.consensus.fasta | ./get_single_hinge_strand.pl 1>XXX.assembly.fa 2>test1.err ;

Perl code:

#!/usr/bin/env perl
# get_single_hinge_strand.pl -- Erich Schwarz <ems394@cornell.edu>, 3/8/2018.
use strict;
use warnings;
use autodie;
my $print = 0;;
while (my $input = <>) {
    chomp $input;
    # make decision whether to be in printing mode or not
    if ( $input =~ /\A [>] \S+ (\d+) \z/xms ) { 
        my $index = $1;
        $index    = ($index / 2);
        if ( $index == int($index) ) {
            $print = 1;
        }
        else {
            $print = 0;
        }
    }
    if ($print) {
        print "$input\n";
    }
}

Given that, I did indeed get the haploid assembly I needed. Thank you for the explanation!

Was I obtusely failing to see this explained in the documentation? If not, could it be added to the README?

jwasmuth commented 6 years ago

I'm a touch confused - for my test data (before selecting even numbered contigs), I end up with an odd number. The .consensus.fa is twice the expected size, which is encouraging. Not sure what's going on with the N%2=1 though. Ideas? Perhaps, I'm missing something. Thanks James

SchwarzEM commented 6 years ago

What test data set were you using?

When I ran HINGE on HINGE/demo/ecoli_demo, I got the result "ecoli.consensus.fasta", which consisted of 2 contigs (4,650,727 nt and 4,649,782 nt). So no N%2 = 1 problem. Did you have a different result on ecoli_demo, or were you using different test data?

jwasmuth commented 6 years ago

Hi Erich, I get a similar result as you with ecoli_nanopore. For my current problem, I was using my own data from Giardia, technically a tetraploid, but I assume that HINGE is going to ignore this fact. From the message from @govinda-kamath, I assumed that I'd get N%2==0 as well. Especially, since the assembly size is twice my expectation. James

jwasmuth commented 6 years ago

Me again, I had a look at the lengths of the contigs, as 'paired' contigs should have similar sizes. At Consensus140, this happens:

Consensus138 1312366 Consensus139 1311923 Consensus140 12228 Consensus141 130444 Consensus142 130342

I'm not sure if this is due to aggressive_pruning option I used. I'll check and report back.

James

govinda-kamath commented 6 years ago

It looks like you had an inverted repeat of 12k which was unbridged from the looks of this. The odd-even invariant we have breaks if there is an inverted repeat.

jwasmuth commented 6 years ago

Hi @govinda-kamath, Thanks for the reply. Is this inverted repeat recorded somewhere, which would allow me to skip the contig when putting together a haploid representation of the assembly in a fasta file? Thanks James

govinda-kamath commented 6 years ago

You could visualise the .gfa output using bandage to confirm this.

jwasmuth commented 6 years ago

Thanks, I was hoping for an automated way, for example in a log file.