bcgsc / abyss

:microscope: Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss
Other
307 stars 106 forks source link

Stats is different when ABYSS-P is used #19

Closed ikitayama closed 10 years ago

ikitayama commented 11 years ago

Here are the Statistics on the suggested test data set downloaded from the bcgsc site:

Non-MPI result is below:

n n:500 n:N50 min N80 N50 N20 max sum 817 108 38 509 642 905 1323 2260 93364 test-unitigs.fa 497 80 19 500 1418 2969 5099 10702 173468 test-contigs.fa 421 26 6 578 6166 10350 20266 27989 180109 test-scaffolds.fa

4-node parallelism provides a bit different stats:

n n:500 n:N50 min N80 N50 N20 max sum 817 108 38 509 642 905 1323 2260 93364 test-unitigs.fa 499 80 19 500 1407 2969 5099 10702 173017 test-contigs.fa 421 25 5 578 6166 10392 27989 30442 179959 test-scaffolds.fa

(2- and 8-node parallelism gave the same stats)

V1.3.5 and Apps are built on the x86 with the recommended packages.

sjackman commented 11 years ago

Thanks for the bug report, Itaru. That's not supposed to happen. I'm heading to ISMB in Berlin tomorrow, but I'll see if I can replicate this when I get back. Can you post the exact command lines that you used?

ikitayama commented 11 years ago

Thank you, Shaun - here it is:

abyss-pe (np=8) v=-v k=25 name=test in='test-data/reads1.fastq test-data/reads2.fastq'

as explained in the Manual.

sjackman commented 11 years ago

Thanks. Can you post your log files in a gist?

ikitayama commented 11 years ago

Pushing to gist doesn't work. I sent them to you via email. Would you take a look at it?

sjackman commented 11 years ago

I believe @traymond is looking into this issue.

ikitayama commented 11 years ago

Does @traymond also have the logs? In any event, the data set is small enough to reproduce the discrepancy, only takes a minute.

On Tuesday, July 30, 2013, Shaun Jackman wrote:

I believe @traymond https://github.com/traymond is looking into this issue.

— Reply to this email directly or view it on GitHubhttps://github.com/bcgsc/abyss/issues/19#issuecomment-21749987 .

sjackman commented 11 years ago

Ack, no, sorry. I've now forwarded him the logs.

ikitayama commented 11 years ago

Just curios how did you validate ABYSS-P against ABYSS back in the development days? Reads{1,2}.fastq are the data you used?

On Tuesday, July 30, 2013, Shaun Jackman wrote:

Ack, no, sorry. I've now forwarded him the logs.

— Reply to this email directly or view it on GitHubhttps://github.com/bcgsc/abyss/issues/19#issuecomment-21754143 .

ikitayama commented 11 years ago

Please note this is found in v1.3.5, ootb.

On Tuesday, July 30, 2013, Itaru Kitayama wrote:

Just curios how did you validate ABYSS-P against ABYSS back in the development days? Reads{1,2}.fastq are the data you used?

On Tuesday, July 30, 2013, Shaun Jackman wrote:

Ack, no, sorry. I've now forwarded him the logs.

— Reply to this email directly or view it on GitHubhttps://github.com/bcgsc/abyss/issues/19#issuecomment-21754143 .

sjackman commented 11 years ago

I don't recall which data set was used back in the day. Likely it was whichever data set I happened to be working on at the time. Our regression testing is rather ad hoc, I regret.

ikitayama commented 11 years ago

Thank you, Shaun. No problem.

On Tuesday, July 30, 2013, Shaun Jackman wrote:

I don't recall which data set was used back in the day. Likely it was whichever data set I happened to be working on at the time. Our regression testing is rather ad hoc, I regret.

— Reply to this email directly or view it on GitHubhttps://github.com/bcgsc/abyss/issues/19#issuecomment-21756261 .

traymond commented 11 years ago

FYI, I've been able to reproduce this issue.

ikitayama commented 11 years ago

Thank you for reproducing the issue, Tony.

On Tuesday, July 30, 2013, Tony Raymond wrote:

FYI, I've been able to reproduce this issue.

— Reply to this email directly or view it on GitHubhttps://github.com/bcgsc/abyss/issues/19#issuecomment-21757183 .

ikitayama commented 11 years ago

403 out of total 817 are the same assembled unitigs. @traymond, do you see the same?

sjackman commented 11 years ago

That's eerily around the 1/2 mark. Do the contigs differ only by orientation? Can you try running famd5 on the two assemblies?

ikitayama commented 11 years ago

Shaun - The -1.fa files md5 values are exactly the same. Contigs look different to me.

sjackman commented 11 years ago

If the md5 values are the same, the contig sequences are the same. They may differ in order and orientation, which is permitted.

ikitayama commented 11 years ago

so you sure do think the difference caused somewhere down the line?

sjackman commented 11 years ago

Yes. My guess is either abyss-filtergraph or PopBubbles is making a decision that is somehow dependent on the order or orientation of the contigs, even though it shouldn't. Either that or the result depends on the number of threads used. My guess is the former.

sjackman commented 11 years ago

Any progress on this, @traymond? Or thoughts?

traymond commented 11 years ago

No progress as of yet. I've been focused on getting the latest release done. I'll take a look now. I agree though, the likely culprit is filtergraph or popbubbles.

traymond commented 11 years ago

The assembly fasta files diverge starting at the -4.fa file. Therefore, the issue is in either the overlap algorithm or the alignments.

I'll look more at this tomorrow.

traymond commented 11 years ago

I'm quite certain that the output of DistanceEst is deterministic at this point, which would single out Overlap.

ikitayama commented 11 years ago

With the np=2, i.e., j=2, abyss-pe run configuration, I see the discrepancy. @sjackman it's the orientation as you suspect?

traymond commented 11 years ago

The bug is indeed related to the orientation of the contigs being different and how abyss-map chooses alignments when there are multiple best alignments at different query starts. Matches on the same strand as the read are given priority over matches to the reverse compliment of the read. Therefore, in some cases links are being made between different contigs even though the contig sequences are the same. So the difference in stats is caused by a select few alignments being different from one run to the next.

There is a right way to fix this, and a simple way to fix this. I want to report all alignments that have the best match length and with different sub-strings as separate alignments. Then create links between all combinations of the first read alignments and the second read alignments. Alternatively, I could just find a single arbitrary, yet deterministic, alignment. The former would be deterministic and not arbitrary.

ikitayama commented 11 years ago

@traymond than you for tracking the bug down. Is this independent of the aligner we choose?

sjackman commented 11 years ago

If there are two possible alignments with equal score, the mapping quality should be zero, and the alignment shouldn't be used by DistanceEst. abyss-map should, I feel, output exactly one alignment per read. ABYSS and ABYSS-P should probably output contigs in the same arbitrary orientation. That would be easy to fix.

traymond commented 11 years ago

This is happening because we have reads aligning with the same length to one side of a contig-contig overlap junction and the other. It isn't due to duplicate sequences. The mappings should not be mapq zero because they are good mappings either one we choose.

Regardless, it would be better to fix the PE phase so that it wasn't reliant on the order or orientation of the inputs. This is why I want to report all best mappings, as long as none are duplicates.

@ikitayama you can almost think of this as a "feature" of abyss-map. The paired-end algorithm is very sensitive to alignments, so if you switched aligners/alignments you'd have completely different stats again. Neither sets of files should have misassemblies due specifically to this.

sjackman commented 11 years ago

If there are two equally good mappings, and the choice is arbitrary, the mapping quality should be zero.

traymond commented 11 years ago

Why not report both alignments?

sjackman commented 11 years ago

When the read is split perfectly across two contigs, say 50 bp on one side and 50 bp on the other, it's a very special case of the more general issue of a read being spilt across two contigs with an uneven split (say 40 bp and 60 bp). In the latter case, we report only a single alignment. I'd rather treat the special case in the same fashion as the general case and report a single alignment.

traymond commented 11 years ago

But if we set mapq=0 then it isn't treating the two cases the same anyways.

sjackman commented 11 years ago

We could pick the 5'-most alignment on the read. That wouldn't depend on the orientation of the contig.

traymond commented 11 years ago

There's lots of ways of resolving this... I was suggesting something less arbitrary.

sjackman commented 11 years ago

Choosing the 5'-most alignment gives the longest-distance information (for FR-oriented reads) and still maintains the one-alignment-per-read condition. If we go to more than one alignment per read, I'd rather due it for all split reads, not just this one special case.

traymond commented 11 years ago

I'll see how much work it ends up being.

ikitayama commented 11 years ago

Splitting inputs seems to give slightly different results as well. Unitigs are intact.

ikitayama commented 11 years ago

@traymond Will the fix land on 1.3.7?

traymond commented 11 years ago

Some form of a fix for this will be included in 1.3.7, yes.

ikitayama commented 11 years ago

Good news! I'd like to give the test branch try. Can I pull from your repository?

traymond commented 11 years ago

Between vacation and other research related tasks with higher priority I haven't had a chance to implement anything for this yet. Typically, we don't give access to our test branch, but I can share the patch when I've done it.

traymond commented 10 years ago

@ikitayama are you able to see the commit I pushed to abyss-dev?

With it, the stats are identical, but there are some minor differences from serial to parallel ABYSS assemblies. The stats are actually worse, but considerably more deterministic than before. So I'm inclined to keep this patch.

traymond commented 10 years ago

Here's the patch anyways :) https://gist.github.com/traymond/eacfd7c27e6571534a17

ikitayama commented 10 years ago

Tony,

Thank you very much for the patch! I'll report back as soon as the test with the test read set is done on both x86 and sparc machines.

ikitayama commented 10 years ago

@traymond I am not able to see the abyss-dev branch presumably uploaded to bcgsc/abyss. I see only one master branch there.

ikitayama commented 10 years ago

@traymond I'm afraid the patch at the Gist wasn't cleanly applied to my branch at commit df48e4406ffb9353bc76fbebf63dd3f9f623bcfd, ie, I am at the 1.3.7 with no local patches. Would you mind recreating the patch?

sjackman commented 10 years ago

Fixed by 8319567