edawson / gfakluge

A C++ library and utilities for manipulating the Graphical Fragment Assembly format.
http://edawson.github.io/gfakluge/
MIT License
51 stars 20 forks source link

fillseq: Output is not deterministic #37

Closed sjackman closed 6 years ago

sjackman commented 6 years ago

Sometimes it adds the sequence, and sometimes it leaves it as *. The following two runs were run one after the next with no change to the files. The GFA and FASTA file are produced by Canu. I'm able to reproduce this issue on both Linux and macOS.

$ gfak fillseq -f foo.fa foo.gfa | head -c50 | head -n2
H   VN:Z:1.0
S   tig00000001 CCTGGTGGGACTGAACCAATTGAAT
$ gfak fillseq -f foo.fa foo.gfa | head -c50 | head -n2
H   VN:Z:1.0
S   tig00000001 *   LN:i:26723
edawson commented 6 years ago

I get something like this on my machine now, too, although it's an intermittent segfault rather than just a failure to place the sequences. fillseq seems very sensitive to pointer weather. My guess is I've leaked some memory somewhere.

Sorry for the false hope! I'm trying to track the issue down today.

edawson commented 6 years ago

I think this is due to me forgetting to initialize name_len to 0: https://github.com/edawson/gfakluge/blob/6c533259c51c4f5d3a7ce11817f9037548ac1eb7/src/tinyFA/tinyfa.hpp#L54

I'm going to fix, test some more, push and cut a fresh release.

Any chance you could share a portion of your foo* files? I have some small test cases but always good to hoard new ones.

sjackman commented 6 years ago

Thanks for the bug fix, Eric.

g++ -Wall is good at catching variables used before initialization. I've opened PR #38 to enable -Wall.

sjackman commented 6 years ago

We don't appear to be entirely out of the fire just yet. With this fix, all thirty sequences are getting filled. That's good. The sequences output by gfakluge fillseq are consistently shorter than the sequences in the FASTA file.

ID           FAI     Fillseq  Difference
tig00000001  26723   26459    -264
tig00000002  26606   26343    -263
tig00000003  419498  415345   -4153
tig00000004  353469  349970   -3499
tig00000005  580268  574523   -5745
tig00000006  35814   35460    -354
tig00000007  555817  550314   -5503
tig00000008  27340   27070    -270
tig00000009  370719  367049   -3670
tig00000010  109905  108817   -1088
…
sjackman commented 6 years ago

I've sent you the Canu unitigs FASTA and GFA file in an e-mail.

The following awk script will add the sequences to the GFA file, which you can use to confirm that gfakluge fillseq is producing identical output.

seqtk seq canu.unitigs.fasta | gawk -vOFS='\t' 'ARGIND == 1 { id = substr($1, 2); getline; x[id] = $1; next } $1 == "S" { $3 = x[$2] } 1' - canu.unitigs.gfa >canu.unitigs.seq.gfa
edawson commented 6 years ago

I chased this down to where sequences are read from FASTA files. I forgot to account for newline characters, meaning a character of sequence was lost for every additional line in a record.

This should be fixed now. I'm going to test on your Canu contigs.

sjackman commented 6 years ago

I'm still seeing differences in sequence length with revision 960103871bc033a670e9f3c40a1077a0814d9fde.

ID           FAI     Fillseq  Difference
tig00000001  26723   26721    -2
tig00000002  26606   26604    -2
tig00000003  419498  419458   -40
tig00000004  353469  353435   -34
tig00000005  580268  580212   -56
tig00000006  35814   35811    -3
tig00000007  555817  555764   -53
tig00000008  27340   27338    -2
tig00000009  370719  370684   -35
tig00000010  109905  109896   -9
…
edawson commented 6 years ago

Well, closer. Probably some variation on the same error. I'll keep plugging at a fix.

edawson commented 6 years ago

Another error was found - I forgot to terminate name strings with a null char, so we missed keys in the map when doing lookup. This won't fix the missing sequence bits, which still boggles me a bit. Hoping to chase it down this week.

sjackman commented 6 years ago

Thanks for plugging away at this issue, Eric.

edawson commented 6 years ago

@sjackman I think this is finally tracked down and fixed in 0.3.1. There were numerous bugs that got squashed.

sjackman commented 6 years ago

Thanks, Eric!

due to weird pointer weather

😂

I've updated the formula in Brewsci/bio. https://github.com/brewsci/homebrew-bio/pull/429

sjackman commented 6 years ago
❯❯❯ diff -s orig.gfa fillseq.gfa
Files orig.gfa and fillseq.gfa are identical

❤️ 🍾