Closed sjackman closed 6 years ago
I get something like this on my machine now, too, although it's an intermittent segfault rather than just a failure to place the sequences. fillseq seems very sensitive to pointer weather. My guess is I've leaked some memory somewhere.
Sorry for the false hope! I'm trying to track the issue down today.
I think this is due to me forgetting to initialize name_len to 0: https://github.com/edawson/gfakluge/blob/6c533259c51c4f5d3a7ce11817f9037548ac1eb7/src/tinyFA/tinyfa.hpp#L54
I'm going to fix, test some more, push and cut a fresh release.
Any chance you could share a portion of your foo* files? I have some small test cases but always good to hoard new ones.
Thanks for the bug fix, Eric.
g++ -Wall
is good at catching variables used before initialization. I've opened PR #38 to enable -Wall
.
We don't appear to be entirely out of the fire just yet. With this fix, all thirty sequences are getting filled. That's good. The sequences output by gfakluge fillseq
are consistently shorter than the sequences in the FASTA file.
ID FAI Fillseq Difference
tig00000001 26723 26459 -264
tig00000002 26606 26343 -263
tig00000003 419498 415345 -4153
tig00000004 353469 349970 -3499
tig00000005 580268 574523 -5745
tig00000006 35814 35460 -354
tig00000007 555817 550314 -5503
tig00000008 27340 27070 -270
tig00000009 370719 367049 -3670
tig00000010 109905 108817 -1088
…
I've sent you the Canu unitigs FASTA and GFA file in an e-mail.
The following awk script will add the sequences to the GFA file, which you can use to confirm that gfakluge fillseq
is producing identical output.
seqtk seq canu.unitigs.fasta | gawk -vOFS='\t' 'ARGIND == 1 { id = substr($1, 2); getline; x[id] = $1; next } $1 == "S" { $3 = x[$2] } 1' - canu.unitigs.gfa >canu.unitigs.seq.gfa
I chased this down to where sequences are read from FASTA files. I forgot to account for newline characters, meaning a character of sequence was lost for every additional line in a record.
This should be fixed now. I'm going to test on your Canu contigs.
I'm still seeing differences in sequence length with revision 960103871bc033a670e9f3c40a1077a0814d9fde.
ID FAI Fillseq Difference
tig00000001 26723 26721 -2
tig00000002 26606 26604 -2
tig00000003 419498 419458 -40
tig00000004 353469 353435 -34
tig00000005 580268 580212 -56
tig00000006 35814 35811 -3
tig00000007 555817 555764 -53
tig00000008 27340 27338 -2
tig00000009 370719 370684 -35
tig00000010 109905 109896 -9
…
Well, closer. Probably some variation on the same error. I'll keep plugging at a fix.
Another error was found - I forgot to terminate name strings with a null char, so we missed keys in the map when doing lookup. This won't fix the missing sequence bits, which still boggles me a bit. Hoping to chase it down this week.
Thanks for plugging away at this issue, Eric.
@sjackman I think this is finally tracked down and fixed in 0.3.1. There were numerous bugs that got squashed.
Thanks, Eric!
due to weird pointer weather
😂
I've updated the formula in Brewsci/bio. https://github.com/brewsci/homebrew-bio/pull/429
❯❯❯ diff -s orig.gfa fillseq.gfa
Files orig.gfa and fillseq.gfa are identical
❤️ 🍾
Sometimes it adds the sequence, and sometimes it leaves it as
*
. The following two runs were run one after the next with no change to the files. The GFA and FASTA file are produced by Canu. I'm able to reproduce this issue on both Linux and macOS.