ekg / seqwish

alignment to variation graph inducer
MIT License
143 stars 19 forks source link

Length assertion exception during GFA generation. #28

Open darcyabjones opened 4 years ago

darcyabjones commented 4 years ago

Hi there!

FIrst off, thanks for your great work on this. I've been having heaps of fun trying to figure out working combinations of different genome graph tools over the last few weeks.

I'm running into an issue while attempting to squish alignments from a large number of assembled genomes (173 genomes, some pacbio, some illumina).

I've split the input contigs into components based on a coarse alignment of some of the complete genome assemblies. Some of the components will complete 'squishing' as expected, and the GFAs look great. Others will crash during the GFA generation step with the following error:

$ seqwish       -s "genomes.fasta"       -p "alignments.paf"       -g "component22.gfa"       --threads 28

length for 15FG105.scaffold0051.0_46565, expected 46565 but got 0
seqwish: /tmp/src/gfa.cpp:133: void seqwish::emit_gfa(std::ostream&, size_t, const string&, mmmulti::iitree<long unsigned int, long unsigned int>&, mmmulti::iitree<long unsigned int, long unsigned int>&, const sdsl::sd_vector<>&, const rank_1_type&, const select_1_type&, seqwish::seqindex_t&, mmmulti::set<std::pair<long unsigned int, long unsigned int> >&): Assertion `false' failed.

The output GFA is populated with some of the expected sequences and paths, but seems to be truncated at the point that it encounters the failing sequence.

Does this assertion mean anything to you?

I can check the input file and see that the sequence length is 46565 bp. There are plenty of alignments for this contig in the PAF file that are ~40000 bp long.

If I remove the scaffold from the input fasta, exclude PAF alignments for this scaffold and re-run seqwish, it will continue but usually run into another failing contig. For one component, i've been able to keep removing problematic contigs (about 10) until it will complete.

Unfortunately, i haven't been able to find a small reproducible example to debug with. I'm happy to share the fasta/alignments with you, but it really does take a while to process them (~8hrs with 28 cpus 128 Gb RAM). Smaller test sets appeared to work without issue.

I'll keep trying to simplify it, but the fact that it's an assertion error makes me think that it might be a problem that you've anticipated.

Thanks in advance for your time, Darcy

Some runtime details

I am running inside a singularity container using Debian buster as the base image (gcc 8.3.0-1). The seqwish commit is: 62f0055

The container Dockerfile is in https://github.com/darcyabjones/fungraph/blob/master/containers/seqwish.Dockerfile. The repo is at https://github.com/darcyabjones/fungraph.

Alignments from minimap are filtered to have at least 10,000bp aligned outside of repeat regions. The sequences are soft-masked, I haven't tried converting them to uppercase but I can't see that being an issue since other components work. There are no redundant/non-standard characters in the sequence. The fasta sequence ids are globally unique and are split up like so: <isolate_id>.<scaffold_name>.<contig_start>_<contig_end>.

Cheers!

ekg commented 4 years ago

I'm glad you're enjoying seqwish! It's meant to be a simple, dumb graph inducing kernel, so it should b. The check you've run into is part of the validation that the graph is lossless with respect to its input sequences. The particular error you're getting is concerning, and would indicate that something has broken in the process.

First, can I confirm that you're working with seqwish 62f0055271fd6825e736151ae60efa175a92bf9c (current HEAD)?

Second, make sure that these problematic sequences are only included one time in the input. I don't think this is checked for properly now and could introduce a strange error. You think this is the case so let's assume that's not a problem.

The sequences are soft-masked, I haven't tried converting them to uppercase but I can't see that being an issue since other components work.

This definitely could be a problem. It's not something that seqwish is meant to handle, and I believe that I've crashed seqwish in the past for the same reason. Probably, I should fix this by upper-casing things as they're read in. At the moment, you can hack it with awk like this:

cat x.fasta | awk '/^>/ { print; } !/^>/ {print toupper($0)}' >x.uc.fasta

If that's not the issue, then we should exchange a test case and I can dig in further.

If it is the issue, then I'll update seqwish to guard against this on sequence import.

ekg commented 4 years ago

To clarify a little what might be happening: The code in seqwish doesn't handle matching of reverse complemented soft-masked (lower case) sequences. If you have only alignments to soft-masked sequences, it might break some assumption in the seqwish process, resulting in an incomplete input. Or, it could be that this is also tripping up the check of sequence integrity. Basically, seqwish doesn't handle lower case bases.

ekg commented 4 years ago

Also, sorry for quickly scanning your message and writing a confused response re-asking you for details like the commit id. I see you'd already answered most of my questions.

darcyabjones commented 4 years ago

:) No worries. I did actually forget to include the commit id initially and hastily edited the issue. You might have caught it before then.

Thanks for your response.

I've re-run the task after uppercasing the sequences as suggested, but I get the same error. I also tried only including sequences and alignments that relate to that first failing path (15FG105.scaffold0051.0_46565) which ran to completion. So even though it always fails at the same point if I restart it, it doesn't seem like the issue is related to this sequence/alignment itself.

If I run it with the -debug parameter it seems to dump a bit-vector just before crashing.

After a bit of tinkering i've got a slightly smaller sample that reproduces the problem. Essentially it's just a random sample of 50 sequences and the alignments for those sequences. This time it fails while writing the path for 15FG109.scaffold0047.0_94873. Takes about 5mins for me, but it might actually be a bit faster on a laptop/desktop because I suspect some IO heavy steps are a bit slow on the NFS drive.

Alignments and uppercased sequences are here: https://www.dropbox.com/s/z6ijjde3oukkm13/seqwish_example.tar.gz?dl=0

Darcy

ekg commented 4 years ago

Great, I can work from this. It might take me a few days to get to it. I'm working on a paper now.

Are you running seqwish in parallel? That speeds up the sorts a lot, which is a big part of the runtime.

The other big runtime cost is the union find ("transitive closure") which is single threaded and running too many interval tree queries. I have an update to make that run in parallel while removing the vast majority of interval tree queries, but it will be tricky to implement and might take a few weeks.

On Fri, Nov 15, 2019, 07:14 Darcy Jones notifications@github.com wrote:

:) No worries. I did actually forget to include the commit id initially and hastily edited the issue. You might have caught it before then.

Thanks for your response.

I've re-run the task after uppercasing the sequences as suggested, but I get the same error. I also tried only including sequences and alignments that relate to that first failing path (15FG105.scaffold0051.0_46565) which ran to completion. So even though it always fails at the same point if I restart it, it doesn't seem like the issue is related to this sequence/alignment itself.

If I run it with the -debug parameter it seems to dump a bit-vector just before crashing.

After a bit of tinkering i've got a slightly smaller sample that reproduces the problem. Essentially it's just a random sample of 50 sequences and the alignments for those sequences. This time it fails while writing the path for 15FG109.scaffold0047.0_94873 . Takes about 5mins for me, but it might actually be a bit faster on a laptop/desktop because I suspect some IO heavy steps are a bit slow on the NFS drive.

Alignments and uppercased sequences are here: https://www.dropbox.com/s/z6ijjde3oukkm13/seqwish_example.tar.gz?dl=0

Darcy

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/28?email_source=notifications&email_token=AABDQEOSCBJKFKITQ37DUILQTY425A5CNFSM4JNOEG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEENVNI#issuecomment-554228405, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEOOSTDSXXHYJKOI3I3QTY425ANCNFSM4JNOEG4A .

darcyabjones commented 4 years ago

I'm so sorry. I just realised that I included the wrong fasta file in that tarball (some sequence ids in the alignments weren't in the fasta).

Here is the corrected version: https://www.dropbox.com/s/ty3i4gesifh8kaj/seqwish_example.tar.gz?dl=0

I removed the old link to avoid confusion.

Yes I've been allowing multiple threads (up to 28). For the steps that support parallelisation it seems to work well, but I found that the vast majority of time was spent in a single threaded step so i was just wasting allocated cores, and there seems to be a second parallel step that creates multiple temp files for each thread but only writes to one file.

Probably the part that I thought was our slow NFS drive was your interval tree step. Process introspection is a bit hard on our cluster.

Anyway thanks again. Good luck with the paper writing.

ekg commented 4 years ago

The single file is then sorted in parallel. The parallel writing is to not have any locks on the parallel step. But it costs a copy of the data which can be expensive.

I want to figure out how to write directly into one file in chunks. This is conceptually easy but tricky to implement. A worker thread needs to run ahead allocating buffers and passing them to the worker threads.

Everything is memory mapped, so if you have enough RAM to cache the file then it should run fast.

On Fri, Nov 15, 2019, 15:20 Darcy Jones notifications@github.com wrote:

I'm so sorry. I just realised that I included the wrong fasta file in that tarball (some sequence ids in the alignments weren't in the fasta).

Here is the corrected version: https://www.dropbox.com/s/ty3i4gesifh8kaj/seqwish_example.tar.gz?dl=0

I removed the old link to avoid confusion.

Yes I've been allowing multiple threads (up to 28). For the steps that support parallelisation it seems to work well, but I found that the vast majority of time was spent in a single threaded step so i was just wasting allocated cores, and there seems to be a second parallel step that creates multiple temp files for each thread but only writes to one file.

Probably the part that I thought was our slow NFS drive was your interval tree step. Process introspection is a bit hard on our cluster.

Anyway thanks again. Good luck with the paper writing.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ekg/seqwish/issues/28?email_source=notifications&email_token=AABDQEPCU6OGARLZBJT2MJ3QT2V33A5CNFSM4JNOEG4KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEFR4YY#issuecomment-554376803, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEJUGUDXAA5H5ZS42CTQT2V33ANCNFSM4JNOEG4A .