Hello again,
Bad news! :cry:
I've been using a local version of Cuttlefish with the above fix implemented, and I'm again embarrassed to say that the writing of the overlaps is still bugged for very large paths. Apparently the o_buffer of the "first thread id" is re-used in such large cases, again causing missing commas.
In the particular case where the bug happened again, the path was very large and had 366 256 "SegmentNames". The corresponding "Overlaps" field spanned 1 465 018 characters in total and was missing one comma at position 102 401. That is, two overlap values were incorrectly combined as "...,30M,30M30M,30M,...". For the record, I was running Cuttlefish with 16 threads.
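For anyone who wants to scan their own output for this, here is a minimal sketch of such a check (the names are purely illustrative, this is not Cuttlefish code, and it assumes each overlap is a single run such as 30M, as in the output above):

```cpp
// Sketch: scan a GFA1 P-line "Overlaps" string for fused values such as "30M30M",
// which indicate a missing comma. Illustrative only; assumes single-run values like "30M".
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

// Returns true if a single overlap element looks like more than one run,
// e.g. "30M30M" (digits appearing again after an operation letter).
bool looks_fused(const std::string& elem)
{
    bool seen_op = false;
    for (const char c : elem)
    {
        if (std::isdigit(static_cast<unsigned char>(c)))
        {
            if (seen_op)
                return true;    // digits restart after an op letter: two values fused
        }
        else
            seen_op = true;     // an operation letter such as 'M'
    }
    return false;
}

int main()
{
    // Example Overlaps field with one missing comma.
    const std::string overlaps = "30M,30M30M,30M";

    std::istringstream ss(overlaps);
    std::string elem;
    std::size_t offset = 0, idx = 0;
    while (std::getline(ss, elem, ','))
    {
        if (looks_fused(elem))
            std::cerr << "Fused overlap value \"" << elem << "\" at element " << idx
                      << " (character offset " << offset << ")\n";
        offset += elem.size() + 1;
        ++idx;
    }
}
```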
I re-ran Cuttlefish on the same data with the main branch version and there were no missing commas, only the extra duplicate overlap value that my patches so far have (unsuccessfully) attempted to fix.
While the proposed fix above currently does not fully resolve the issue, I'm still leaving the pull request here for further discussion on how to properly fix the bug. It is apparent that I don't understand the code base well enough to implement a proper and clean patch.
For now, for users (such as me) who need the full path information, using the main branch version of Cuttlefish with an indexing correction for the overlaps is the best procedure.
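For illustration, the indexing correction I mean essentially boils down to dropping the first element when splitting the Overlaps field; a minimal sketch (made-up names, not Cuttlefish code):

```cpp
// Sketch of the indexing correction for main-branch output, where the first overlap
// value of a path is a duplicate: drop it so that overlaps[i] lines up with the join
// between segments i and i + 1. Illustrative only, not part of the Cuttlefish code base.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> parse_overlaps_with_correction(const std::string& overlaps_field)
{
    std::vector<std::string> overlaps;
    std::istringstream ss(overlaps_field);
    std::string elem;
    while (std::getline(ss, elem, ','))
        overlaps.push_back(elem);

    if (!overlaps.empty())
        overlaps.erase(overlaps.begin());   // discard the duplicated first value

    return overlaps;
}

int main()
{
    // Main-branch output for a path of five segments: four joins, five values due to the duplicate.
    for (const std::string& o : parse_overlaps_with_correction("30M,30M,30M,30M,30M"))
        std::cout << o << '\n';             // prints the four corrected overlap values
}
```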
Best regards, Juri Kuronen
Hello @jurikuronen,
Thanks a lot for bringing this to our attention, and no worries about the newly introduced bug. I'll try to track down what's happening here; we've been busy the last few days with a new version of the work, so I couldn't check this issue earlier.
So as you mentioned:
I re-ran Cuttlefish on the same data with the main branch version and there were no missing commas, only the extra duplicate overlap value that my patches so far have (unsuccessfully) attempted to fix.
Have you also tried re-running Cuttlefish a second time with your local version? It would be great if you could give this a try and see whether the bug is reproducible; that would help us determine whether some race condition might be occurring somewhere underneath.
Hi @jurikuronen,
It would also be great if you could provide us with the exact input data that you used in this case, along with the exact command that you executed. Then I could try to reproduce the behavior myself.
Hi @jamshed,
Sorry for the long delay in my reply! I've now had time to look into this issue again. I've learned that Cuttlefish must process a lot of data for the bug to occur, so I've not been able to construct a minimal example that reproduces this behaviour. The bug appears to be very sensitive to the number of threads used: the lower the thread count, the more prevalent the bug. In particular, on my data, the coupling of overlap values will even happen multiple times per path when running Cuttlefish with only 1 thread. With a suitably high thread count, e.g. 24 threads, the bug disappears and we get correct output. Also, at higher thread counts, when the bug affects only a few paths, it is always the same paths each time, so there is some determinism in play here.
I will email you my input data; let's see if we can get to the bottom of this.
Cheers, Juri Kuronen
Hello @jurikuronen,
Thanks a lot for your email and the comprehensive pipeline demonstrating the bug! It has been very helpful for reproducing the behavior and, above all, for testing the fix.
I've pushed a fix for the bug in the hotfix-gfa1 branch. It passes the sample tests that you provided. It would be great if you could test it out on some other large datasets that you've been using Cuttlefish for. Let me know what you observe!
Regards.
Hi @jamshed,
I've now tested the version you linked on two bigger datasets we've been using, with varying numbers of threads and k-mer lengths. In all cases, the overlaps field had the correct count of comma-separated elements, and the various statistics my programs reported agreed with previous values. It would appear that the issue has finally been fixed, good job!
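For reference, the count check boils down to something like the following per P-line (a simplified, made-up sketch; real GFA parsing needs more care):

```cpp
// Sketch of the comma-separated element count check on a GFA1 P-line:
// for a path of n segment names, expect n - 1 overlap values.
#include <iostream>
#include <sstream>
#include <string>

std::size_t count_elements(const std::string& field)
{
    if (field.empty() || field == "*")
        return 0;
    std::size_t count = 1;
    for (const char c : field)
        if (c == ',')
            ++count;
    return count;
}

int main()
{
    // P <PathName> <SegmentNames> <Overlaps>
    const std::string p_line = "P\tpath1\t1+,2+,3+,4+\t30M,30M,30M";

    std::istringstream ss(p_line);
    std::string tag, name, segment_names, overlaps;
    std::getline(ss, tag, '\t');
    std::getline(ss, name, '\t');
    std::getline(ss, segment_names, '\t');
    std::getline(ss, overlaps, '\t');

    const std::size_t n_segments = count_elements(segment_names);
    const std::size_t n_overlaps = count_elements(overlaps);
    if (n_segments > 0 && n_overlaps != n_segments - 1)
        std::cerr << "Path " << name << ": expected " << n_segments - 1
                  << " overlaps, found " << n_overlaps << '\n';
    else
        std::cout << "Path " << name << ": overlap count OK\n";
}
```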
Best regards, Juri Kuronen
Hi @jurikuronen:
Thanks again for finding the bug, and even more so for the reproducible pipeline!
Your proposed solution at commit 41e52b9bfd171ebd2fd6d332a58f3e19cc62a1f8 would definitely have worked, had the output buffer o_buffer not been flushed to disk at times to keep the memory usage lower.
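To illustrate the point, here is a heavily simplified, hypothetical sketch (not the actual Cuttlefish buffering code): once the buffer may be flushed in the middle of a path, its being empty no longer means "first overlap of the path", so the separating comma gets skipped right after a flush.

```cpp
// Hypothetical simplification of the failure mode: a buffer that is flushed mid-path
// breaks the "add a comma unless the buffer is empty" rule. Not the actual Cuttlefish code.
#include <iostream>
#include <string>

constexpr std::size_t FLUSH_THRESHOLD = 6;  // tiny threshold, purely for illustration

std::string o_buffer;   // stands in for a per-thread output buffer

void flush()
{
    std::cout << o_buffer;  // stands in for writing the buffer out to the output file
    o_buffer.clear();
}

void append_overlap(const std::string& overlap)
{
    // Intended rule: a separating comma before every overlap except the first of the path.
    // After a mid-path flush the buffer is empty again, so the comma is wrongly skipped
    // and two adjacent overlap values end up fused in the output.
    if (!o_buffer.empty())
        o_buffer += ",";
    o_buffer += overlap;

    if (o_buffer.size() >= FLUSH_THRESHOLD)
        flush();
}

int main()
{
    for (int i = 0; i < 4; ++i)
        append_overlap("30M");
    flush();
    std::cout << '\n';  // prints 30M,30M30M,30M instead of 30M,30M,30M,30M
}
```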
Let us know of any other issues or support that you may need using Cuttlefish!
Best regards.
Hello again @jamshed,
Thanks for your quick reply to my earlier pull request (https://github.com/COMBINE-lab/cuttlefish/issues/8). I also much appreciate that you started working on the issue immediately!
I've been away from this work for a couple of weeks, and now that I've returned, I've noticed that my previous bug fix was issued hastily. I'm embarrassed to have to submit a new pull request on the same matter.
In the previously proposed fix, it is indeed true that the overlap buffer(s) already contain the correct overlaps, and thus removing the manually output first overlap corrects the issue. However, the proposed change to writing the overlap buffers introduced a new bug: my earlier fix to remove the first comma only works in a single-threaded context. The commas are naturally necessary when the overlap buffers of multiple threads are concatenated; otherwise, the first overlap from each thread won't be separated from the last overlap of the previous thread. E.g., "...,30M,30M,..." becomes "...,30M30M,...".
Since the thread-specific overlap output files are written to the GFA output file in order, changing the line
if (!o_buffer.empty()) o_buffer += ",";
to
if (thread_id > 0 || !o_buffer.empty()) o_buffer += ",";
adds a comma in front of the overlap buffers of other threads, while preventing the first thread from having the erroneous extra comma.

While this fixes the new bug, and hopefully doesn't introduce further new bugs, you might of course want to consider a cleaner implementation. Fortunately, the previous fix had not made its way to the main branch yet. Since the only problem with the main branch version is that the first overlap in the GFA1 path overlaps fields is a duplicate, current users can either edit it out with a script or implement an indexing correction in their code bases if they need to use the overlaps fields.
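Going back to the proposed change: to make the intent concrete, below is a stripped-down illustration of where the condition sits. Here the per-thread buffers are plain strings concatenated in thread-id order rather than the actual thread-specific files, so this is only a sketch and not the real Cuttlefish code.

```cpp
// Stripped-down illustration of the proposed condition: each thread builds its own
// overlap buffer, and the buffers are written out in thread-id order, so every buffer
// except the first thread's must start with a comma. Not the actual Cuttlefish code.
#include <iostream>
#include <string>
#include <vector>

int main()
{
    constexpr std::size_t thread_count = 3;
    std::vector<std::string> buffers(thread_count);

    // Each thread appends its share of the path's overlaps to its own buffer.
    for (std::size_t thread_id = 0; thread_id < thread_count; ++thread_id)
    {
        std::string& o_buffer = buffers[thread_id];
        for (int i = 0; i < 2; ++i)     // two overlaps per thread, for illustration
        {
            // The proposed condition: a leading comma for every thread but the first,
            // and a separating comma before every subsequent overlap within a buffer.
            if (thread_id > 0 || !o_buffer.empty())
                o_buffer += ",";
            o_buffer += "30M";
        }
    }

    // The per-thread buffers are concatenated in thread-id order, as in the GFA output.
    std::string overlaps_field;
    for (const std::string& buf : buffers)
        overlaps_field += buf;

    std::cout << overlaps_field << '\n';    // 30M,30M,30M,30M,30M,30M
}
```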
Best regards, Juri Kuronen