LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0

Readfish sometimes goes into a stuck state #159

Closed hasindu2008 closed 1 year ago

hasindu2008 commented 3 years ago

First of all, thank you very much for writing this nice tool. We have been running it for a while and there is a rare situation where readfish freezes. That is, log messages like 2021-08-17 15:44:42,118 ru.ru_gen 24R/0.02245s are no longer printed, and I believe it has gone into a frozen state. In such instances, MinKNOW is still sequencing smoothly, but without any rejections, so it is likely that readfish is frozen.

Digging into the logs, I see that the Guppy server crashed around the same time that readfish froze.

2021-08-17 15:44:47.679933   ERROR: common_process_crashed (host)
    executable: /data1/software/ont-guppy/bin/guppy_basecall_server
    name: guppy
    arguments: --config
               dna_r9.4.1_450bps_fast.cfg
               --port
               5555
               --log_path
               /var/log/minknow/guppy
               --ipc_threads
               4
               --max_queued_reads
               5000
               --num_callers
               6
               -x
               cuda:0
    exit_code: 0
    recent_output: ONT Guppy basecall server software version 4.2.2+effbaf8, client-server API version 3.2.0
                   config file:         /data1/software/ont-guppy/data/dna_r9.4.1_450bps_fast.cfg
                   model file:          /data1/software/ont-guppy/data/template_r9.4.1_450bps_fast.jsn
                   log path:            /var/log/minknow/guppy
                   chunk size:          2000
                   chunks per runner:   160
                   max queued reads:    5000
                   num basecallers:     6
                   num socket threads:  4
                   max returned events: 50000
                   gpu device:          cuda:0
                   kernel path:
                   runners per device:  8
                   Starting server on port: 5555
                   [guppy/error] pipeline::ThreadedNode::worker_function: Exception thrown in basecaller_node worker thread: Exception thrown in Caller worker thread: Called read appears to be malformed for aggregation.
                   The basecall server has shut down successfully.

Any tips on how to resolve this?

mattloose commented 3 years ago

So the issue here is that the client-server relationship is broken and readfish is waiting for a response that never comes. The freeze is caused by the Guppy crash, not by readfish. Tracking these errors is problematic, and it is also something we don't see very often.
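One generic way to at least surface this failure mode (a minimal sketch only, not something readfish ships) is a watchdog thread that warns when no batches have been processed for a while:

import threading
import time

class StallWatchdog:
    """Warn when no progress heartbeat has been seen for `timeout` seconds.
    Purely illustrative; readfish does not include this class."""

    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self._last = time.monotonic()
        self._lock = threading.Lock()
        threading.Thread(target=self._monitor, daemon=True).start()

    def heartbeat(self):
        # Call this every time a batch of reads is processed
        with self._lock:
            self._last = time.monotonic()

    def _monitor(self):
        while True:
            time.sleep(self.timeout / 4)
            with self._lock:
                stalled = time.monotonic() - self._last > self.timeout
            if stalled:
                print("WARNING: no read batches processed recently; "
                      "the basecall server may have crashed")

heartbeat() would be called from the main processing loop after each batch; what to do on a stall (log, alert, restart the client) is up to the caller.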

What hardware are you running on and what are your TOML settings? Are you running HAC or fast (I presume fast)?

If you can share a bit more information we can try to help. We have been debating ways of trying to resolve this issue here, so we might be able to suggest some workarounds.

Cheers

Matt

hasindu2008 commented 3 years ago

Thank you for the quick reply.

The hardware information is: AMD Ryzen Threadripper 3970X 32-Core Processor, 128 GB of RAM, NVIDIA 3090 GPU.

Software information is: Ubuntu 18.04.5 LTS, readfish 0.0.5a1, Guppy 4.2.2.

My TOML file looks like this:

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast"
host = "127.0.0.1"
port = 5555

[conditions]
reference = "/data1/readfish/hg38noAlt.idx"

[conditions.0]
name = "ReadFishhg38"
control = false
min_chunks = 0
max_chunks = inf
targets = "ReadFish.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

I am not sure how the server-client architecture is implemented in readfish, as my Python know-how is limited. If the underlying transport layer is TCP (likely for this kind of application where reliability is important), perhaps it is possible to use the kernel's TCP keepalive feature. In C this can be set on the client side after the socket() system call and before the connect() system call; I am not sure how to do this in Python.
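For reference, Python exposes the same socket options, though readfish talks to Guppy through a client library, so the raw socket may not be reachable from user code. A minimal sketch of enabling keepalive on a Linux client socket; the TCP_KEEP* constants are Linux-specific and the host/port values are just the ones from the TOML above:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Enable keepalive before connecting, mirroring the C socket()/connect() ordering
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning: start probing after 60 s idle, probe every 10 s, give up after 5 failed probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
sock.connect(("127.0.0.1", 5555))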

mattloose commented 3 years ago

Interesting - will have a look at that.

I would suggest that having max_chunks set to inf is not actually a good plan here. It can lead to you rejecting reads that have translocated a long way through the pore, which leads to increased blocking. Assuming you have break_reads set to 0.4, I would set max_chunks to 12 or 16.

This may also reduce the occurrence of crashes, though given you are on a 3090 I would be surprised if that were the underlying reason.

You might also wish to update to guppy 5 - this does work with readfish and you will get whatever internal improvements are available there as well.

hasindu2008 commented 3 years ago

Thanks for the tip. I will set max_chunks to 12-16. Is there any document where these TOML parameters are explained?

It is nice to hear Guppy 5 is supported - it is in release 0.0.6, I believe? For Guppy 4, adaptor scaling had to be disabled to get ~0.02 s latencies. Is the same trick, or more tricks, required for Guppy 5?

alexomics commented 3 years ago

At the moment, max_chunks is the maximum number of times that we evaluate a read (this used to be synonymous with the number of chunks, but not since read_until v3.0). This param is there to prevent attempting to unblock molecules that have progressed too far and are more likely to not be ejected.

So the ideal value is determined by your value for break_reads_after_seconds (in the MinKNOW conf TOML), the sequencing speed, and the maximum length you want to unblock. A max_chunks of 12-16 gives a max unblock length of roughly 2-2.5 kb when collecting chunks every 0.4 s.
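As a rough back-of-envelope check (assuming the nominal ~450 bases/s translocation speed implied by the 450bps config name): max unblock length ≈ max_chunks × break_reads_after_seconds × translocation speed, so 12 × 0.4 s × 450 b/s ≈ 2.2 kb and 16 × 0.4 s × 450 b/s ≈ 2.9 kb - the same ballpark as the 2-2.5 kb figure above, depending on the effective speed assumed.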

Guppy 5 should work fully with v0.0.6, just make sure you get the right client-lib version. Adapter scaling seems to have been removed in Guppy 5, so no fiddling needed.

hasindu2008 commented 3 years ago

Thank you. Much appreciated!

mattloose commented 3 years ago

It's worth noting that a lot of these issues are undocumented (something we will address), but they are also quite nuanced and depend on the experiment you want to do.

Unblocking lots of long molecules is detrimental to flow cell performance though and should be avoided. We've made some expensive mistakes in the past!

hasindu2008 commented 3 years ago

Today we got readfish version 0.0.6 set up on the GridION with Guppy 5.0 and it seems to work so far. It is using fast basecalling mode (see config file below); however, the times are a bit high. I did not change any Guppy config options to disable adaptor scaling, as you previously mentioned it was removed in Guppy 5.

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast"
host = "127.0.0.1"
port = 5555

[conditions]
reference = "/data/readfish/hg38noAlt.idx"

[conditions.0]
name = "ReadFish_v6_gene_targets.collapsed.hg38"
control = false
min_chunks = 0
max_chunks = 16
targets = "ReadFish_v6_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

(screenshot of readfish log output showing batch sizes and per-batch processing times)

Any tips on improving the latency?

mattloose commented 3 years ago

Actually, I don't think your latency needs much improving here. If you look at your batch size, it is an order of magnitude higher than we report in our original testing and setup guide - i.e. you are getting reads in batches of 70-110 and processing them in less than 0.2 s, which works out to around 0.0015 s per read and compares very favourably with our previous per-read speed of around 0.005 s.

So if you run this on a playback file, we suspect you will see something like what is reported in #162 - excellent performance.

If someone wants to set us up with a 3090 and chunky server we'd be happy to test here :-)

mattloose commented 3 years ago

Hmm - just re-read your message and this is on GridION? I'll benchmark again here and see what we are getting.

hasindu2008 commented 3 years ago

Well, I have not checked Guppy 5 with the 3090 yet.

On 3090 with Guppy 4 (fast base calling and adaptor scaling disabled) the times are:

2021-10-05 14:41:13,982 ru.ru_gen 24R/0.02217s
2021-10-05 14:41:14,186 ru.ru_gen 27R/0.02598s
2021-10-05 14:41:14,287 ru.ru_gen 28R/0.02641s
2021-10-05 14:41:14,387 ru.ru_gen 30R/0.02593s
2021-10-05 14:41:14,496 ru.ru_gen 26R/0.03549s
2021-10-05 14:41:14,589 ru.ru_gen 34R/0.02761s
2021-10-05 14:41:14,681 ru.ru_gen 19R/0.02028s
2021-10-05 14:41:14,785 ru.ru_gen 22R/0.02319s
2021-10-05 14:41:14,997 ru.ru_gen 36R/0.03550s
2021-10-05 14:41:15,086 ru.ru_gen 29R/0.02417s
2021-10-05 14:41:15,191 ru.ru_gen 25R/0.02905s
2021-10-05 14:41:15,287 ru.ru_gen 23R/0.02486s
2021-10-05 14:41:15,392 ru.ru_gen 34R/0.02921s

Times are similarly low on the 3070 card as well - again Guppy 4 (fast base calling and adaptor scaling disabled).

2021-10-05 14:44:56,906 ru.ru_gen 27R/0.03097s
2021-10-05 14:44:57,009 ru.ru_gen 35R/0.03429s
2021-10-05 14:44:57,102 ru.ru_gen 30R/0.02673s
2021-10-05 14:44:57,199 ru.ru_gen 21R/0.02436s
2021-10-05 14:44:57,408 ru.ru_gen 20R/0.03220s
2021-10-05 14:44:57,510 ru.ru_gen 33R/0.03497s
2021-10-05 14:44:57,602 ru.ru_gen 21R/0.02672s
2021-10-05 14:44:57,703 ru.ru_gen 22R/0.02730s
2021-10-05 14:44:57,801 ru.ru_gen 25R/0.02539s
2021-10-05 14:44:57,904 ru.ru_gen 28R/0.02819s
2021-10-05 14:44:58,000 ru.ru_gen 22R/0.02317s
2021-10-05 14:44:58,211 ru.ru_gen 22R/0.03430s
2021-10-05 14:44:58,306 ru.ru_gen 34R/0.02919s
github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue was closed because there has been no response for 5 days after becoming stale.