JensUweUlrich / ReadBouncer

Fast and scalable nanopore adaptive sampling
GNU General Public License v3.0

Problems connecting to MinKnow #44

Closed pspealman closed 2 years ago

pspealman commented 2 years ago

Hello,

Trying to run our first adaptive sampling experiment using ReadBouncer. We verified that the playback simulation works (and then undid the changes by setting data_generation.simulated_device=0 and device.simulator_device_count=0). And that the sequencer works without readbouncer. But when trying to use readbouncer we get the following error:

[screenshot: ReadBouncer connection error]

our toml file:

```toml
usage = "target"
output_directory = 'C:\Program Files\ReadBouncer\output'
log_directory = 'C:\Program Files\ReadBouncer\'

[IBF]
kmer_size = 13
fragment_size = 100000
threads = 3
target_files = ['E:\readbouncer_trial\output_dir\nucleotide_fasta_ARG.ibf', 'E:\readbouncer_trial\output_dir\nucleotide_fasta_Integron_class_genes.ibf']
exp_seq_error_rate = 0.1

[MinKNOW]
host = "127.0.0.1"
port = 9501
flowcell = "FAT09957"

[Basecaller]
caller = "DeepNano"
threads = 3
```

We are using Windows 10

JensUweUlrich commented 2 years ago

Hi @pspealman

could you try again after changing the line `port = 9501` to `port = "9501"`?

Cheers Jens

pspealman commented 2 years ago

Thanks for the rapid reply Jens; unfortunately, it did not seem to make a difference.

[screenshot: the same connection error]

I also just checked the minknow_api and it isn't seeing a port number either (as in the screenshot).

pspealman commented 2 years ago

I was running MinKNOW 21.06.0 and am now running MinKNOW 22.05.5, and am still having the same 'failed connection' issue with ReadBouncer and the same 'localhost:None' message with minknow_api.

JensUweUlrich commented 2 years ago

Oh wait... now I see the problem. It seems our parameter flowcell = "FAT09957" is kind of misleading. It should point to the device name. So in your case, it has to be flowcell = "MN28420". Sorry for the confusion. I guess we need to change that in the next release. I hope you can roll back to MinKNOW version 21.06.0, because our latest release does not support MinKNOW core 5.1.
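In other words, the `[MinKNOW]` section of your config should read (with `MN28420` being your device name):

```toml
[MinKNOW]
host = "127.0.0.1"
port = 9501
flowcell = "MN28420"  # despite the key name, this must be the device name, not the flow cell ID
```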

pspealman commented 2 years ago

Wonderful - I rolled back to 21.06.0 (luckily I still had the original exe; these are no longer accessible through ONT), changed the flowcell parameter to point to the device name, and it is now connecting!

[screenshot: ReadBouncer connecting successfully]

(on a side note, changing the port = 9501 into port = "9501" does not seem to make a difference).

JensUweUlrich commented 2 years ago

Great! I'm happy that I could help you with the connection issue. Kudos also for storing the older installation exe. Now I'm curious if you can achieve enrichment using DeepNano-blitz and ReadBouncer ;-)

pspealman commented 2 years ago

Well, I think the answer to that is a cautious 'maybe'.

We sequenced '1.06 M' reads using two small IBFs as enrichment targets. We found 738,816 reads categorized, but checking the 'read_until_decision_stats.csv' file we find that all the decisions were 'unblock', the median 'read_nr' is 5544, and the median 'sequence_length' is 286.

We'll do basecalling and alignment tonight and see what the reads look like.

pspealman commented 2 years ago

So the answer is, unfortunately, 'no'. We see similar rates of enrichment with ReadBouncer and without. There are sequences that I think should be identified as positive - that is the target occurs in the first few hundred bases of the fragment - but which are not being identified as positive in the 'read_until_decision_stats.csv' file.

A couple of questions - could it be a performance issue? We are using an IBF with about 4000 targets with an average length of 1500 bp. Could something as simple as adding more RAM help performance?

Also is there a way to increase the amount of sequence read before a decision is made? The documentation seems to say that the first 1500 bases are read before a decision is made but we see sizes much smaller than that (~600) being rejected.

JensUweUlrich commented 2 years ago

First of all, what is the definition of 'positive' in your experiment? An unblocked read or a target sequence you want to enrich for?

If I understand your experimental setup correctly, you're sequencing some bacterial samples and want to enrich for ARGs and Integrons, providing those sequences in the target filters. With 4,000 genes (bins), the IBF should be small enough to fit into the RAM and have no performance issues. However, in your last run, you provided a third IBF with 95,286 bins, which seems to need 12 GB of RAM. This could definitely pose an issue for the classification algorithm and I guess you would need at least 8-10 classification threads to keep up with the base caller. You can check this by having a look at the ReadBouncer.log file. I assume you will see an overflow of the classification queue.

Regarding your last question, ReadBouncer takes a minimum read length of 200 bp to make a decision but only tries to unblock reads until they have reached a maximum length of 1,500 bp. Thus unblocked read lengths of ~600bp are expected. However, if you experience performance issues, the unblocked read lengths will be much larger because the classification algorithm is not fast enough to process reads from the internal classification queue.
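The decision behavior described above can be sketched roughly as follows. This is a toy model of the logic as described in this thread, not ReadBouncer's actual code; the function `decide` and its arguments are hypothetical names:

```python
MIN_PREFIX = 200    # minimum read prefix length before any decision is made
MAX_PREFIX = 1500   # beyond this, ReadBouncer stops trying to unblock

def decide(prefix_length: int, prefix_matches_target: bool):
    """Toy model of the per-read decision in target (enrichment) mode.

    Returns None while more sequence is needed, otherwise the decision
    string as it appears in read_until_decision_stats.csv.
    """
    if prefix_length < MIN_PREFIX:
        return None                  # too short: keep accumulating chunks
    if not prefix_matches_target:
        return "unblock"             # no target match on the prefix: reject
    if prefix_length >= MAX_PREFIX:
        return "stop_receiving"      # still matching at 1,500 bp: sequence fully
    return None                      # matching so far: read further chunks
```

Under this model, unblocked reads of ~286 bp (first chunk, no match) and ~600 bp are both expected outcomes.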

pspealman commented 2 years ago

Thanks for the feedback, you definitely understand the nature of the experiment. "Positive" in this case means the target sequence we are looking to enrich for is present in the read.

The 95,286-bin IBF was all ribosomal variable sequences (LSUs); we tried it, noticed the categorization time was too high (only 23 out of ~100,000 reads were classified), so we stopped and removed it. We restarted using 4,634 bins (ARGs) and 4 bins (Integrons) and started getting much better categorization (738,816 out of 901,879 reads) - but all of these were 'unblock' decisions.

I'm surprised that the minimum window is 200 bp; I must have misunderstood the documentation. To be clear, we were hoping ReadBouncer would scan the first 1500 bp for a possible match before making a decision. I had thought that, if only a 'target_files' (enrichment) database was used, a read with no matches within the first 1500 bp would be 'unblocked' (rejected). I'm not clear how a read gets rejected using only a 'target_file' database.

JensUweUlrich commented 2 years ago

We intended to set the minimum prefix size to 200 bp because the classification specificity drops significantly for smaller sequences, while after 1500 bp of sequencing the classification decision does not change anymore. We could easily add a parameter to the config where you would specify the minimum and maximum prefix length of a read for classification. However, if only the suffix of the read matches with the target database, the number of matching k-mers will not be large enough to consider the read a match and thus ReadBouncer will unblock/reject it. To overcome this issue, you would need to add the surrounding ~1000 bp of the ARGs or increase the expected sequencing error rate in order to get good classification results. Here, I would recommend using only single chunks (substrings) of a given length for classification instead of the whole prefix. This could probably help you and fits better with the use case. That means three additional parameters we need to add to ReadBouncer.

pspealman commented 2 years ago

Thanks for the clarification. I really appreciate your time in helping us with this. We would love to use ReadBouncer to perform adaptive sampling in educational and research environments that lack access to high end hardware.

We could easily add a parameter to the config where you would specify the minimum and maximum prefix length of a read for classification.

I think that would be helpful - and hopefully not just for me.

I understand and agree that the minimum prefix of 200 bp makes sense. When I was reading the method I misunderstood how the 200 bp chunks were being combined (so 400 bp, then 600 bp, etc.) and then queried for a match. I had thought that if some substring within the 1500 bp had a significant number of k-mer matches to a target then it would be a match (so, even 600 bp from the end of the 1500 bp would count as a match). But it sounds like substring k-mer matches within the 1500 bp are only counted across the whole prefix length.

To overcome this issue, you would need to add the surrounding ~1000 bp of the ARGs or increase the expected sequencing error rate in order to get good classification results.

Unfortunately, because these are mobile genetic elements we don't know their flanking sequences.

Here, I would recommend using only single chunks (substrings) of a given length for classification instead of the whole prefix.

I'm not sure I understand your suggestion. Would this be setting the 'max_chunk' IBF parameter to '1' and increasing the 'chunk_length' to '1500'?

Finally, I'm still confused on how we're getting 'unblock'/reject reads. The Adaptive Sampling section says:

b) If only target_files are provided, all reads that do not match to at least one target file will be unblocked.

but we have many 'undecided' reads (that are not 'positive' but are also not 'unblocked'/rejected). I presume these went all the way to 1500bp with no matches - but they weren't rejected?

Conversely, we also have many reads that are 'unblocked'/rejected but with a median 'sequence_length' of only 286. This suggests that no match was found in the first or second chunk and so a decision was made to reject the read - but how is that different from the 'undecided' reads?

JensUweUlrich commented 2 years ago

I think that would be helpful - and hopefully not just for me.

I will open a bug for the feature but it could take some time until I have implemented it.

Unfortunately, because these are mobile genetic elements we don't know their flanking sequences.

That's exactly what I suspected.

Would this be setting the 'max_chunk' IBF parameter to '1' and increasing the 'chunk_length' to '1500'?

Not exactly, because this would mean the entire 1,500 bp prefix needs to match. I'm currently thinking about a design where you provide a minimum chunk length of e.g. 300 bp and switch on a "single_chunk_matching" parameter, such that there is no concatenation of the chunks but a single-chunk classification: if one chunk matches the IBF, the read is classified for rejection (or for continued sequencing, as in your target approach). This process repeats until a maximum read prefix length has been sequenced (provided via a max_prefix parameter). I'm actually brainstorming ;-)
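A rough sketch of that proposed mode, purely for illustration - the parameters "single_chunk_matching" and "max_prefix" do not exist in any ReadBouncer release yet, and `chunk_matches` stands in for the IBF k-mer lookup:

```python
CHUNK_LENGTH = 300   # proposed minimum chunk length: classify each chunk independently
MAX_PREFIX = 1500    # proposed max_prefix: give up on the read beyond this length

def classify_single_chunks(read_prefix: str, chunk_matches) -> str:
    """Toy sketch of single-chunk matching in target (enrichment) mode.

    Instead of querying the concatenated prefix, each chunk is queried on
    its own; a single matching chunk is enough to keep the read.
    """
    for start in range(0, min(len(read_prefix), MAX_PREFIX), CHUNK_LENGTH):
        chunk = read_prefix[start:start + CHUNK_LENGTH]
        if chunk_matches(chunk):
            return "stop_receiving"   # any matching chunk: sequence the read fully
    return "unblock"                  # no chunk matched within the prefix: reject
```

This would let a target sitting 600 bp into the prefix still trigger a match, which the whole-prefix k-mer count cannot.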

but we have many 'undecided' reads (that are not 'positive' but are also not 'unblocked'/rejected). I presume these went all the way to 1500bp with no matches - but they weren't rejected?

Within the read_until_decision_stats file you will only see "unblock" or "stop_receiving" decisions. In your use case "unblock" means that there was no match with your target IBFs on the prefix. "stop_receiving" means that after sequencing the first 1,500 bp no unblock decision could be made, which in your case means a full 1,500 bp prefix match with your target IBFs.

Conversely, we also have many reads that are 'unblocked'/rejected but with a median 'sequence_length' of only 286. This suggests that no match was found in the first or second chunk and so decision was made to reject the read - but how is that different from the 'undecided' reads?

That's correct. The first prefix chunk had no match with your target IBFs and thus the read was rejected.

pspealman commented 2 years ago

I'm actually brainstorming ;-)

Us too - if you'd like to have a chat outside of github I hope you feel free to reach out to me pspealman@nyu.edu

pspealman commented 2 years ago

Within the read_until_decision_stats file you will only see "unblock" or "stop_receiving" decisions. In your use case "unblock" means that there was no match with your target IBFs on the prefix. "stop_receiving" means that after sequencing the first 1,500 bp no unblock decision could be made, which in your case means a full 1,500 bp prefix match with your target IBFs.

This is interesting because we have no "stop_receiving" decisions. Only "unblock".

I've made the file available here, in the hopes that it helps. https://drive.google.com/file/d/1_Taekf8oSyxGOZiJtIVXDbnglaqCzCB3/view?usp=sharing