LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0
167 stars 31 forks source link

TOML target specification #242

Closed jennieli421 closed 1 year ago

jennieli421 commented 1 year ago

If I would like to enrich for all reads that are mapped to a reference, what should I put as targets?

Adoni5 commented 1 year ago

Hi @jennieli421,

I think that would be

single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"

Does that make sense?

jennieli421 commented 1 year ago

Oh, I meant to ask how I should writ the line of targets =. Should I write targets = []? In that case, I got

2023-06-27 14:59:45,694 ru.ru_gen Region 'select_target_genes' (control=False) has 0 contigs of which 0 are in the reference. There are 0 targets (including +/- strand) representing 0.0% of the reference. Reads will be unblocked when classed as single_off or multi_off; sequenced when classed as single_on or multi_on; and polled for more data when classed as no_map or no_seq.

which seemed like no read can be mapped to the reference.

Adoni5 commented 1 year ago

Oh I see! What you should do is either list all the chromosome names in the reference like so

targets = ["chr1", "chr2"...]

Or you can leave the targets empty,

targets = []

And reverse the toml

single_on = "unblock"
multi_on = "unblock"
single_off = "stop_receiving"
multi_off = "stop_receiving"
min_chunks = 0
max_chunks = 2
no_map = "proceed"
no_seq = "proceed"

This way anything that maps to the reference is single_off or multi_off, anything that doesn't map after 2 chunks is unblocked. You may want to change no_map to unblock, which means that anything that doesn't map is unblocked immediately rather than left to acquire another chunk.

Hope this helps!

danchurch commented 1 year ago

Hello! This issue/discussion is helpful to me, thank you both. I hope this has resolved jennieli421's issue. But since this is still open, I have a very similar use case - I want to deplete an entire genome. But I'm unclear, I have a question from your above TOML file - can you clarify the "stop-receiving" setting?

From your Nature paper I get this definition: "stop receiving data for the remainder of that read". This makes me think that this read is also rejected, similar to unblocking. But in the above TOML file with empty targets, it seems like you are saying that "stop_receiving" allows the read to be sequenced?

If so, what is the difference between "stop_receiving" and "proceed"?

mattloose commented 1 year ago

Proceed means that one collects more data for an individual read and you assess it again. Stop_receiving tells the sequencer to keep sequencing that read and not evaluate it again.

So - unblock means a read will be rejected from the pore and a new one sampled. proceed means the read will continue to sequence and the next batch of singal will be analysed again. stop_receiving means send no more data about this read and let it sequence to normal completion.

I hope that helps!

danchurch commented 1 year ago

Totally. You guys are great, btw. This is amazing software.

danchurch commented 1 year ago

Here is another related question. In the above example for enriching a genome, with no listed targets you give these settings:

targets = [] single_on = "unblock" multi_on = "unblock" single_off = "stop_receiving" multi_off = "stop_receiving" min_chunks = 0 max_chunks = 2 no_map = "proceed" no_seq = "proceed"

You mention above that any read that is categorized as single_off or multi_off is a read that maps to the genome. But what then are the "single_on" and "multi-on" reads? These would normally be reads that map to the targets, but there are no targets. Do these settings become irrelevant when the targets = []?

Adoni5 commented 1 year ago

Sorry, the documentation isn't very clear on this! I'll update it

A break down:

  1. single_on - Basecalled read signal fragment produced a single mapping, which does start within a target region
  2. multi_on - Basecalled read signal fragment produced multiple mappings, of which At least one starts within a target region
  3. single_off - - Basecalled read signal fragment produced a single mapping, which does not start within a target region
  4. multi_off - Basecalled read signal fragment produced multiple mappings, of which None start within a target region
  5. no_map - Read signal fragment did not align to the reference
  6. no_seq - Read signal fragment did not base call

So in this TOML

single_on = "unblock"
multi_on = "unblock"
single_off = "stop_receiving"
multi_off = "stop_receiving"
min_chunks = 0
max_chunks = 2
no_map = "proceed"
no_seq = "proceed"

single_on and multi_on are impossible, because there are no targets. single_off and multi_off are the conditions for any reads that align to the reference, as the whole reference is technically off target, as there are no targets no_map covers any reads that do not map to the reference

So any read which does not align in 2 chunks is unblocked (max_chunks =2), and read which aligns to the reference is sent a stop_receiving, and sequenced

Does that make sense?

danchurch commented 1 year ago

Perfect, thanks!

Adoni5 commented 1 year ago

Cool gonna close this for now. Feel free to reopen if you need more help