LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0
167 stars, 31 forks

A question on barcode-aware readfish #202

Closed hasindu2008 closed 11 months ago

hasindu2008 commented 2 years ago

Hi

I am trying to understand how barcode-aware readfish works. We are currently trying to multiplex three samples. All three samples are from the same reference and the target panel is the same for all. The goal is to enrich all target regions in all samples. For this, is the barcode_kits option necessary (e.g. barcode_kits = ["EXP-NBD104"])? Can you explain what will happen if we run:

  1. with barcode_kits = ["EXP-NBD104"]
  2. without barcode_kits = ["EXP-NBD104"]

Also, another question: will it be a problem if we use only 3 barcodes out of a 12- or 96-barcode kit? Will readfish unintentionally try to enrich those non-existent barcodes in that case?

Also, is readfish compatible with the SQK-MLK111-96-XL barcoding kit?

hasindu2008 commented 2 years ago

I tried the following two configurations.

Configuration 1:

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast_prom"
host = "ipc:///tmp/.guppy"
port = 5555
align_ref = "/data/readfish/hg38noAlt.idx"

[conditions]
reference = "/data/readfish/hg38noAlt.idx"

[conditions.0]
name = "ReadFish_v7_gene_targets.collapsed.hg38"
control = false
min_chunks = 0
max_chunks = 16
targets = "ReadFish_v7_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

Configuration 2:

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast_prom"
host = "ipc:///tmp/.guppy"
port = 5555
barcode_kits = ["SQK-MLK111-96-XL"]
align_ref = "/data/readfish/hg38noAlt.idx"

[conditions]
reference = "/data/readfish/hg38noAlt.idx"

[conditions.unclassified]
name = "ReadFish_v7_gene_targets.collapsed.hg38"
control = false
min_chunks = 0
max_chunks = 16
targets = "ReadFish_v7_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

However, in both cases it seems like the reads being rejected are around 6 kb in length. See below:

screenshot_from_2022-06-16_13-47-09

screenshot_from_2022-06-16_13-59-40

Any tips on what I might be doing wrong?

mattloose commented 2 years ago

The first thing I can see is that you have max_chunks set to 16. In the context of running on PromethION that will grab 16 seconds worth of data before finally unblocking due to max_chunks exceeded. At 400 bases a second that would give you a peak at 6.4 kb - so pretty much what you are seeing here.
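Matt's arithmetic here can be sketched as a back-of-envelope check (assuming one-second signal chunks and ~400 bases/s translocation speed, as stated above):

```python
# Why max_chunks = 16 produces a rejection peak near 6.4 kb:
# each chunk is ~1 second of signal, and the pore moves ~400 bases/s,
# so a read can translocate this far before max_chunks forces an unblock.
chunk_seconds = 1        # signal evaluated in one-second chunks
max_chunks = 16          # value from the TOML above
bases_per_second = 400   # approximate translocation speed

peak_length = max_chunks * chunk_seconds * bases_per_second
print(peak_length)  # 6400 bases, i.e. ~6.4 kb
```

With max_chunks = 2 the same arithmetic gives ~800 bases, which is roughly the rejected-read peak you would hope to see.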

However - your toml snippets also appear incomplete - at least in the barcoded one you are only providing instructions for what to do with unclassified reads? @alexomics can you comment?

mattloose commented 2 years ago

For info - here is a working toml we are currently using. We have max_chunks of 2. We provide a category for unclassified reads, one for classified reads (but not the barcodes we are looking for) and then one for each barcode we are targeting - in this case 13, 14 and 15.

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast_prom"
host = "ipc:///tmp/.guppy"
port = 5555
barcode_kits = ["EXP-NBD114"]
align_ref = "/path/to/mmi"

[conditions]
reference = "/path/to/mmi"

[conditions.unclassified]
name = "unclassified_reads"
control = false
min_chunks = 0
max_chunks = 2
targets = []
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.classified]
name = "classified_reads"
control = false
min_chunks = 0
max_chunks = 2
targets = []
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode13]
name = "barcode13"
control = false
min_chunks = 0
max_chunks = 2
targets = "/data/path/to/targets"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode14]
name = "barcode14"
control = false
min_chunks = 0
max_chunks = 2
targets = "/data/path/to/targets"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode15]
name = "barcode15"
control = false
min_chunks = 0
max_chunks = 2
targets = "/data/path/to/targets"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
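As an illustration only (this is not readfish's actual code), the routing implied by a per-barcode TOML like the one above can be sketched in Python: a read whose barcode call has its own [conditions.barcodeNN] section gets that section's decisions, any other classified barcode falls back to the [conditions.classified] catch-all, and unclassified reads get [conditions.unclassified]:

```python
# Illustrative sketch of per-barcode condition routing; the condition
# dicts below are trimmed stand-ins for the TOML sections above.
conditions = {
    "unclassified": {"targets": [], "single_on": "unblock"},
    "classified":   {"targets": [], "single_on": "unblock"},
    "barcode13":    {"targets": ["/data/path/to/targets"], "single_on": "stop_receiving"},
}

def condition_for(barcode_call: str) -> dict:
    """Pick the condition block that applies to a read's barcode call."""
    if barcode_call == "unclassified":
        return conditions["unclassified"]
    # A barcode with its own section wins; any other classified barcode
    # is pooled under the "classified" catch-all and rejected.
    return conditions.get(barcode_call, conditions["classified"])

print(condition_for("barcode13")["single_on"])  # stop_receiving
print(condition_for("barcode42")["single_on"])  # unblock (falls back to classified)
```

This is why the catch-all sections matter: without them, reads from barcodes you are not targeting have no explicit rejection rule.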

And here are the peaks: image

hasindu2008 commented 2 years ago

So we did a PromethION run with 3-barcodes. The configuration used was:

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast_prom"
host = "ipc:///tmp/.guppy"
port = 5555
barcode_kits = ["SQK-MLK111-96-XL"]
align_ref = "/data/readfish/hg38noAlt.idx"

[conditions]
reference = "/data/readfish/hg38noAlt.idx"

[conditions.unclassified]
name = "unclassified_reads"
control = false
min_chunks = 0
max_chunks = 2
targets = []
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.classified]
name = "classified_reads"
control = false
min_chunks = 0
max_chunks = 2
targets = []
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode10]
name = "barcode10"
control = false
min_chunks = 0
max_chunks = 2
targets = "ReadFish_v7_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode11]
name = "barcode11"
control = false
min_chunks = 0
max_chunks = 2
targets = "ReadFish_v7_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode12]
name = "barcode12"
control = false
min_chunks = 0
max_chunks = 2
targets = "ReadFish_v7_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

Command to readfish was:

readfish targets --device 3D --experiment-name "ReadFish_v7_gene_targets" --toml ReadFish_v7_gene_targets.collapsed.hg38.toml --cache-size 3000 --batch-size 3000 --channels 1 3000 --port 9502

The break_after_reads parameter in MinKNOW was not changed - was left at default.

However, the enrichment was not so good. Digging a bit deeper into the end_reason field in the raw data, it seems like even the on-target reads were rejected. Below is a comparison of the end_reason counts from this PromethION run versus what we previously observed on the MinIONs.

image

Any idea what we are doing wrong? Is it that the max_chunks parameter set at 2 is too low?

mattloose commented 2 years ago

I'm not sure there is anything you are doing wrong here.

We don't expect the enrichment to be as good as on a GridION/MinION because (at least for now) the unblock time on PromethION is not as fast as on GridION/MinION. The drop in efficiency is roughly two fold - i.e. there is room for improvement here.

The signal positive/signal negative values you are observing are something we have also seen. In discussion with ONT, we don't believe this shift to be meaningful. You may also have noticed a lot of reads with odd status in the read length histogram during the run. It's thought that MinKNOW is struggling to correctly classify the ends of reads at times in this first release.

However, these values (positive/negative) are not telling you anything about unblocked reads. Those are identified in the row above (4).

So - from your table there is no evidence of a significant change in unblock behaviour based on rows 5 and 6. Row 4 suggests that the unblocks are a little less efficient on the PromethION (which we expect), but you also have to take read length into account - very short reads can filter through without being unblocked, or it's possible that a read which initially does not map to a target does map once an additional 2 seconds of data are collected.

If you have written out logs from readfish you can track where the original read fragment mapped and the decision that was made and sent to the sequencer.

You could try with more chunks but you run the risk of trying to unblock reads with more than 2kb through the pore. This increases the risk of the unblock failing and the pore becoming permanently blocked.

Do you have a measure of the actual enrichment achieved with the promethion versus the MinION runs? As an example we have seen a three to four fold improvement in coverage over targets when moving from a MinION to a PromethION with similar effective pore counts. One might expect a six fold improvement in coverage based on channel count - but the longer unblock loop reduces that.
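For a rough sense of the scaling Matt describes, the arithmetic can be sketched as follows (the channel counts are my assumption for illustration, not figures from this thread: ~512 MinION channels versus ~3000 PromethION channels):

```python
# Naive expectation: on-target coverage scales with channel count,
# so PromethION should give roughly six fold more than MinION.
# The slower PromethION unblock loop erodes that to the observed
# three-to-four fold mentioned above.
minion_channels = 512        # assumed MinION channel count
promethion_channels = 3000   # assumed PromethION channel count

naive_fold = promethion_channels / minion_channels
print(round(naive_fold, 1))  # ~5.9, i.e. roughly six fold

observed_fold = 3.5          # midpoint of the reported three-to-four fold
print(round(observed_fold / naive_fold, 2))  # fraction of the ideal retained
```

So a three-to-four fold improvement corresponds to recovering roughly 60% of the naive channel-count expectation.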

Does that help?

hasindu2008 commented 2 years ago

Thank you very much for the very informative response @mattloose. It seems like the enrichment (%_ON) is slightly lower on the PromethION compared to the MinION as per the stats. The on-target median read length is also quite a bit lower on the PromethION compared to the MinION, and I could not really figure out why.

TARGET_COUNT: number of reads
TARGET_SUM: sum of bases
TARGET_MEAN: mean read lengths
TARGET_MEDIAN: median read lengths
TARGET_MEDIAN_COV: median coverage

Here are some detailed stats of the runs:

image

mattloose commented 2 years ago

I'm confused. Is this comparing a single GridION/MinION run with a promethION run?

To be clear - I would expect more on target data from a promethion run when compared with an equivalent grid/minion run.

hasindu2008 commented 2 years ago

This is comparing MinION runs with a barcoded PromethION (3-barcode) run. In those cases, we ran one MinION flowcell per sample; all three samples were then multiplexed and run on a single PromethION flowcell.

mattloose commented 2 years ago

Ok.

I'd have expected that experiment to be equivalent, i.e. you would get the same enrichment.

Let me think about this - I've got some possible explanations.

hasindu2008 commented 2 years ago

Hi

Just to see if the barcode classification had anything to do with it, I applied the same settings to all the barcodes. At the same time, I increased max_chunks to 7 to see if providing a longer sequence would increase the median on-target read length. The TOML file is as below:

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast_prom"
host = "ipc:///tmp/.guppy"
port = 5555
barcode_kits = ["SQK-MLK111-96-XL"]
align_ref = "/data/readfish/hg38noAlt.idx"

[conditions]
reference = "/data/readfish/hg38noAlt.idx"

[conditions.unclassified]
name = "unclassified_reads"
control = false
min_chunks = 0
max_chunks = 7
targets = []
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.classified]
name = "classified_reads"
control = false
min_chunks = 0
max_chunks = 7
targets = []
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode10]
name = "barcode10"
control = false
min_chunks = 0
max_chunks = 7
targets = "ReadFish_v7_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode11]
name = "barcode11"
control = false
min_chunks = 0
max_chunks = 7
targets = "ReadFish_v7_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode12]
name = "barcode12"
control = false
min_chunks = 0
max_chunks = 7
targets = "ReadFish_v7_gene_targets.collapsed.hg38.txt"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

However, the on-target median length is still similar to the off-target median length on the PromethION.

image

mattloose commented 2 years ago

I wouldn't expect the chunk number to change this behaviour.

If I understand correctly, your MinION runs aren't barcoded - is that right?

Could you also calculate the medians based on the actual decisions logged by readfish? So look at the unblocked_read_ids.txt file for each run and see if the medians differ as expected?
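A minimal sketch of that comparison, assuming a plain 4-line-per-record FASTQ and one read ID per line in unblocked_read_ids.txt (the demo records and file handling here are illustrative; with real data you would open the pass+fail FASTQ and the log file instead of the in-memory example):

```python
from io import StringIO
from statistics import median

def fastq_lengths(handle):
    """Map read ID -> sequence length from a 4-line-per-record FASTQ."""
    lengths = {}
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator
        handle.readline()  # quality string
        lengths[header[1:].split()[0]] = len(seq)  # strip '@', keep the ID
    return lengths

def split_medians(lengths, unblocked_ids):
    """Median lengths for reads readfish unblocked vs everything else."""
    rejected = [n for rid, n in lengths.items() if rid in unblocked_ids]
    accepted = [n for rid, n in lengths.items() if rid not in unblocked_ids]
    return median(rejected), median(accepted)

# Tiny in-memory example: r2 was logged as unblocked, r1 and r3 were not.
demo = StringIO("@r1\nACGTACGT\n+\n!!!!!!!!\n@r2\nACG\n+\n!!!\n@r3\nACGTAC\n+\n!!!!!!\n")
rej_med, acc_med = split_medians(fastq_lengths(demo), {"r2"})
print(rej_med, acc_med)  # 3 7.0
```

If unblocking is working, the rejected median should sit well below the accepted median; similar medians in both groups would point at the problem discussed in this thread.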

Finally, could you share your targets file (or a subset thereof?) - understood if you can't for some reason, but it would be helpful to try and understand what is happening.

Thanks

hasindu2008 commented 2 years ago

Yes, MinION runs are not barcoded.

Is that file inside the data directory that MinKNOW produces with the raw data? If it is available, I will do that tomorrow. The median lengths I sent in the previous message were calculated by extracting the reads from the alignment BAM file using bedtools.

I will check if that target file can be shared and provide it to you tomorrow.

hasindu2008 commented 2 years ago

Hi,

The targets file is attached. ReadFish_v7_gene_targets.collapsed.hg38.txt

The unblocked_read_ids.txt did not look right. I did check a MinION run, and the median length was around ~700 for the read IDs present in unblocked_read_ids.txt, but some of those read IDs were not present in the basecalled FASTQ files (passed+fail). The rest of the reads in the FASTQ that were not in unblocked_read_ids.txt also had a median of around ~700.

mattloose commented 2 years ago

It's expected that some read IDs in unblocked_read_ids.txt are not present in the final pass+fail output. When running adaptive sampling, you grab all potential read data, but some of those events may be filtered out by further post-processing in MinKNOW and end up not being written to disk, and so are lost from the final data set.

Thanks for the target file - we will see if we can reproduce problems here.

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

mattloose commented 11 months ago

Closing as this issue should be resolved in the latest version of readfish. Please reopen if you find this not to be resolved. If it's a new issue, please create a new issue instead of reopening.