LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0
169 stars 33 forks source link

Barcoded adaptive sampling not working as expected #289

Closed jamesemery closed 12 months ago

jamesemery commented 12 months ago

I have been experimenting with running barcoded sequencing but I have found that the barcoded experiments don't seem to successfully increase coverage across targeted regions of the genome like I have observed when runnning without barcoded sequencing. I based my config files that I'm using off of the discussion here #202 and here is my configuration:

[caller_settings]
config_name = "dna_r10.4.1_e8.2_400bps_5khz_fast.cfg"
host = "ipc:///home/prom/"
port = 5556
barcode_kits = ["SQK-NBD114-24"]
align_ref = "/data/Homo_sapiens_assembly19.mmi"

[conditions]
reference = "/data/Homo_sapiens_assembly19.mmi"

[conditions.unclassified]
name = "unclassified_reads"
control = false
min_chunks = 0
max_chunks = 12
targets = []
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.classified]
name = "classified_reads"
control = false
min_chunks = 0
max_chunks = 12
targets = []
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode03]
name = "barcode_3_evens"
control = false
min_chunks = 0
max_chunks = 12
targets = "/home/prom/TargetedSequencing/beds/titrationbeds/HG002_SVs_20kbnuc.40000padding.to_3_percent.even_chr_40000_padding.bed"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode05]
name = "barcode_5_odds"
control = false
min_chunks = 0
max_chunks = 12
targets = "/home/prom/TargetedSequencing/beds/titrationbeds/HG002_SVs_20kbnuc.40000padding.to_3_percent.odd_chr_40000_padding.bed"
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[conditions.barcode08]
name = "barcode_08_notarget"
control = false
min_chunks = 0
max_chunks = 12
targets = []
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "stop_receiving"
multi_off = "stop_receiving"
no_seq = "stop_receiving"
no_map = "stop_receiving"

When I run Guppy for basecalling i use this configuration: /opt/ont/guppy/bin/guppy_basecall_server --port 5556 --config dna_r10.4.1_e8.2_400bps_5khz_fast.cfg --log_path /home/prom/targetedscripts/guppy_basecaller_logs --ipc_threads 1 --max_queued_reads 3000 --chunks_per_runner 48 --num_callers 16 -x cuda:all

My version of Readfish that I'm using on the PromethION device is git+https://github.com/LooseLab/readfish@issue251. Should this version be expected to do targeting correctly? Furthermore when I run this experiment, on MinKNOW I set the configuration to the "SKQ-NBD114-24" chemistry but I disable basecalling/barcode splitting from MinKNOW as I have discovered that these were not necessary for un-tagged targeting before.

Digging through the targeted sites, it seems that barcoded sequencing with m thses configurations is failing to properly surpress the majority of off-target long reads (first 3 tracks) compared with a successfully non-barcoded run (last track). Notice how many long off-target reads there are in all of the barcode tracks that were not suppressed correctly, this seems to have the effect of making this experiment the same as if I had done no targeting at all with these libraries. Am I running this experiment correctly? I suspect that there is some detail about the configuration/versions that is resulting in this very poor yield when compared with my other experiments... Thank you.

Screenshot 2023-10-11 at 11 25 48 AM
github-actions[bot] commented 12 months ago

Thank you for your issue. Give us a little time to review it.

PS. You might want to check the FAQ if you haven't done so already.

This is an automated reply, generated by FAQtory

mattloose commented 12 months ago

Hi James,

Thanks for the very detailed issue. I agree that it does look like this has not worked as it should do.

From the configuration you have supplied I can't immediately see why it hasn't worked (which is always slightly worrying).

We have literally just pushed a new version of the readish code and I would suggest switching to this. I would also suggest running the new readfish validate command on your toml file (which - by the way - will need slight editing to fit the new format - see https://looselab.github.io/readfish/toml.html for more details).

This new validate will make sure that things like the contig names in the reference match the contig names in your bed file etc. I'm sure that isn't the explanation, but it could be.

It will also allow you to use minimap2 outside of guppy and so we get a better handle on the mapping.

Please post any updates on the output of readfish validate here.

jamesemery commented 12 months ago

Hi @mattloose thank you for the speedy reply! I will take a look at updating and report back to once I at least have it running on simulations to be sure that everything seems right on the surface...

Should the new version ("2023.1.0" I presume???) be expected to work on the PromethION? I recently had to go through some back and forth to get the PromethION branch of Readfish working on that instrument and I just want to be clear that its not a known issue if I try to upgrade #251.

mattloose commented 12 months ago

One of the things we are happiest about here is the fact that this should work on promethION or gridion or minion etc etc etc.

We had so many branches handling so many issues we've spent some time consolidating them into one - hopefully improved - branch.

We have run this code on P2solo, GridION and PromethION.

Also do test on simulations wherever you can!

mattloose commented 12 months ago

And yes - 2023.1.0!

jamesemery commented 12 months ago

I am testing a barcoded Bulkfile with the new version right now!

I'm not sure if I should open a separate issue for this but after validating my new .toml file and having it pass validation I was getting the following errors when i tried to run it on my simulation:

Barcode unclassified_reads has targets on 0 contigs, with 0 found in the provided reference.
This barcode has 0 total targets (+ve and -ve strands), covering approximately 0.00% of the genome.

Barcode classified_reads has targets on 0 contigs, with 0 found in the provided reference.
This barcode has 0 total targets (+ve and -ve strands), covering approximately 0.00% of the genome.

Barcode barcode_5_odds has targets on 11 contigs, with 11 found in the provided reference.
This barcode has 2152 total targets (+ve and -ve strands), covering approximately 2.10% of the genome.

Barcode barcode_08_notarget has targets on 0 contigs, with 0 found in the provided reference.
This barcode has 0 total targets (+ve and -ve strands), covering approximately 0.00% of the genome.

2023-10-11 19:50:03,645 readfish.targets readfish started in PHASE_SEQUENCING. Fully sequencing first read from each channel.
Traceback (most recent call last):
  File "/home/prom/miniconda3/envs/readfish2/bin/readfish", line 8, in <module>
    sys.exit(main())
  File "/home/prom/miniconda3/envs/readfish2/lib/python3.10/site-packages/readfish/_cli_base.py", line 61, in main
    raise SystemExit(args.func(parser, args, extras))
  File "/home/prom/miniconda3/envs/readfish2/lib/python3.10/site-packages/readfish/entry_points/targets.py", line 521, in run
    worker.run()
  File "/home/prom/miniconda3/envs/readfish2/lib/python3.10/site-packages/readfish/entry_points/targets.py", line 421, in run
    region_name=self.conf.get_region(result.channel).name,
AttributeError: 'NoneType' object has no attribute 'name'

It turns out (evidently since adding this fixed the issue) that I needed to have a region specified in my new toml for barcoded calling(shown below). The validation script should probably catch this issue. (+1 for all of the new example toml files and decimation around them now though this is much nicer to use)

[[regions]]
# Configuration for the first experimental region.
# This region has no targets as they are derived from the barcodes below.
name = "Experimental"
min_chunks = 1
max_chunks = 4
targets = []
single_on = "proceed"
multi_on = "proceed"
single_off = "proceed"
multi_off = "proceed"
no_seq = "proceed"
no_map = "proceed"
above_max_chunks = "proceed"
below_min_chunks = "proceed"
alexomics commented 12 months ago

Thank you for testing this out and the feedback! Was the TOML just something like this:

[caller_settings.guppy]
...

[mapper_settings.mappy]
...

[barcodes.unclassified]
...

[barcodes.classified]
...

[barcodes.barcode01]
...
jamesemery commented 12 months ago

^correct except with 2 barcodes. Adding [[regions]] to it evidently fixed the issue

mattloose commented 12 months ago

Thanks - we will double check this!

mattloose commented 12 months ago

@jamesemery we hope to push a fix to this in the morning (UK time). Apologies and thanks for spotting it!

mattloose commented 12 months ago

Hi @jamesemery - we have just pushed a release to pypi which fixes this issue. You will no longer need to provide a reigon to the toml file.

Please go ahead and try again with the new version (2023.1.1) which should now work.

I will close this issue for now, but feel free to reopen if the issue is not fixed.