Kirk3gaard / 2023-basecalling-benchmarks

MIT License

Q about duplex reads #1

Open aistBMRG opened 1 year ago

aistBMRG commented 1 year ago

Hi,

Thanks very much for the benchmark -- interesting and in line with our in-house findings.

Now, small question. This is all based on simplex reads? Did you use duplex_tools to split reads? Actually, I tried guppy_basecaller previously to split the duplex reads but that seemed not to work (no reads annotated with _1 or _2, similar to duplex tools ...) ... any experience with this?

Also, after running dorado, did you perform any adapter trimming prior to assembly, maybe using guppy_barcoder? It seems that Flye is quite good at handling reads with adapters but just curious regarding your workflow.

My workflow now for multiplexed samples is to run dorado, then split into simplex reads using duplex_tools (duplex reads represent <5% of reads in our current dataset, so I just ended up splitting the reads without duplex basecalling), then run guppy_barcoder for demultiplexing and concurrent adapter/barcode trimming, remove CS DNA reads using nanolyse, assemble with flye, and generate the consensus with medaka ... does that sound reasonable to you?

Thanks for any input - it is very appreciated.

Dieter

Kirk3gaard commented 1 year ago

Hi Dieter

Really great to hear that it is in line with your in-house results. I am really impressed with the performance of the nanopore-only assemblies that can be achieved with R10.4.1 data.

I did not do any adapter trimming. Flye seems to take care of that pretty well (https://github.com/fenderglass/Flye/issues/100#issuecomment-483867054).

Duplex fraction for this run was also pretty low as I loaded too much DNA. So that is why duplex data is not included in the benchmark even though it is even better for the assemblies. I just did not have enough data for doing proper subsetting.

Yes, your workflow seems to be exactly what we are doing as well. It is really convenient to be able to limit polishing to one round of medaka now, with no other tools or sequencing data types.

Best regards Rasmus

aistBMRG commented 1 year ago

Hi Rasmus,

Thanks for the quick response -- very much appreciated. Yes, using only ONT reads makes things much easier; the accuracy that can be achieved now is quite amazing.

Just a small remaining question. I guess R10.4.1 data can/will contain chimeric reads. In your analysis, you did not need to split the reads based on adapters, as your lab described in the Nature Methods paper, using duplex_tools? Just getting started again with ONT sequencing after a break and things are moving so quickly so my knowledge is quickly outdated ...

Thanks!

Dieter

Kirk3gaard commented 1 year ago

Hi Dieter

You are right. R10.4.1 data will contain chimeric reads from molecules that have been ligated together. I did not take that into account in this analysis; these reads are straight from the basecaller, apart from a q10 filter for HAC and SUP. How much of the data consists of chimeric reads will vary depending on the library prep, so it might be safer to at least run the splitting tool and check whether it could be a problem. It has been very clear when we have sequenced amplicons in the past: we have seen read length peaks at multiples of the 16S length.
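As a rough illustration of the read-length check described above, here is a minimal Python sketch that bins read lengths by the nearest whole multiple of an expected amplicon length. The function name, the 1500 bp unit length, the 10% tolerance, and the example lengths are all assumptions made for illustration; this is not the analysis used in the benchmark and it is not a replacement for a proper read-splitting tool.

```python
from collections import Counter


def length_multiples(read_lengths, unit_length, tolerance=0.1):
    """Count reads by the nearest whole multiple of unit_length.

    A read is assigned to multiple k (k >= 1) when its length is within
    tolerance * unit_length of k * unit_length. All other reads are
    counted under key 0.
    """
    counts = Counter()
    for length in read_lengths:
        multiple = round(length / unit_length)
        deviation = abs(length - multiple * unit_length)
        if multiple >= 1 and deviation <= tolerance * unit_length:
            counts[multiple] += 1
        else:
            counts[0] += 1
    return counts


# Hypothetical read lengths (bp) and a hypothetical amplicon length.
lengths = [1480, 1510, 1495, 3010, 2990, 4500, 800]
counts = length_multiples(lengths, unit_length=1500)

# Reads at 2x or more of the unit length are candidate concatemers.
chimeric_fraction = sum(n for m, n in counts.items() if m >= 2) / len(lengths)
print(dict(counts), round(chimeric_fraction, 3))
```

A clear population at 2x or 3x the unit length suggests ligated molecules, which would be a reason to run a splitting step before assembly. For whole-genome libraries there is no single expected length, so this check only applies to amplicon-style data.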

Good luck with your next nanopore journey. It is a really good time to revisit nanopore data, as it is now sufficient as a stand-alone technology for bacterial genomes, more robust, requires lower input amounts, has simpler bioinformatics, etc. We have been really happy with the kit14 chemistry so far and hope that it will remain stable for some time, as it is "good enough" for our needs and stability improves the chance that we will get a full set of modification models etc.

Best regards Rasmus

aistBMRG commented 1 year ago

Thanks a lot for your input, Rasmus. I really appreciate it.

Always looking forward to new results from your lab/work!

Best,

Dieter

aistBMRG commented 1 year ago

Hi again,

Sorry, but let me bug you with a small question if you have time. It seems that you are using "r1041_e82_400bps_sup_g615" for medaka polishing of reads basecalled with dorado. Is there any reason you are not using a dorado model, which I guess would be "r1041_e82_400bps_sup_v4.0.0"? It is difficult to keep track of all the different models nowadays ... A v4.1.0 model for medaka does not seem to be available yet.
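To make the difference between the two model names easier to see, here is a small Python sketch that splits a name into its underscore-separated fields. The field labels, and the reading of a `g` suffix as a guppy-era model and a `v` suffix as a dorado model version, are assumptions inferred from the two names quoted in this thread; `parse_model_name` is a hypothetical helper and not part of medaka or dorado.

```python
def parse_model_name(name):
    """Split a five-field model name into labelled parts.

    The labels are an assumed reading of the naming pattern, not an
    official specification.
    """
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields: {name!r}")
    pore, chemistry, speed, accuracy, caller_version = parts
    if caller_version.startswith("g"):
        basecaller = "guppy"   # assumed: g615 marks a guppy-era model
    elif caller_version.startswith("v"):
        basecaller = "dorado"  # assumed: v4.0.0 is a dorado model version
    else:
        basecaller = "unknown"
    return {
        "pore": pore,
        "chemistry": chemistry,
        "speed": speed,
        "accuracy": accuracy,
        "caller_version": caller_version,
        "basecaller": basecaller,
    }


guppy_model = parse_model_name("r1041_e82_400bps_sup_g615")
dorado_model = parse_model_name("r1041_e82_400bps_sup_v4.0.0")

# Fields on which the two models differ.
differing = sorted(k for k in guppy_model if guppy_model[k] != dorado_model[k])
print(differing)
```

Under these assumptions the two names agree on pore, chemistry, speed, and accuracy tier, and differ only in the final field, which is the part to match against the basecaller that produced the reads.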

Thanks a lot.

Dieter

Kirk3gaard commented 1 year ago

Hi Dieter

Thanks for spotting this. There were no medaka models for dorado when I started this. Will probably use the 4.0.0 model for both then.

Best regards Rasmus

Kirk3gaard commented 1 year ago

Noticed that the updated conda version of medaka is not available yet. https://github.com/nanoporetech/medaka/issues/422#issue-1609835729

aistBMRG commented 1 year ago

I installed it using pip in a conda environment; this seemed to work for version 1.7.3.

Kirk3gaard commented 1 year ago

Yeah that seemed to do the trick (plots have been updated).

Also noticed that there was no updated medaka model for fast 4.0.0 data, so I used a guppy model for that as well, and it still seems to offer a decent improvement.

aistBMRG commented 1 year ago

Thanks for the update. I'll check the data/figures.

It seems that polishing with medaka has minimal effect when using sup accuracy basecalling, right? For the bacterial genomes I am working on, without references for comparison, medaka polishing of assemblies generated with flye (150x coverage, subset from ~300x coverage using filtlong with a weight of 30 for quality) sometimes only makes a few (<10) changes. For others, the number of changes can be ~200, but I am not sure whether polishing may introduce some errors ... Do you have any insights here based on your data?
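For reference, the coverage arithmetic behind subsetting from ~300x to 150x can be sketched in a few lines of Python. The 5 Mb genome size is a hypothetical value chosen for illustration, and both helper functions are assumptions of this sketch, not part of filtlong.

```python
def target_bases(genome_size_bp, coverage):
    """Total number of bases needed to reach the given mean coverage."""
    return int(genome_size_bp * coverage)


def keep_fraction(current_coverage, target_coverage):
    """Fraction of the current bases to keep, capped at 1.0."""
    return min(1.0, target_coverage / current_coverage)


genome_size = 5_000_000  # hypothetical bacterial genome size in bp

bases_needed = target_bases(genome_size, coverage=150)
fraction = keep_fraction(current_coverage=300, target_coverage=150)
print(bases_needed, fraction)
```

With these example numbers, 150x corresponds to 750,000,000 bases, which is half of the ~300x dataset. A read filter that accepts a total base target could be given that number, with the genome size replaced by the real estimate for each isolate.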

Thanks!

Dieter