LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0

High count of no-barcode that map to my targets #363

Closed lborcard closed 2 weeks ago

lborcard commented 2 weeks ago

Dear Readfish team,

Thank you for your work. I am using readfish to enrich for certain targets, and I am testing it with a simulated run (2 hours of WGS). It seems that since Dorado there is a workaround to deal with the empty barcode signal (ref issue commented on by Matt Loose). I generally find a lot of no_barcode_found and unclassified barcodes when running a simulation with readfish. I was wondering whether this is due to the simulation run or an issue linked to Dorado?

Readfish simulated run with AS

Condition          ,Reads     ,Alignments      ,Alignments          ,Alignments ,Yield             ,Yield               ,Yield     ,Yield   ,Median read lengths ,Median read lengths ,Median read lengths ,Number of targets ,Percent target ,Estimated coverage
                   ,Total     ,On-Target       ,Off-Target          ,Total      ,On-Target         ,Off-Target          ,Total     ,Ratio   ,On-target           ,Off-target          ,Combined            ,Total             ,Total
barcode01          ,"63,753"  ,"4,227 (6.63%)" ,"59,526 (93.37%)"   ,"63,753"   ,10.12 Mb (23.36%) ,33.22 Mb (76.64%)   ,43.34 Mb  ,1:3.28  ,0 b                 ,556 b               ,556 b               ,0                 ,0.00%          ,0.00 X
barcode02          ,"74,569"  ,"5,005 (6.71%)" ,"69,564 (93.29%)"   ,"74,569"   ,12.28 Mb (23.53%) ,39.91 Mb (76.47%)   ,52.20 Mb  ,1:3.25  ,0 b                 ,565 b               ,565 b               ,0                 ,0.00%          ,0.00 X
classified_reads   ,77        ,0 (0.00%)       ,77 (100.00%)        ,77         ,0 b (0.00%)       ,88.17 Kb (100.00%)  ,88.17 Kb  ,0:0.00  ,0 b                 ,1.24 Kb             ,1.24 Kb             ,7                 ,100.00%        ,0.00 X
no_barcode_reads   ,"432,798" ,"9,232 (2.13%)" ,"423,566 (97.87%)"  ,"432,798"  ,22.41 Mb (5.21%)  ,407.29 Mb (94.79%)  ,429.70 Mb ,1:18.18 ,0 b                 ,961 b               ,961 b               ,0                 ,0.00%          ,0.00 X
unclassified_reads ,"294,399" ,0 (0.00%)       ,"294,399 (100.00%)" ,"294,399"  ,0 b (0.00%)       ,334.07 Mb (100.00%) ,334.07 Mb ,0:0.00  ,0 b                 ,1.30 Kb             ,1.30 Kb             ,0                 ,0.00%          ,0.00 X

Original WGS run:

barcode         alias           type            target_unclassified     acquisition_run_id                          protocol_group_id                       sample_id   flow_cell_id    started
barcode01       barcode01       test_sample                  144944     769dbe7dcada77ac43bd9d77ac63a7cd03d1e6d0    sample  no_sample   FAY86484        2024-05-15T12:37:41.871230+02:00
barcode02       barcode02       test_sample                  170079     769dbe7dcada77ac43bd9d77ac63a7cd03d1e6d0    sample  no_sample   FAY86484        2024-05-15T12:37:41.871230+02:00
unclassified    unclassified    na                            36927     769dbe7dcada77ac43bd9d77ac63a7cd03d1e6d0    sample  no_sample   FAY86484        2024-05-15T12:37:41.871230+02:00

best,

Loïc

github-actions[bot] commented 2 weeks ago

Thank you for your issue. Give us a little time to review it.

PS. You might want to check the FAQ if you haven't done so already.

This is an automated reply, generated by FAQtory

mattloose commented 2 weeks ago

When you run readfish on playback with a barcoded sample you can get very unexpected results. What you are seeing here is a consequence of how playback works. Imagine a single channel on the sequencer: playback provides the signal that passed through that channel exactly as it happened on the original run. If you send a message to the sequencer to throw away the read, the sequencer will break the playback read at that point and then start a new read from the signal still coming from the playback file. That is not a true read, so it will not begin with a barcode; you are starting halfway through a read, and it is correctly classified as unclassified (if that makes sense!).
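The effect described above can be sketched with a toy function. This is purely illustrative (no real MinKNOW or readfish API is used, and `split_playback_read` is a made-up name): only the first fragment of a replayed read starts at base 0, where the barcode sits, so every later fragment comes out without one.

```python
# Toy model of how playback "unblocks" split one recorded read.
# Illustrative only; this is not how MinKNOW is actually implemented.

def split_playback_read(read_len, unblock_at):
    """Break one original read into playback fragments.

    read_len   -- length of the original recorded read (bases)
    unblock_at -- bases replayed before each unblock decision
    Returns a list of (start, end, has_barcode) tuples.
    """
    fragments = []
    pos = 0
    while pos < read_len:
        end = min(pos + unblock_at, read_len)
        # Only the fragment starting at base 0 contains the barcode.
        fragments.append((pos, end, pos == 0))
        pos = end
    return fragments

# A 10 kb recorded read, unblocked every ~1 kb of signal:
for start, end, has_barcode in split_playback_read(10_000, 1_000):
    print(start, end, "barcode" if has_barcode else "no_barcode_found")
```

Running this shows one fragment with a barcode and nine classified as no_barcode_found, which is the pattern in the table above.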

I hope that helps.

lborcard commented 2 weeks ago

Thank you for the answer, I was wondering how it worked. So in general a simulated run will underestimate the number of true hits? Do you usually run several simulated runs to get an idea of the performance of AS?

lborcard commented 2 weeks ago

Do you usually get a better idea of the performance by not testing with barcoded experiments?

mattloose commented 2 weeks ago

If you wish to simulate barcoded runs (or non-barcoded runs) you can use our Icarust tool:

https://academic.oup.com/bioinformatics/article/40/4/btae141/7628125

It all depends on what you are trying to test with respect to performance as to which approach is better.

lborcard commented 2 weeks ago

The idea for us is basically to compare a WGS run to adaptive sampling. So we perform a WGS run and then run a simulated AS run for the same amount of time using the bulk file.

mattloose commented 2 weeks ago

OK - assume you are targeting a 1 Mb region of a 10 Mb genome. When you run your normal WGS (and record a bulkfile) you obtain 100x coverage of the 10 Mb genome (and therefore 100x of your 1 Mb region). When you run adaptive sampling on the playback of that bulkfile you will still end up with approximately 100x coverage of the 10 Mb genome. The reason is that in playback you don't actually throw away the read; you merely break a read that you don't want into smaller pieces. So to see the effect of adaptive sampling on a playback run you need to look at the read lengths in your 1 Mb target region vs your 9 Mb of off-target. You should see long reads "on-target" and short reads "off-target".

In contrast, Icarust will simulate reads being removed, but does so from simulated data. With Icarust you will see actual enrichment, but it is more theoretical.
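The 1 Mb / 10 Mb example above can be written out as back-of-the-envelope arithmetic (toy numbers only, matching the worked example rather than any real run): playback replays exactly the recorded bases, so total coverage is unchanged, and only the read-length distribution shifts.

```python
# Toy arithmetic for the playback-coverage argument above.
# All numbers are from the worked example, not from a real experiment.

GENOME = 10_000_000      # 10 Mb genome
TARGET = 1_000_000       # 1 Mb target region
ORIGINAL_COVERAGE = 100  # 100x WGS recorded in the bulkfile

# Playback replays the same recorded signal, so the same total bases
# arrive regardless of adaptive sampling decisions.
total_bases = GENOME * ORIGINAL_COVERAGE
playback_coverage = total_bases / GENOME
print(playback_coverage)  # still 100.0, no apparent enrichment

# What does change: an off-target 10 kb read unblocked after ~1 kb
# becomes ~1 kb fragments, while on-target reads keep full length.
on_target_read_len = 10_000
off_target_fragment_len = 1_000
print(on_target_read_len / off_target_fragment_len)  # 10.0x length ratio
```

So the signal to look for in playback is the on-target vs off-target read-length ratio, not a change in coverage.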

lborcard commented 2 weeks ago

I was also wondering how playback works when your amount of "on target" reads is limited (because of WGS): what happens when there are no more reads corresponding to your reference (say your 1 Mb region)?

I saw the release of Icarust; I was under the impression that it was not yet finished, but I will try it eventually. My initial thought was that playback runs were a more realistic approach to testing, but apparently I was wrong. Thanks for the explanation, it helps a lot!

mattloose commented 2 weeks ago

I wouldn't say you are wrong... it's just that playback doesn't actually remove the molecule from the sequencer, so it's really easy to misinterpret the results. We use playback to look at the relative lengths of the molecules we get on and off target. They should be as short as possible off target and as long as they originally were on target. You can also check mapping efficiency using playback. You just won't see any actual enrichment!

Hope all these comments are useful :-)

Adoni5 commented 2 weeks ago

[image: plots of per-condition yield from the playback run]

If you look at the yield columns for playback, you can see what Matt is talking about. You can't ever enrich in playback as you can't create new reads, only chunks of the original reads.

lborcard commented 2 weeks ago

Ok, gotcha! Thanks for the plots, very informative. How does MinKNOW pick the next read to come in when using playback: does the initial "timeline" hold, or is it randomly picked from the bag of molecules that went through that channel during the original run?

mattloose commented 2 weeks ago

The initial timeline holds - so reads play out exactly as they were recorded. All that happens is the signal from a read gets broken up into smaller chunks.

lborcard commented 2 weeks ago

So what happens to reads that are broken up? Will the next read come in at the same time as it was originally sequenced? Does the pore just wait?

mattloose commented 2 weeks ago

No. Imagine you have a read of 10 kb. The sequencer plays back the first 1,000 bases, but you then "unblock", so the sequencer ends the read at 1 kb. But the signal from the original 10 kb read keeps playing, so you will get a new read starting at 1 kb (plus a small bit) into the old read. And so on, until the original read finishes, and then it moves on to the next read. So your original 10 kb read could be chopped up into 10 fragments of 1 kb each.
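The chopping described above is easy to sketch as fragment-length arithmetic (illustrative only; `fragment_lengths` is a made-up helper, and the "plus a small bit" of latency is ignored here):

```python
# Sketch of the 10 kb example: each unblock ends the current playback
# read, but the original signal keeps playing, so the leftover signal
# becomes further reads. Not real sequencer behaviour, just arithmetic.

def fragment_lengths(read_len, unblock_len):
    """Lengths of the playback fragments one original read turns into."""
    lengths = []
    remaining = read_len
    while remaining > 0:
        chunk = min(unblock_len, remaining)
        lengths.append(chunk)
        remaining -= chunk
    return lengths

print(fragment_lengths(10_000, 1_000))   # ten 1 kb fragments
print(fragment_lengths(10_500, 1_000))   # ten 1 kb fragments plus a 500 b tail
```

Note the total bases are conserved: the fragments always sum back to the original read length, which is why yield and coverage look unchanged in playback.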

Does that make sense?

That is why you end up with reads without a barcode when you run adaptive sampling on a playback run.

lborcard commented 2 weeks ago

I see what you mean, it is much clearer now. So while the signal of the unblocked read is still playing, further reads can be produced in the same "pore", chopping it up?