LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0

Odd Playback Results #162

Closed: carze closed this issue 3 years ago

carze commented 3 years ago

Hi,

I've recently been attempting to get Readfish running in our setup described below:

CPU: AMD Ryzen 9 3950X (16C / 32T)
Memory: 128GB RAM
HDD: Samsung 970 Evo Plus 1TB
GPU: 2080Ti
OS: Ubuntu 18.04 LTS

Installing and testing the latest dev version of Readfish proceeded smoothly until I got to the targets test, where I see some pretty odd behavior. I should preface this by stating that I am using Guppy 5.0.11 and MinKNOW 21.06.2, which I know have been tested very sparsely. These issues might be entirely related to the versions of Guppy/MinKNOW I am using, but I figured I'd ask to get any possible input here.

All base calling is done in fast mode. It seems that in the newer versions of Guppy adaptive scaling isn't enabled by default (or at least that line is missing from the configs). Upon kicking off readfish with the test bulk playback file, I see mapping speeds that are slower than those shown in the tutorial:

2021-10-01 20:42:42,650 ru.ru_gen 77R/0.06052s
2021-10-01 20:42:43,049 ru.ru_gen 48R/0.05793s
2021-10-01 20:42:43,432 ru.ru_gen 13R/0.03938s
2021-10-01 20:42:43,849 ru.ru_gen 54R/0.05584s
2021-10-01 20:42:44,246 ru.ru_gen 45R/0.05276s
2021-10-01 20:42:44,635 ru.ru_gen 25R/0.04073s
2021-10-01 20:42:45,052 ru.ru_gen 53R/0.05656s
2021-10-01 20:42:45,455 ru.ru_gen 51R/0.05959s
2021-10-01 20:42:45,833 ru.ru_gen 21R/0.03698s
2021-10-01 20:42:46,252 ru.ru_gen 55R/0.05498s
2021-10-01 20:42:46,651 ru.ru_gen 43R/0.05389s
2021-10-01 20:42:47,052 ru.ru_gen 35R/0.05393s
2021-10-01 20:42:47,449 ru.ru_gen 44R/0.04974s
2021-10-01 20:42:47,868 ru.ru_gen 60R/0.06789s

These numbers don't seem that far off, though. What's interesting here is that as the run proceeded the mapping rate eventually slowed down to around 0.08 - 0.15s. Upon seeing these mapping rates I thought I might be able to get away with running readfish, but the histograms in MinKNOW seemed to indicate otherwise:

[Screenshot of the MinKNOW histogram, taken 2021-10-01 at 9:46:54 PM]

This was after letting the run go for about an hour, and it looks nothing like the output/images provided in the tutorial. These images look very similar to those seen in #149, where it also seems that Guppy 5.0.x / a newer version of MinKNOW was used, so maybe this is a MinKNOW thing. The last wrinkle here is that when I stop the run and run readfish summary on the output I see the following:

readfish summary ~/Documents/work/ringtx/tools/readfish/human_chr_selection.toml /var/lib/minknow/data/test_RU_human_chr_21_22/test_RU_chr_21_22/20211001_1845_MN29879_test_bulk_run_41d7c068/fastq_pass/
Using reference: /home/carze/Documents/work/ringtx/references/human/chm13.mmi
contig  number       sum  min     max    std   mean  median    N50
  chr1   14597   8345192  187   93643   1121    572     504    535
 chr10    8403   4999082  161  126100   1880    595     515    548
 chr11    9599   5216965  192   36992    508    543     502    520
 chr12    8376   4886880  173   64047   1188    583     511    535
 chr13    4828   3000785  195  176881   3121    622     490    527
 chr14    6422   4156166  182  129353   2966    647     512    572
 chr15    9736   6519545  171  394796   5107    670     503    568
 chr16    3758   2257347  214  103233   2035    601     504    557
 chr17    5539   3180446  219   48771    930    574     515    541
 chr18    7112   4106013  179  236086   2872    577     496    521
 chr19    4543   2648127  179   32335    679    583     516    556
  chr2   17744  10273244  142  277150   2813    579     497    528
 chr20    3495   2096325  221   77321   1525    600     517    556
 chr21      55   1789331  472  332036  59349  32533   10151  84124
 chr22      55    790894  269  119782  20629  14380    7663  31618
  chr3   13232   7805759  175  252563   2919    590     510    536
  chr4   12656   7720369  210  215829   3036    610     501    537
  chr5   12701   7513036  149  124741   2324    592     499    531
  chr6   10947   6138898  221   72181    923    561     511    528
  chr7   10589   6892862  181  393389   5222    651     506    559
  chr8    8440   4920215  149  223359   2962    583     500    530
  chr9   10339   6796179  152  360227   5076    657     507    586
  chrM     406    277848  246   11089    818    684     604    657
  chrX    8853   5688127  168  190757   3645    643     506    566

These seem to look alright? In fact they almost look too good to be true? Are these truly believable results? Any thoughts on what I'm seeing here?

Thanks, Cesar

mattloose commented 3 years ago

Hi

Sorry for the delay in getting back to you. The short answer is that this does appear to be working.

Your batch processing times are a little higher than ours, but your batch sizes are larger. The important thing is that they are less than the length of the data chunks you are collecting.
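As a rough illustration of that check, the sketch below parses a few of the `ru.ru_gen` log lines quoted earlier in this thread and compares the slowest batch time against the chunk length. The `CHUNK_SECONDS` value of 0.4 is an assumption (it is consistent with the roughly 0.4 s spacing of the log timestamps, but check your own MinKNOW configuration), and `parse_batches` is a hypothetical helper, not part of readfish.

```python
import re

# A few lines copied from the log earlier in this thread.
LOG = """\
2021-10-01 20:42:42,650 ru.ru_gen 77R/0.06052s
2021-10-01 20:42:43,049 ru.ru_gen 48R/0.05793s
2021-10-01 20:42:43,432 ru.ru_gen 13R/0.03938s
2021-10-01 20:42:47,868 ru.ru_gen 60R/0.06789s
"""

# Assumption: length in seconds of the data chunks MinKNOW sends.
# Replace with the value from your own configuration.
CHUNK_SECONDS = 0.4

# Matches the trailing "<reads>R/<seconds>s" field of each log line.
BATCH_RE = re.compile(r"(\d+)R/([\d.]+)s\s*$")


def parse_batches(text):
    """Return a list of (reads_in_batch, seconds_taken) tuples."""
    batches = []
    for line in text.splitlines():
        match = BATCH_RE.search(line)
        if match:
            batches.append((int(match.group(1)), float(match.group(2))))
    return batches


batches = parse_batches(LOG)
slowest = max(seconds for _, seconds in batches)
print(f"{len(batches)} batches parsed, slowest took {slowest:.5f}s")
print("keeping up" if slowest < CHUNK_SECONDS else "falling behind")
```

Under the same assumption, the later batch times of 0.08 - 0.15s mentioned in the original post would also still be below the chunk length.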

The MinKNOW histogram is interesting. I’d like to see that with “hide outliers” unchecked and “split by end reason” checked. I expect you have a lot of short reads caused by unblocks and few long reads from selected chromosomes.

Your analysis of the data suggests it is working, as you are seeing a clear difference in on and off target read lengths.
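To make that difference concrete, here is a minimal sketch that pools a handful of rows from the `readfish summary` table above and compares the mean read length on and off target. It assumes chr21 and chr22 are the targets (inferred from the run name `test_RU_human_chr_21_22`, not confirmed in the thread), and `mean_length` is a hypothetical helper.

```python
# (contig, number of reads, total bases) copied from the summary table above.
ROWS = [
    ("chr1", 14597, 8345192),
    ("chr2", 17744, 10273244),
    ("chr21", 55, 1789331),
    ("chr22", 55, 790894),
    ("chrX", 8853, 5688127),
]

# Assumption: these are the target contigs, inferred from the run name.
TARGETS = {"chr21", "chr22"}


def mean_length(rows):
    """Mean read length over the pooled rows: total bases / total reads."""
    reads = sum(number for _, number, _ in rows)
    bases = sum(total for _, _, total in rows)
    return bases / reads


on_target = [row for row in ROWS if row[0] in TARGETS]
off_target = [row for row in ROWS if row[0] not in TARGETS]

print(f"on-target mean read length:  {mean_length(on_target):.0f}")
print(f"off-target mean read length: {mean_length(off_target):.0f}")
```

Short off-target reads alongside much longer on-target reads is the pattern expected when unblocking is working.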

Your result does differ from that reported in the other issue in that your analysis is showing a difference in read length. All the numbers look good and this should work for you in a real experiment.

Hope that helps.