JensUweUlrich / ReadBouncer

Fast and scalable nanopore adaptive sampling
GNU General Public License v3.0

cannot run with deplete_files #69

Open bjaysheel opened 10 months ago

bjaysheel commented 10 months ago

Hi, I am running an experiment where ideally I would like to provide a host genome and use it as a deplete file, as the host will be the constant when working out in the field. I am able to generate an index file (IBF) from the host genome file, but when I run ReadBouncer, I get no output for about a minute, and then the only message I get is "killed" (nothing in the log file either). Here is my TOML file:

```toml
usage = "target"
output_directory = "/nanopore/ReadBouncer-1.2.2-Linux/output/Bird_depleted"
log_directory = "/nanopore/ReadBouncer-1.2.2-Linux/logs"

[IBF]
kmer_size = 15
fragment_size = 200000
threads = 8
deplete_files = ['/nanopore/ref/Bird_contigs_samba_M3000.ibf']

[MinKNOW]
host = "127.0.0.1"
port = "9502"
flowcell = "MN44041"

[BaseCaller]
caller = "Guppy"
host = "ipc:///home/dbi"
port = 5000
threads = 8
config = "dna_r10.4.1_e8.2_400bps_hac"
```

I am not sure what I am doing wrong.

However, if I use a similar config with target_files instead of deplete_files (in this case I know the target, so it's okay here, but it is not the solution I want in the field), ReadBouncer seems to work. Here is the TOML file for target_files:

```toml
usage = "target"
output_directory = "/nanopore/ReadBouncer-1.2.2-Linux/output/Pathogen_target"
log_directory = "/nanopore/ReadBouncer-1.2.2-Linux/logs"

[IBF]
kmer_size = 15
fragment_size = 200000
threads = 8
target_files = ['/nanopore/ref/Pathogen_contigs_samba_M1000.ibf']

[MinKNOW]
host = "127.0.0.1"
port = "9502"
flowcell = "MN44041"

[BaseCaller]
caller = "Guppy"
host = "ipc:///home/dbi"
port = 5000
threads = 8
config = "dna_r10.4.1_e8.2_400bps_hac"
```

I'm hoping you can help me find out why I cannot get ReadBouncer to work with deplete_files.

The Guppy server settings are unchanged between the two runs, and the Guppy config file is exactly the same; sample, device, device name, and computer are also constant. The only things that change are that I use deplete_files instead of target_files, and the two index files differ: the deplete file is 13 GB compared to the 1.2 GB target file. But I do have 32 GB of memory on the machine, so I would argue that memory can't be the issue. MinKNOW basecalling is turned off as well.

I only have one flowcell left, so at most two runs, i.e., not much room for error. Hoping you can help.

Thank you,
Jaysheel

bjaysheel commented 10 months ago

Is there a limit on how large an IBF file can be? It seems that ReadBouncer can't load the index file; it has 43766 bins (43415 sequences). If I provide the FASTA file, it is converted to an IBF and then the program is "killed", but if I provide a smaller file (3783 bins), ReadBouncer starts up as expected.

ReadBouncer is running on a machine with 32 GB of RAM, and the input IBF file is 13 GB.

JensUweUlrich commented 9 months ago

Dear Jaysheel,

Sorry for not getting back to you sooner. I was busy with my PhD defense last week ;-)

44k bins is a lot and will definitely harm ReadBouncer's classification performance, which may cause the error. With a k-mer size of 15, you could increase the fragment size to 400,000, which should reduce the number of bins by 50%. And with the 10.4.1_hac basecalling model, you could even increase the k-mer size to 17 and the fragment size to 500,000, because the lower error rates of the new pores and basecalling models should allow for larger k-mer values without impacting classification sensitivity and specificity.
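Applied to the [IBF] section of the first config above, the suggested values would look roughly like this (a sketch only; the .ibf would of course need to be rebuilt from the FASTA with these parameters first):

```toml
[IBF]
kmer_size = 17          # up from 15; viable with the lower r10.4.1 hac error rate
fragment_size = 500000  # up from 200000; larger fragments mean fewer bins
threads = 8
deplete_files = ['/nanopore/ref/Bird_contigs_samba_M3000.ibf']  # rebuilt with the values above
```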

Please try to rebuild the IBF with those parameters and let me know whether it works out. If not, I would like to reproduce the error and debug the code.

Looking forward to hearing from you.
Best,
Jens

bjaysheel commented 9 months ago

Hi Jens, I took your recommendation and recreated the index file with a k-mer size of 17 and a fragment size of 500,000. It didn't do much to reduce the number of bins; I now have 43416 bins for the 43415 sequences in the input file. I should note that I had to run this on a server, as I got the "killed" message when running on the local desktop machine with 32 GB of RAM.

While waiting for your reply, I ran a simulation, which failed with a segmentation fault and no errors to report. Here is the setup: I ran MinKNOW on a Mk1B, set up for a simulation run with basecalling turned on, the Guppy server running, and ReadBouncer provided with the target IBF file. After a few hours of running, ReadBouncer stopped with "Segmentation Fault"; there are no error messages in any of the log files. I should also note that while ReadBouncer was running, it created a TargetReads.fasta file and a DepletedReads.fasta file, but these were created in the folder from which I ran the program, not in the output folder specified in the TOML file. The target IBF file has 3783 bins.

I am wondering if ReadBouncer is running into memory issues. Is there a way to set up ReadBouncer and Guppy on a remote machine that has a lot more memory than 32 GB, while running MinKNOW on the local desktop?
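Since the config already takes a host and port for both MinKNOW and the basecaller, I imagine a remote setup might only need the [MinKNOW] host changed to point at the desktop. Something like the following, though this is untested; 192.168.1.50 is just a placeholder for the desktop's IP, and MinKNOW would presumably have to accept remote connections:

```toml
# Untested sketch: ReadBouncer and Guppy run on the large-memory server,
# while MinKNOW and the sequencer stay on the local desktop.
[MinKNOW]
host = "192.168.1.50"   # placeholder: IP of the desktop running MinKNOW
port = "9502"
flowcell = "MN44041"

[BaseCaller]
caller = "Guppy"
host = "ipc:///home/dbi"  # Guppy runs locally, on the same server as ReadBouncer
port = 5000
threads = 8
config = "dna_r10.4.1_e8.2_400bps_hac"
```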

Thank you Jaysheel

JensUweUlrich commented 9 months ago

Hi Jaysheel,

Now I get it. You have a draft genome consisting of 43k contigs. Since every contig occupies at least one bin, a 43k-contig assembly yields at least 43k bins no matter how large you make the fragment size. What you could do as a workaround is simply concatenate the contigs into larger stretches (a sketch of this step follows below) and then let ReadBouncer create the fragments for the binning process during indexing. Although this sounds a bit weird, it will not affect ReadBouncer's ability to classify the read prefixes as host reads. On the contrary, you will reduce the number of bins tremendously, which should reduce the computational demands as well. You can also contact me directly if this still does not resolve the issue.

Best Jens
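
For reference, a minimal sketch of the concatenation step Jens describes (a hypothetical standalone helper, not part of ReadBouncer; the ~500 kb chunk size mirrors the fragment size suggested above, and the chunk naming is illustrative):

```python
# Hypothetical helper: concatenate draft-genome contigs into ~500 kb stretches
# so that ReadBouncer's own fragmentation during indexing, rather than the 43k
# contig boundaries, determines the number of IBF bins.
import sys

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def concatenate(in_path, out_path, chunk_size=500_000):
    """Join contigs end-to-end and emit records of at least chunk_size bases."""
    buf, n, chunk_id = [], 0, 0
    with open(out_path, "w") as out:
        for _, seq in read_fasta(in_path):
            buf.append(seq)
            n += len(seq)
            if n >= chunk_size:
                out.write(f">chunk_{chunk_id}\n{''.join(buf)}\n")
                chunk_id, buf, n = chunk_id + 1, [], 0
        if buf:  # flush whatever remains as a final, shorter record
            out.write(f">chunk_{chunk_id}\n{''.join(buf)}\n")

if __name__ == "__main__":
    concatenate(sys.argv[1], sys.argv[2])
```

The concatenated FASTA would then be indexed as usual, e.g. `python concat_contigs.py Bird_contigs.fasta Bird_concat.fasta` followed by the normal IBF build.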