JensUweUlrich / ReadBouncer

Fast and scalable nanopore adaptive sampling
GNU General Public License v3.0
33 stars 2 forks source link

Error in readme #48

Closed gchevignon closed 2 years ago

gchevignon commented 2 years ago

Hello, Thanks a lot for your tool ! I am only at the classify step but this tool seems amazing. I have struggled a lot making the first test working because there is an error in your readme: In all your exemple of code for the toml file you use : deplete_files = ['xxx.fasta', 'xxx.fasta'] instead of depletion_files = ['xxx.fasta', 'xxx.fasta']. Best Germain

lutfia95 commented 2 years ago

Hi @gchevignon, Thanks for your interest in ReadBouncer. In the config.toml file and the code examples is correct written as deplete_files. In the README.md there are some depletion_files insetad of deplete_files. Thanks for pointing that out, I will change it now.

gchevignon commented 2 years ago

Hi thanks for your reply ! Yes I understood there was no error in the code 1min before your reply ... But there is something weird...

If I use this toml:

usage = "classify" output_directory = '/home/userlocal/work/test_readbouncer.dir/' log_directory = '/home/userlocal/work/test_readbouncer.dir/'

[IBF]

kmer_size = 13
fragment_size = 100000
threads = 12
target_files = ['/home/userlocal/work/test_readbouncer.dir/Bacillus_subtilis_complete_genome.ibf', '/home/userlocal/work/test_readbouncer.dir/Enterococcus_faecalis_complete_genome.ibf', '/home/userlocal/work/test_readbouncer.dir/Escherichia_coli_complete_genome.ibf'] deplete_files = ['/home/userlocal/work/test_readbouncer.dir/Saccharomyces_cerevisiae_draft_genome.ibf'] read_files = ['/home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta'] exp_seq_error_rate = 0.1 #(unsigned float between 0 and 1; default: 0.1) chunk_length = 250 #(unsigned integer; default: 250) max_chunks = 5 #(unsigned integer; default: 5)

I got this:

/usr/local/ReadBouncer/bin/ReadBouncer /home/userlocal/Bureau/Scripts.dir/ReadBouncer.dir/config_classify_test.toml 41 bins were loaded in 0.00248442 seconds from the IBF 29 bins were loaded in 0.000980902 seconds from the IBF 50 bins were loaded in 0.000881843 seconds from the IBF 159 bins were loaded in 0.0077419 seconds from the IBF

Classification results of: /home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta

------------------------------- Final Results ------------------------------- Number of classified reads : 5198 Number of of too short reads (len < 250) : 0 Number of all reads : 100000 Bacillus_subtilis_complete_genome : 0 0 Enterococcus_faecalis_complete_genome : 0 0 Escherichia_coli_complete_genome : 0 0 Average Processing Time Read Classification : 0.000388538

Real time : 40.4076 sec CPU time : 40.9985 sec Peak RSS : 106 MByte

then if I use this toml: (remove deplete line)

usage = "classify" output_directory = '/home/userlocal/work/test_readbouncer.dir/' log_directory = '/home/userlocal/work/test_readbouncer.dir/'

[IBF]

kmer_size = 13
fragment_size = 100000
threads = 12
target_files = ['/home/userlocal/work/test_readbouncer.dir/Bacillus_subtilis_complete_genome.ibf', '/home/userlocal/work/test_readbouncer.dir/Enterococcus_faecalis_complete_genome.ibf', '/home/userlocal/work/test_readbouncer.dir/Escherichia_coli_complete_genome.ibf']

read_files = ['/home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta'] exp_seq_error_rate = 0.1 #(unsigned float between 0 and 1; default: 0.1) chunk_length = 250 #(unsigned integer; default: 250) max_chunks = 5 #(unsigned integer; default: 5)

I got this :

/usr/local/ReadBouncer/bin/ReadBouncer /home/userlocal/Bureau/Scripts.dir/ReadBouncer.dir/config_classify_test.toml 41 bins were loaded in 0.00262547 seconds from the IBF 29 bins were loaded in 0.000993481 seconds from the IBF 50 bins were loaded in 0.000893316 seconds from the IBF

Classification results of: /home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta

------------------------------- Final Results ------------------------------- Number of classified reads : 100000 Number of of too short reads (len < 250) : 0 Number of all reads : 100000 Bacillus_subtilis_complete_genome : 13396 0.13396 Enterococcus_faecalis_complete_genome : 12106 0.12106 Escherichia_coli_complete_genome : 14760 0.1476 Average Processing Time Read Classification : 6.35708e-05

Real time : 7.96554 sec CPU time : 16.4309 sec Peak RSS : 59 MByte

then if I use this toml: remove target line

usage = "classify" output_directory = '/home/userlocal/work/test_readbouncer.dir/' log_directory = '/home/userlocal/work/test_readbouncer.dir/'

[IBF]

kmer_size = 13
fragment_size = 100000
threads = 12

deplete_files = ['/home/userlocal/work/test_readbouncer.dir/Saccharomyces_cerevisiae_draft_genome.ibf'] read_files = ['/home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta'] exp_seq_error_rate = 0.1 #(unsigned float between 0 and 1; default: 0.1) chunk_length = 250 #(unsigned integer; default: 250) max_chunks = 5 #(unsigned integer; default: 5)

I got this:

/usr/local/ReadBouncer/bin/ReadBouncer /home/userlocal/Bureau/Scripts.dir/ReadBouncer.dir/config_classify_test.toml 159 bins were loaded in 0.00787578 seconds from the IBF

Classification results of: /home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta

------------------------------- Final Results ------------------------------- Number of classified reads : 5198 Number of of too short reads (len < 250) : 0 Number of all reads : 100000 Average Processing Time Read Classification : 0.000399753

Real time : 41.5481 sec CPU time : 42.1981 sec Peak RSS : 68 MByte

So there is something I don't get here ... is there any Saccharomyces read in the test file?

lutfia95 commented 2 years ago

Hi @gchevignon,

The deplete_files contain the reference sequences for reads you would like to reject. Providing only deplete_files will lead ReadBouncer to classify given reads as rejected if their prefixes match against at least one of the given files. Providing target_files : the reads will be assigned to the best matching file.

In your first results, where you provided both deplete_files and target_files, the results are from only deplete_files due to a bug in the classification method if...else...else if statement. After fixing:

Number of classified reads : 47874
Number of too short reads (len < 250) : 0
Number of all reads : 100000
Bacillus_subtilis_complete_genome : 14409 0.14409
Enterococcus_faecalis_complete_genome : 13553 0.13553
Escherichia_coli_complete_genome : 19912 0.19912
Average Processing Time Read Classification : 0.00134035

Providing the deplete_files during adaptive sampling: every read that matches with at least one of the given file is rejected from the corresponding pore by sending an unblock message to MinKNOW for that read.

I hope that I could help, and thanks for testing that, I will fix it now.

Best, Ahmad

JensUweUlrich commented 2 years ago

Resolved in version 1.2.0