Closed gchevignon closed 2 years ago
Hi @gchevignon,
Thanks for your interest in ReadBouncer. In the config.toml
file and the code examples is correct written as deplete_files
. In the README.md
there are some depletion_files
insetad of deplete_files
. Thanks for pointing that out, I will change it now.
Hi thanks for your reply ! Yes I understood there was no error in the code 1min before your reply ... But there is something weird...
If I use this toml:
usage = "classify" output_directory = '/home/userlocal/work/test_readbouncer.dir/' log_directory = '/home/userlocal/work/test_readbouncer.dir/'
[IBF]
kmer_size = 13
fragment_size = 100000
threads = 12
target_files = ['/home/userlocal/work/test_readbouncer.dir/Bacillus_subtilis_complete_genome.ibf', '/home/userlocal/work/test_readbouncer.dir/Enterococcus_faecalis_complete_genome.ibf', '/home/userlocal/work/test_readbouncer.dir/Escherichia_coli_complete_genome.ibf']
deplete_files = ['/home/userlocal/work/test_readbouncer.dir/Saccharomyces_cerevisiae_draft_genome.ibf']
read_files = ['/home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta']
exp_seq_error_rate = 0.1 #(unsigned float between 0 and 1; default: 0.1)
chunk_length = 250 #(unsigned integer; default: 250)
max_chunks = 5 #(unsigned integer; default: 5)
I got this:
/usr/local/ReadBouncer/bin/ReadBouncer /home/userlocal/Bureau/Scripts.dir/ReadBouncer.dir/config_classify_test.toml 41 bins were loaded in 0.00248442 seconds from the IBF 29 bins were loaded in 0.000980902 seconds from the IBF 50 bins were loaded in 0.000881843 seconds from the IBF 159 bins were loaded in 0.0077419 seconds from the IBF
Classification results of: /home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta
Real time : 40.4076 sec CPU time : 40.9985 sec Peak RSS : 106 MByte
then if I use this toml: (remove deplete line)
usage = "classify" output_directory = '/home/userlocal/work/test_readbouncer.dir/' log_directory = '/home/userlocal/work/test_readbouncer.dir/'
[IBF]
kmer_size = 13
fragment_size = 100000
threads = 12
target_files = ['/home/userlocal/work/test_readbouncer.dir/Bacillus_subtilis_complete_genome.ibf', '/home/userlocal/work/test_readbouncer.dir/Enterococcus_faecalis_complete_genome.ibf', '/home/userlocal/work/test_readbouncer.dir/Escherichia_coli_complete_genome.ibf']
read_files = ['/home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta'] exp_seq_error_rate = 0.1 #(unsigned float between 0 and 1; default: 0.1) chunk_length = 250 #(unsigned integer; default: 250) max_chunks = 5 #(unsigned integer; default: 5)
I got this :
/usr/local/ReadBouncer/bin/ReadBouncer /home/userlocal/Bureau/Scripts.dir/ReadBouncer.dir/config_classify_test.toml 41 bins were loaded in 0.00262547 seconds from the IBF 29 bins were loaded in 0.000993481 seconds from the IBF 50 bins were loaded in 0.000893316 seconds from the IBF
Classification results of: /home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta
Real time : 7.96554 sec CPU time : 16.4309 sec Peak RSS : 59 MByte
then if I use this toml: remove target line
usage = "classify" output_directory = '/home/userlocal/work/test_readbouncer.dir/' log_directory = '/home/userlocal/work/test_readbouncer.dir/'
[IBF]
kmer_size = 13
fragment_size = 100000
threads = 12
deplete_files = ['/home/userlocal/work/test_readbouncer.dir/Saccharomyces_cerevisiae_draft_genome.ibf'] read_files = ['/home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta'] exp_seq_error_rate = 0.1 #(unsigned float between 0 and 1; default: 0.1) chunk_length = 250 #(unsigned integer; default: 250) max_chunks = 5 #(unsigned integer; default: 5)
I got this:
/usr/local/ReadBouncer/bin/ReadBouncer /home/userlocal/Bureau/Scripts.dir/ReadBouncer.dir/config_classify_test.toml 159 bins were loaded in 0.00787578 seconds from the IBF
Classification results of: /home/userlocal/work/test_readbouncer.dir/SampleZMCDataSet.fasta
Real time : 41.5481 sec CPU time : 42.1981 sec Peak RSS : 68 MByte
So there is something I don't get here ... is there any Saccharomyces read in the test file?
Hi @gchevignon,
The deplete_files
contain the reference sequences for reads you would like to reject. Providing only deplete_files
will lead ReadBouncer to classify given reads as rejected if their prefixes match against at least one of the given files.
Providing target_files
: the reads will be assigned to the best matching file.
In your first results, where you provided both deplete_files
and target_files
, the results are from only deplete_files
due to a bug in the classification method if...else...else if statement
. After fixing:
Number of classified reads : 47874
Number of too short reads (len < 250) : 0
Number of all reads : 100000
Bacillus_subtilis_complete_genome : 14409 0.14409
Enterococcus_faecalis_complete_genome : 13553 0.13553
Escherichia_coli_complete_genome : 19912 0.19912
Average Processing Time Read Classification : 0.00134035
Providing the deplete_files
during adaptive sampling: every read that matches with at least one of the given file is rejected from the corresponding pore by sending an unblock message to MinKNOW for that read.
I hope that I could help, and thanks for testing that, I will fix it now.
Best, Ahmad
Resolved in version 1.2.0
Hello, Thanks a lot for your tool ! I am only at the classify step but this tool seems amazing. I have struggled a lot making the first test working because there is an error in your readme: In all your exemple of code for the toml file you use : deplete_files = ['xxx.fasta', 'xxx.fasta'] instead of depletion_files = ['xxx.fasta', 'xxx.fasta']. Best Germain