Closed lskatz closed 5 months ago
Hi,
The data generated by the unit tests are mock data meant only for testing individual functions, and can't be used to assess the tool's overall functionality.
To test the full process of Binette, I've set up a test data repository: https://github.com/genotoul-bioinfo/Binette_TestData. This repository contains a minimal dataset and the expected outcomes, which are used in the CI workflow to ensure everything works as it should.
You can find instructions on how to run this test locally in the Test section of the documentation: https://binette.readthedocs.io/en/latest/tests.html#functional-tests.
Keep in mind that these are very basic datasets, designed just to test the tool, and don't reflect a real binning analysis. I created this data by splitting some genomes to mimic metagenomic data.
Best,
Hi, I tried to run Binette end to end, but something went wrong and I can't tell what. Can you look this over?
Steps:
I downloaded the Kickstart metagenome from the above dataset (SAMN05024035).
I ran megahit to get an assembly
megahit -1 Kickstart_1.fastq -2 Kickstart_2.fastq --out-dir Kickstart.megahit --out-prefix R1 --num-cpu-threads 12
I mapped the reads to the assembly to produce Kickstart.megahit.sorted.bam as input for SemiBin2:
ref=Kickstart.megahit/R1.contigs.fa
R1=Kickstart_1.fastq
R2=${R1/_1/_2}
bowtie2-build -f $ref $ref
bowtie2 -q --fr -x $ref -1 $R1 -2 $R2 -S Kickstart.sam -p 12
samtools view -h -F 4 -b -S -o Kickstart.unsorted.bam -@ 12 Kickstart.sam
samtools sort -m 1000000000 Kickstart.unsorted.bam -o Kickstart.megahit.sorted.bam -@ 12
samtools flagstats Kickstart.megahit.sorted.bam
6275427 + 0 in total (QC-passed reads + QC-failed reads)
6275427 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
6275427 + 0 mapped (100.00% : N/A)
6275427 + 0 primary mapped (100.00% : N/A)
6275427 + 0 paired in sequencing
3160305 + 0 read1
3115122 + 0 read2
5859266 + 0 properly paired (93.37% : N/A)
6146726 + 0 with itself and mate mapped
128701 + 0 singletons (2.05% : N/A)
49282 + 0 with mate mapped to a different chr
34450 + 0 with mate mapped to a different chr (mapQ>=5)
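(As a side note on the commands above, the `${R1/_1/_2}` expansion is bash pattern substitution: the first `_1` in the value is replaced with `_2`, deriving the mate filename. A minimal illustration:)

```shell
R1=Kickstart_1.fastq
R2=${R1/_1/_2}   # bash pattern substitution: the first "_1" becomes "_2"
echo "$R2"       # Kickstart_2.fastq
```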
I ran SemiBin2 three times to give me different results/inputs for Binette. Each iteration uses a random seed and gives a different output.
for rep in 1 2 3; do
SemiBin2 single_easy_bin -i $ref -b Kickstart.megahit.sorted.bam -o Kickstart.megahit.semibin$rep -p 12 > Kickstart.megahit.semibin.log 2>&1;
done
For Binette, I prepared an input folder and then ran it
for i in *megahit.semibin*; do
# Remove header
tail -n +2 $i/contig_bins.tsv > binette.tmp/$i.tsv;
done;
# oops it tried to read from the log file
rm binette.tmp/Kickstart.megahit.semibin.log.tsv
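In hindsight, a glob with a trailing slash matches directories only, which would have skipped the log file in the first place. A minimal sketch with mock directory names (not the actual run):

```shell
# Mock setup: two bin-set directories plus a stray log file (illustrative names)
mkdir -p demo/binette.tmp demo/a.semibin1 demo/a.semibin2
printf 'contig\tbin\nk141_1\t0\n' > demo/a.semibin1/contig_bins.tsv
printf 'contig\tbin\nk141_1\t1\n' > demo/a.semibin2/contig_bins.tsv
touch demo/a.semibin.log
cd demo
# The trailing slash restricts the glob to directories, so the log file is skipped
for d in *semibin*/; do
  tail -n +2 "$d/contig_bins.tsv" > "binette.tmp/${d%/}.tsv"
done
ls binette.tmp/
```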
ls binette.tmp/ -lh
total 242K
-rw-------. 1 user users 81K Jun 24 10:50 Kickstart.megahit.semibin2.tsv
-rw-------. 1 user users 81K Jun 24 10:50 Kickstart.megahit.semibin3.tsv
-rw-------. 1 user users 81K Jun 24 10:50 Kickstart.megahit.semibin.tsv
$ head binette.tmp/*
==> binette.tmp/Kickstart.megahit.semibin2.tsv <==
k141_85796 0
k141_53630 1
k141_21463 2
k141_85804 3
k141_32191 4
k141_42940 5
k141_75081 6
k141_10736 7
k141_42942 3
k141_32194 1
==> binette.tmp/Kickstart.megahit.semibin3.tsv <==
k141_85796 0
k141_53630 1
k141_21463 2
k141_85804 1
k141_32191 3
k141_42940 4
k141_75081 5
k141_10736 6
k141_42942 1
k141_32194 3
==> binette.tmp/Kickstart.megahit.semibin.tsv <==
k141_85796 0
k141_53630 1
k141_21463 2
k141_85804 1
k141_32191 3
k141_42940 4
k141_75081 5
k141_10736 6
k141_42942 7
k141_32194 1
So at this point I appear to have three different bin sets as input for Binette, and I'm ready to run it.
binette --contig2bin_tables binette.tmp/Kickstart.megahit.semibin.tsv --contig2bin_tables binette.tmp/Kickstart.megahit.semibin2.tsv --contig2bin_tables binette.tmp/Kickstart.megahit.semibin3.tsv --threads 12 --outdir Kickstart.megahit.semibin.1.2.3.binette --contigs $ref --checkm2_db $HOME/db/checkm2/CheckM2_database/uniref100.KO.1.dmnd >& binette.log &
tail binette.log
100%|██████████| 5945/5945 [00:00<00:00, 13598.82it/s]
100%|██████████| 5945/5945 [00:00<00:00, 244014.77contig/s]
100%|██████████| 5945/5945 [00:00<00:00, 574979.53contig/s]
head Kickstart.megahit.semibin.1.2.3.binette/final_bins_quality_reports.tsv
bin_id origin name completeness contamination score size N50 contig_count
wc -l Kickstart.megahit.semibin.1.2.3.binette/final_bins_quality_reports.tsv
1 Kickstart.megahit.semibin.1.2.3.binette/final_bins_quality_reports.tsv
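To double-check that the report really is empty, I counted the data rows past the header (sketched here with a mock header-only file, not the real report):

```shell
# Mock report containing only the header line, mimicking the empty result above
printf 'bin_id\torigin\tname\tcompleteness\tcontamination\n' > report.tsv
# Data rows = total lines minus the header line
rows=$(( $(wc -l < report.tsv) - 1 ))
echo "$rows bins in report"   # 0 bins in report
```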
It looks like the results are empty. What should I do next?
Hi,
Thanks for testing the tool with real data!
It looks like the issue is with the --contig2bin_tables argument: you should specify it once and list all the files after it.
The correct command should be:
binette --contig2bin_tables binette.tmp/Kickstart.megahit.semibin.tsv binette.tmp/Kickstart.megahit.semibin2.tsv binette.tmp/Kickstart.megahit.semibin3.tsv \
--threads 12 \
--outdir Kickstart.megahit.semibin.1.2.3.binette \
--contigs $ref \
--checkm2_db $HOME/db/checkm2/CheckM2_database/uniref100.KO.1.dmnd >& binette.log &
If you check the beginning of the log file, it details the different files it processed. For example, with the test dataset:
[..]
[2024-06-24 19:30:40] INFO - Parsing bin2contig files.
[2024-06-24 19:30:40] INFO - 3 bin sets processed:
[2024-06-24 19:30:40] INFO - A - 6 bins
[2024-06-24 19:30:40] INFO - B - 3 bins
[2024-06-24 19:30:40] INFO - C - 4 bins
In your case, only the last table is being processed.
This is definitely something I should improve; ideally the tool would raise an error when the same argument is given multiple times.
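For context, this matches the default behavior of Python's argparse (assuming argparse-style parsing, which is an assumption on my part, not a description of Binette's actual code): repeating an option replaces the earlier value instead of appending to it. A minimal demonstration:

```shell
# Assumption: argparse-style CLI parsing; a repeated option overwrites earlier values
python3 - <<'EOF'
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--contig2bin_tables", nargs="+")
args = parser.parse_args(["--contig2bin_tables", "semibin.tsv",
                          "--contig2bin_tables", "semibin2.tsv",
                          "--contig2bin_tables", "semibin3.tsv"])
print(args.contig2bin_tables)  # only the last occurrence is kept
EOF
```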
Thank you! This worked! Here is the result
bin_id origin name completeness contamination score size N50 contig_count
23 22 100.0 0.09 99.82 4698156 82084 97
15256 union 18 | 17 99.93 0.28 99.37 2940499 37523 114
16043 union 25 | 27 | 144 96.2 0.47 95.26 4284705 39217 144
15592 union 21 | 25 93.93 0.19 93.55000000000001 2190965 12519 222
15577 union 34 | 104 | 12 85.47 1.49 82.49 2126765 10106 283
4076 diff 9 - 27 84.57 1.99 80.58999999999999 2971038 6934 492
1253 intersec 5 & 5 82.26 1.62 79.02000000000001 1643026 8518 241
16531 union 56 | 17 75.11 2.73 69.65 1434469 5017 294
14118 union 4 | 3 | 1 63.6 3.84 55.92 3743529 4716 805
15866 union 33 | 10 57.64 2.4 52.84 1892146 4848 410
4173 diff 34 - 14 - 3 48.26 0.49 47.28 1936962 4159 470
7531 diff 18 - 16 44.89 0.14 44.61 965464 4748 215
Why does bin 23 transform into bin 22 in the first line?
Is there a suggested score threshold?
The "bin id" column represents a unique identifier assigned by Binette, while the "name" column shows the original names of the bins that contributed to the final bin.
Here, the original name is "22" and the Binette-assigned ID is "23". This bin was not combined with any other bins, so the name is simply the original name.
The "name" column may not be very important here; it simply provides information on the initial bins that contributed to its creation and the way they contributed.
For instance, 18 | 17 means the final bin was made from the union of bins 18 and 17.
I understand that this might seem confusing; the name gives you no clue which bin set each original bin came from, especially if bins are named the same way in all bin sets.
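To illustrate the operators, here is a rough sketch of the set operations on contig lists (made-up contig names, not Binette's actual implementation):

```shell
# Two mock bins as sorted contig lists (made-up contig names)
printf 'k141_1\nk141_2\nk141_3\n' > bin18.txt
printf 'k141_3\nk141_4\n'         > bin17.txt
# "18 | 17": union, all contigs from either bin
sort -u bin18.txt bin17.txt
# "18 & 17": intersection, contigs shared by both bins (comm needs sorted input)
comm -12 bin18.txt bin17.txt
# "18 - 17": difference, contigs only in bin 18
comm -23 bin18.txt bin17.txt
```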
That's a great explanation. Thank you!
Hi, this is part of the review for https://github.com/openjournals/joss-reviews/issues/6782
I was trying to follow the examples manually, but I do not see any example data. I got confused trying to use your examples and then getting errors; I believe those errors come from a mismatch between the bins and what is actually in the FASTA file. Could you add example data to the repo, along with instructions on how to use it and the expected output for each specific example?