Sample dataset - Githubissues

lskatz commented 5 months ago

Hi, this is part of the review for https://github.com/openjournals/joss-reviews/issues/6782

I was trying to follow the examples manually but I do not see any example data. I got confused by trying to use your examples and then getting errors. But I think those errors have to do with a mismatch between bins and what is actually in the fasta file. Could you add example data to the repo with examples on how to use it? And then what we'd expect as output for each specific example.

In more detail, I believe that we have a mismatch while following the current example and so I think you should put the proper example data into the repo with instructions.

(binette) $ mkdir LKTEST
(binette) $ cd LKTEST/
(binette) $ cat > bin_set1.tsv
contig_1   binA
contig_8   binA
contig_15  binB
contig_9   binC
(binette) $ cat > bin_set2.tsv
contig_1   bin.0
contig_8   bin.0
contig_15  bin.1
contig_9   bin.2
contig_10  bin.0
(binette) $ find .. -name '*.fasta'
../tests/contigs.fasta
../output.fasta
(binette) $ binette --contig2bin_tables bin_set1.tsv bin_set2.tsv --contigs ../tests/contigs.fasta
Traceback (most recent call last):
  File "/bin/miniconda3/envs/binette/bin/binette", line 8, in <module>
    sys.exit(main())
  File "/bin/miniconda3/envs/binette/lib/python3.8/site-packages/binette/main.py", line 334, in main
    bin_set_name_to_bins, original_bins, contigs_in_bins, contig_to_length = parse_input_files(args.bin_dirs, args.contig2bin_tables, args.contigs, fasta_extensions=set(args.fasta_extensions))
  File "/bin/miniconda3/envs/binette/lib/python3.8/site-packages/binette/main.py", line 153, in parse_input_files
    bin_set_name_to_bins = bin_manager.parse_contig2bin_tables(bin_name_to_bin_table)
  File "/bin/miniconda3/envs/binette/lib/python3.8/site-packages/binette/bin_manager.py", line 222, in parse_contig2bin_tables
    bin_name_to_bins[name] = get_bins_from_contig2bin_table(contig2bin_table, name)
  File "/bin/miniconda3/envs/binette/lib/python3.8/site-packages/binette/bin_manager.py", line 243, in get_bins_from_contig2bin_table
    bin_name = line.strip().split("\t")[1]
IndexError: list index out of range
(binette) $ seqtk comp ../tests/contigs.fasta
contig1 1020    294     190     286     250     0       0       0       110     0       0       0
contig2 1680    598     248     292     542     0       0       0       26      0       0       0

JeanMainguy commented 5 months ago

Hi,

The data generated by the unit tests are mock data meant only for testing individual functions, and can't be used to assess the tool's overall functionality.

To test the full process of Binette, I've set up a test data repository: https://github.com/genotoul-bioinfo/Binette_TestData. This repository contains minimal dataset and expected outcomes which are used in the CI workflow to ensure everything works as it should.

You can find instructions on how to run this test locally in the Test section of the documentation: https://binette.readthedocs.io/en/latest/tests.html#functional-tests.

Keep in mind that these are very basic datasets, designed just to test the tool, and don't reflect a real binning analysis. I created this data by splitting some genomes to mimic metagenomic data.

Best,

lskatz commented 5 months ago

Ok I am going to try some real data from our metagenomics benchmark. Do you have suggestions on how to make good input data MAGs for Binette? I was thinking of trying out SemiBin2.

lskatz commented 5 months ago

Hi, I tried to run Binette in full but I don't really know what's wrong. Can you look over this?

Steps:

I downloaded the metagenome Kickstart from the above dataset (SAMN05024035).

I ran megahit to get an assembly

megahit -1 Kickstart_1.fastq -2 Kickstart_2.fastq --out-dir Kickstart.megahit --out-prefix R1 --num-cpu-threads 12

I mapped the reads for input for SemiBin2 to create Kickstart.megahit.sorted.bam.

ref=Kickstart.megahit/R1.contigs.fa
R1=Kickstart_1.fastq
R2=${R1/_1/_2}
bowtie2-build -f $ref $ref
bowtie2 -q --fr -x $ref -1 $R1 -2 $R2 -S Kickstart.sam -p 12
samtools view -h -F 4 -b -S -o Kickstart.unsorted.bam -@ 12 Kickstart.sam
samtools sort -m 1000000000 Kickstart.unsorted.bam -o Kickstart.megahit.sorted.bam -@ 12

samtools flagstats Kickstart.megahit.sorted.bam
6275427 + 0 in total (QC-passed reads + QC-failed reads)
6275427 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
6275427 + 0 mapped (100.00% : N/A)
6275427 + 0 primary mapped (100.00% : N/A)
6275427 + 0 paired in sequencing
3160305 + 0 read1
3115122 + 0 read2
5859266 + 0 properly paired (93.37% : N/A)
6146726 + 0 with itself and mate mapped
128701 + 0 singletons (2.05% : N/A)
49282 + 0 with mate mapped to a different chr
34450 + 0 with mate mapped to a different chr (mapQ>=5)

I ran SemiBin2 three times to give me different results/inputs for Binette. Each iteration uses a random seed and gives a different output.

for rep in 1 2 3; do
  SemiBin2 single_easy_bin -i $ref -b Kickstart.megahit.sorted.bam -o Kickstart.megahit.semibin$rep -p 12 > Kickstart.megahit.semibin.log 2>&1; 
done

For Binette, I prepared an input folder and then ran it

for i in *megahit.semibin*; do 
  # Remove header
  tail -n +2 $i/contig_bins.tsv > binette.tmp/$i.tsv; 
done;
# oops it tried to read from the log file
rm binette.tmp/Kickstart.megahit.semibin.log.tsv
ls binette.tmp/ -lh
total 242K
-rw-------. 1 user users 81K Jun 24 10:50 Kickstart.megahit.semibin2.tsv
-rw-------. 1 user users 81K Jun 24 10:50 Kickstart.megahit.semibin3.tsv
-rw-------. 1 user users 81K Jun 24 10:50 Kickstart.megahit.semibin.tsv
$ head binette.tmp/*
==> binette.tmp/Kickstart.megahit.semibin2.tsv <==
k141_85796      0
k141_53630      1
k141_21463      2
k141_85804      3
k141_32191      4
k141_42940      5
k141_75081      6
k141_10736      7
k141_42942      3
k141_32194      1

==> binette.tmp/Kickstart.megahit.semibin3.tsv <==
k141_85796      0
k141_53630      1
k141_21463      2
k141_85804      1
k141_32191      3
k141_42940      4
k141_75081      5
k141_10736      6
k141_42942      1
k141_32194      3

==> binette.tmp/Kickstart.megahit.semibin.tsv <==
k141_85796      0
k141_53630      1
k141_21463      2
k141_85804      1
k141_32191      3
k141_42940      4
k141_75081      5
k141_10736      6
k141_42942      7
k141_32194      1

So at this point it appears I have different results of MAGs for input for Binette, and I'm ready to run it.

binette --contig2bin_tables binette.tmp/Kickstart.megahit.semibin.tsv --contig2bin_tables binette.tmp/Kickstart.megahit.semibin2.tsv --contig2bin_tables binette.tmp/Kickstart.megahit.semibin3.tsv  --threads 12 --outdir Kickstart.megahit.semibin.1.2.3.binette --contigs $ref --checkm2_db $HOME/db/checkm2/CheckM2_database/uniref100.KO.1.dmnd >& binette.log &

tail binette.log
100%|██████████| 5945/5945 [00:00<00:00, 13598.82it/s]
100%|██████████| 5945/5945 [00:00<00:00, 244014.77contig/s]
100%|██████████| 5945/5945 [00:00<00:00, 574979.53contig/s]

head Kickstart.megahit.semibin.1.2.3.binette/final_bins_quality_reports.tsv
bin_id  origin  name    completeness    contamination   score   size    N50     contig_count
wc -l Kickstart.megahit.semibin.1.2.3.binette/final_bins_quality_reports.tsv
1 Kickstart.megahit.semibin.1.2.3.binette/final_bins_quality_reports.tsv

I think that the results are empty and so what should I do next?

JeanMainguy commented 5 months ago

Hi,

Thanks for testing the tool with real data!

It looks like the issue is with the --contig2bin_tables argument. You should specify it once and list all the files after it.

The correct command should be:

binette --contig2bin_tables binette.tmp/Kickstart.megahit.semibin.tsv binette.tmp/Kickstart.megahit.semibin2.tsv binette.tmp/Kickstart.megahit.semibin3.tsv \
        --threads 12 \
        --outdir Kickstart.megahit.semibin.1.2.3.binette \
        --contigs $ref \
        --checkm2_db $HOME/db/checkm2/CheckM2_database/uniref100.KO.1.dmnd >& binette.log &

If you check the beginning of the log file, it details the different files it processes:

So for example with the test data set:

[..]
[2024-06-24 19:30:40] INFO - Parsing bin2contig files.
[2024-06-24 19:30:40] INFO - 3 bin sets processed:
[2024-06-24 19:30:40] INFO -  A - 6 bins
[2024-06-24 19:30:40] INFO -  B - 3 bins
[2024-06-24 19:30:40] INFO -  C - 4 bins

In your case, only the last table is being processed.

This is something I surely should improve; the tool would ideally throw an error when the same argument is given multiple times.

lskatz commented 5 months ago

Thank you! This worked! Here is the result

bin_id  origin    name           completeness  contamination  score              size     N50    contig_count
23                22             100.0         0.09           99.82              4698156  82084  97
15256   union     18 | 17        99.93         0.28           99.37              2940499  37523  114
16043   union     25 | 27 | 144  96.2          0.47           95.26              4284705  39217  144
15592   union     21 | 25        93.93         0.19           93.55000000000001  2190965  12519  222
15577   union     34 | 104 | 12  85.47         1.49           82.49              2126765  10106  283
4076    diff      9 - 27         84.57         1.99           80.58999999999999  2971038  6934   492
1253    intersec  5 & 5          82.26         1.62           79.02000000000001  1643026  8518   241
16531   union     56 | 17        75.11         2.73           69.65              1434469  5017   294
14118   union     4 | 3 | 1      63.6          3.84           55.92              3743529  4716   805
15866   union     33 | 10        57.64         2.4            52.84              1892146  4848   410
4173    diff      34 - 14 - 3    48.26         0.49           47.28              1936962  4159   470
7531    diff      18 - 16        44.89         0.14           44.61              965464   4748   215

lskatz commented 5 months ago

Why does bin 23 transform into bin 22 in the first line?

Is there a suggested score threshold?

JeanMainguy commented 5 months ago

The "bin id" column represents a unique identifier assigned by Binette, while the "name" column shows the original names of the bins that contributed to the final bin. Here, the original name is "22" and the Binette assigned name is "23". This bin has not been combined with any other bins so the name is simply the original name. The "name" column may not be very important here; it simply provides information on the initial bins that contributed to its creation and the way they contributed. For instance 18 | 17 means the final bin has been made from the union of bin 18 and 17. I understand that this might seem confusing and the name actually give you no clue from which bin set they are from especially if all bins are name the same way in all bin sets.

lskatz commented 5 months ago

That's a great explanation. Thank you!

genotoul-bioinfo / Binette

Sample dataset #18