Closed SilasK closed 3 years ago
Looking up in the docs: https://semibin.readthedocs.io/en/latest/usage/?#multi-sample-binning
Can't I simply name all my contigs
S1.fasta:
>S1_contig1
>S1_sontig2
combined.fasta
>S1_contig1
>S1_contig2
...
>S2_contig1
>S2_contig2
...
Do I really need to name them:
combined.fasta
>S1:S1_contig1
>S1:S1_contig2
...
>S2:S2_contig1
>S2:S2_contig2
...
I think the error is because I named also the
S1.fasta
>S1:S1_contig1
>S1:S1_contig2
which might be the reason for the error.
By the way, what happens if I have all contigs in S1.fasta for the taxonomy and then filter before combining and alignment? I'd like to have the mmseq tax also for the shorter contigs. and I think its a bit redundant to do it twice.
Thanks!
The pipeline of multi-sample binning in SemiBin is (1) Generate data.csv and data_split.csv from combined.fasta for every sample, (2) Train and bin the contigs of every sample.
So in the part (1), we need to know the contig from which sample. So here we use a parameter --separator
to specify this. In our document, we used :
, so you need a combined contig file like >S1:contig1
. But if you have a combined file like >S1_contig1
, you can also use --separator _
to specify the sample.
For the part (2), we just train and bin for every sample. So you just make sure that the contig name of S1.fasta is same to the name used in cannot.txt. I think the error is caused by the inconsistent names of the input contigs and the cannot.txt file.
By the way, what happens if I have all contigs in S1.fasta for the taxonomy and then filter before combining and alignment? I'd like to have the mmseq tax also for the shorter contigs. and I think its a bit redundant to do it twice.
You just need the make sure that the name is consistent when training and binning, Others are file. Gererating the cannot.txt will remove the short contigs.
Sincerely Shaojun
@SilasK : Often the samples all have similarly named contigs (if they are the output of an assembler, it'll be something like k141_123
or something) and we cannot assume that they will be unique. This is why the interface is how it is. I do confess that when @psj1997 first implemented this, I also thought it was confusing, but I frankly have no better idea on how to do it in a general way.
I think that introducing a subcommand to generate the combined FASTA file in the right way is a good idea and could avoid some UX issues. The command can use :
by default, but also check that there are no naming conflicts, &c.
Additionally, it could create an NGLess
script for further processing (while leaving the option of using other tools open for more advanced users).
I see the necessity of ':'
I think that introducing a subcommand to generate the combined FASTA file in the right way is a good idea and could avoid some UX issues. The command can use
:
by default, but also check that there are no naming conflicts, &c. Yes, I think this is a good idea. I saw also you have already themulti_easy_bin
I'm trying simply to make a snakemake pipeline.
@ psj1997 I get the error when I try to use my contigs.
combined.fasta
>shRTF8_0
AATATTTTTTCAAAATGCCCTTTATTCCTTGAAATCCCTCAAAAAAAACTTCTCGAAAAAAGCTCCTAGA
Error:
Traceback (most recent call last):
File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/bin/SemiBin", line 10, in <module>
sys.exit(main())
File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/lib/python3.8/site-packages/SemiBin/main.py", line 933, in main
generate_data_multi(
File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/lib/python3.8/site-packages/SemiBin/main.py", line 534, in generate_data_multi
sample_name, contig_name = seq_record.id.split(separator)
ValueError: not enough values to unpack (expected 2, got 1)
command:
SemiBin generate_data_multi --input-fasta Cobinning/combined_contigs.fasta.gz --input-bam Cobinning/mapping/shHF3.sorted.bam Cobinning/mapping/shHC7.sorted.bam Cobinning/mapping/shHF1.sorted.bam --output Cobinning/SemiBin --threads 12 --separator _ 2> log/semibin/generate_data_multi.log
Here we just run a 'shRTF80'.split('') to split the contig into sample name and contig id. Can you make sure that if all contigs in the combined.fasta in this pattern?
Ok it was my fault
The line
sample_name, contig_name = seq_record.id.split(separator)
should probably be
sample_name, contig_name = seq_record.id.split(separator, 1)
because, otherwise, it can lead to issues if the contig name also contains the separator (not likely with :
, but not unlikely with _
).
For UX reasons, I might even add:
if separator not in seq_record.id:
raise ValueError(f"Expected contigs to contain separator character ({separator}), found {seq_record.id}")
sample_name, contig_name = seq_record.id.split(separator, 1)
Thanks! I will update this.
Sincerely Shaojun
After #42 I tried with unziped fasta. I run into this error in Semibin Train.