Problems with contig names

SilasK commented 3 years ago

After #42 I tried with unziped fasta. I run into this error in Semibin Train.


2021-09-12 01:20:53,955 - Start training from one sample.
2021-09-12 01:20:57,764 - Generate training data of 0:
Traceback (most recent call last):
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/bin/SemiBin", line 10, in <module>
    sys.exit(main())
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/lib/python3.8/site-packages/SemiBin/main.py", line 947, in main
    training(logger, args.contig_fasta, args.bams, args.num_process,
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/lib/python3.8/site-packages/SemiBin/main.py", line 676, in training
    model = train(
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/lib/python3.8/site-packages/SemiBin/semi_supervised_model.py", line 204, in train
    train_input_1.append(train_data_input[mapObj[str(link[0])]])
KeyError: 'shHF6:_shHF6_12241'

SilasK commented 3 years ago

Looking up in the docs: https://semibin.readthedocs.io/en/latest/usage/?#multi-sample-binning

Can't I simply name all my contigs

S1.fasta:

>S1_contig1
>S1_sontig2

combined.fasta

>S1_contig1
>S1_contig2
...
>S2_contig1
>S2_contig2
...

Do I really need to name them:

combined.fasta

>S1:S1_contig1
>S1:S1_contig2
...
>S2:S2_contig1
>S2:S2_contig2
...

I think the error is because I named also the

S1.fasta

>S1:S1_contig1
>S1:S1_contig2

which might be the reason for the error.

By the way, what happens if I have all contigs in S1.fasta for the taxonomy and then filter before combining and alignment? I'd like to have the mmseq tax also for the shorter contigs. and I think its a bit redundant to do it twice.

psj1997 commented 3 years ago

Thanks!

The pipeline of multi-sample binning in SemiBin is (1) Generate data.csv and data_split.csv from combined.fasta for every sample, (2) Train and bin the contigs of every sample.

So in the part (1), we need to know the contig from which sample. So here we use a parameter --separator to specify this. In our document, we used : , so you need a combined contig file like >S1:contig1. But if you have a combined file like >S1_contig1, you can also use --separator _ to specify the sample.

For the part (2), we just train and bin for every sample. So you just make sure that the contig name of S1.fasta is same to the name used in cannot.txt. I think the error is caused by the inconsistent names of the input contigs and the cannot.txt file.

By the way, what happens if I have all contigs in S1.fasta for the taxonomy and then filter before combining and alignment? I'd like to have the mmseq tax also for the shorter contigs. and I think its a bit redundant to do it twice.

You just need the make sure that the name is consistent when training and binning, Others are file. Gererating the cannot.txt will remove the short contigs.

Sincerely Shaojun

luispedro commented 3 years ago

@SilasK : Often the samples all have similarly named contigs (if they are the output of an assembler, it'll be something like k141_123 or something) and we cannot assume that they will be unique. This is why the interface is how it is. I do confess that when @psj1997 first implemented this, I also thought it was confusing, but I frankly have no better idea on how to do it in a general way.

I think that introducing a subcommand to generate the combined FASTA file in the right way is a good idea and could avoid some UX issues. The command can use : by default, but also check that there are no naming conflicts, &c.

Additionally, it could create an NGLess script for further processing (while leaving the option of using other tools open for more advanced users).

SilasK commented 3 years ago

I see the necessity of ':'

I think that introducing a subcommand to generate the combined FASTA file in the right way is a good idea and could avoid some UX issues. The command can use : by default, but also check that there are no naming conflicts, &c. Yes, I think this is a good idea. I saw also you have already the multi_easy_bin I'm trying simply to make a snakemake pipeline.

SilasK commented 3 years ago

@ psj1997 I get the error when I try to use my contigs.

combined.fasta

>shRTF8_0
AATATTTTTTCAAAATGCCCTTTATTCCTTGAAATCCCTCAAAAAAAACTTCTCGAAAAAAGCTCCTAGA

Error:

Traceback (most recent call last):
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/bin/SemiBin", line 10, in <module>
    sys.exit(main())
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/lib/python3.8/site-packages/SemiBin/main.py", line 933, in main
    generate_data_multi(
  File "/home/kiesers/scratch/Atlas/databases/conda_envs/1cec50d1b28e8ec0fae5f1275adeb023/lib/python3.8/site-packages/SemiBin/main.py", line 534, in generate_data_multi
    sample_name, contig_name = seq_record.id.split(separator)
ValueError: not enough values to unpack (expected 2, got 1)

command:

 SemiBin generate_data_multi  --input-fasta Cobinning/combined_contigs.fasta.gz  --input-bam Cobinning/mapping/shHF3.sorted.bam Cobinning/mapping/shHC7.sorted.bam Cobinning/mapping/shHF1.sorted.bam  --output Cobinning/SemiBin  --threads 12  --separator _  2> log/semibin/generate_data_multi.log

psj1997 commented 3 years ago

Here we just run a 'shRTF80'.split('') to split the contig into sample name and contig id. Can you make sure that if all contigs in the combined.fasta in this pattern?

SilasK commented 3 years ago

Ok it was my fault

luispedro commented 3 years ago

The line

    sample_name, contig_name = seq_record.id.split(separator)

should probably be

    sample_name, contig_name = seq_record.id.split(separator, 1)

because, otherwise, it can lead to issues if the contig name also contains the separator (not likely with :, but not unlikely with _).

For UX reasons, I might even add:

    if separator not in seq_record.id:
       raise ValueError(f"Expected contigs to contain separator character ({separator}), found {seq_record.id}")
    sample_name, contig_name = seq_record.id.split(separator, 1)

psj1997 commented 3 years ago

Thanks! I will update this.

Sincerely Shaojun

BigDataBiology / SemiBin

Problems with contig names #43