BigDataBiology / SemiBin

SemiBin: metagenomics binning with self-supervised deep learning
https://semibin.rtfd.io/

Error when running SemiBin2 (normalization?) #159

Open eperezv opened 4 months ago

eperezv commented 4 months ago

Hello,

I'm running SemiBin2 on my dataset with the multi_easy_bin option. Everything seemed to work properly until it failed with an error that appears to be related to normalization. Any idea what causes the issue and/or how to address it?

Thank you

(SemiBin) eduardo@eduardo-PC:/data$ SemiBin2 multi_easy_bin -i contigs.flt.fna -b mapped/*.sort.bam -o semibin2_output --separator _ -p 18
[2024-03-08 10:19:33,306] INFO: Binning for short_read
[2024-03-08 10:19:33,306] INFO: SemiBin will run in self supervised mode
[2024-03-08 10:19:34,370] INFO: Running with GPU.
[2024-03-08 10:19:34,370] INFO: Performing multi-sample binning
[2024-03-08 10:19:34,371] INFO: Generating training data...
[2024-03-08 10:20:17,377] INFO: Calculating coverage for every sample.
[2024-03-08 11:31:04,311] INFO: Processed: mapped/C101.sort.bam
[2024-03-08 11:37:05,271] INFO: Processed: mapped/C102.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C103.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C111.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C112.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C113.sort.bam
[2024-03-08 11:37:05,272] INFO: Processed: mapped/C11.sort.bam
[2024-03-08 11:41:28,363] INFO: Processed: mapped/C12.sort.bam
[2024-03-08 11:50:46,311] INFO: Processed: mapped/C13.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C161.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C162.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C163.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C171.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C172.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C173.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C181.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C182.sort.bam
[2024-03-08 11:50:46,312] INFO: Processed: mapped/C183.sort.bam
[2024-03-08 11:53:33,711] INFO: Processed: mapped/C191.sort.bam
[2024-03-08 12:03:55,805] INFO: Processed: mapped/C192.sort.bam
[2024-03-08 12:08:27,854] INFO: Processed: mapped/C193.sort.bam
[2024-03-08 12:33:25,614] INFO: Processed: mapped/C1.sort.bam
[2024-03-08 12:38:03,411] INFO: Processed: mapped/C21.sort.bam
[2024-03-08 12:38:03,411] INFO: Processed: mapped/C22.sort.bam
[2024-03-08 12:38:03,411] INFO: Processed: mapped/C23.sort.bam
[2024-03-08 12:38:03,411] INFO: Processed: mapped/C2.sort.bam
[2024-03-08 12:38:03,412] INFO: Processed: mapped/C31.sort.bam
[2024-03-08 12:38:03,412] INFO: Processed: mapped/C32.sort.bam
[2024-03-08 12:38:03,412] INFO: Processed: mapped/C33.sort.bam
[2024-03-08 12:38:03,412] INFO: Processed: mapped/C3.sort.bam
[2024-03-08 12:42:33,510] INFO: Processed: mapped/C81.sort.bam
[2024-03-08 12:42:33,510] INFO: Processed: mapped/C82.sort.bam
[2024-03-08 12:42:33,510] INFO: Processed: mapped/C83.sort.bam
[2024-03-08 12:42:33,510] INFO: Processed: mapped/C91.sort.bam
[2024-03-08 12:44:14,776] INFO: Processed: mapped/C92.sort.bam
[2024-03-08 12:48:07,180] INFO: Processed: mapped/C93.sort.bam
[2024-03-08 12:48:07,180] INFO: Processed: mapped/CE1.sort.bam
[2024-03-08 12:48:07,180] INFO: Processed: mapped/CE2.sort.bam
[2024-03-08 12:48:07,180] INFO: Processed: mapped/CE3.sort.bam
[2024-03-08 13:12:59,818] INFO: Training model and clustering for S1CNODE.
[2024-03-08 13:12:59,820] INFO: Start training from a single sample.
[2024-03-08 13:13:00,438] INFO: Training model...
  0%|                                                                                                           | 0/15 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/eduardo/miniconda3/envs/SemiBin/bin/SemiBin2", line 33, in <module>
    sys.exit(load_entry_point('SemiBin==2.1.0', 'console_scripts', 'SemiBin2')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin-2.1.0-py3.12.egg/SemiBin/main.py", line 1563, in main2
    multi_easy_binning(
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin-2.1.0-py3.12.egg/SemiBin/main.py", line 1326, in multi_easy_binning
    training(logger, None, args.num_process,
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin-2.1.0-py3.12.egg/SemiBin/main.py", line 1103, in training
    model = train_self(logger,
            ^^^^^^^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin-2.1.0-py3.12.egg/SemiBin/self_supervised_model.py", line 77, in train_self
    train_data_depth = normalize(train_data_depth, axis=1, norm='l1')
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/scikit_learn-1.4.1.post1-py3.12-linux-x86_64.egg/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/scikit_learn-1.4.1.post1-py3.12-linux-x86_64.egg/sklearn/preprocessing/_data.py", line 1925, in normalize
    X = check_array(
        ^^^^^^^^^^^^
  File "/home/eduardo/miniconda3/envs/SemiBin/lib/python3.12/site-packages/scikit_learn-1.4.1.post1-py3.12-linux-x86_64.egg/sklearn/utils/validation.py", line 1072, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0, 39)) while a minimum of 1 is required by the normalize function.

psj1997 commented 4 months ago

It seems this is still the error that occurs when combining the k-mer features and the abundance features. Can you take a look at the files SemiBin generated for every sample (data.csv/data_split.csv/cov.csv)? How many columns do these files have?
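
A quick way to check the column and row counts, as a minimal sketch using only the Python standard library. The helper names `csv_shape` and `report` are hypothetical and not part of SemiBin, and the directory passed to `report` is a placeholder for one per-sample output folder. The sketch assumes each file has a single header row.

```python
import csv
from pathlib import Path


def csv_shape(path):
    """Return (n_data_rows, n_columns) for a CSV file with one header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # Count non-empty rows after the header
        n_rows = sum(1 for row in reader if row)
    return n_rows, len(header)


def report(sample_dir, names=("data.csv", "data_split.csv")):
    """Collect the shape of each expected file in one per-sample directory.

    Files that do not exist are reported as None.
    """
    shapes = {}
    for name in names:
        path = Path(sample_dir) / name
        shapes[name] = csv_shape(path) if path.exists() else None
    return shapes
```

For example, `print(report("path/to/sample_dir"))` prints a dictionary mapping each file name to `(rows, columns)`. A file that reports 0 data rows would be consistent with the `shape=(0, 39)` in the traceback.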

Thanks!

eperezv commented 4 months ago

I see a folder containing the FASTA files and files like C1.sort.bam_21_data.cov.csv and C1.sort.bam_21_data_split_cov.csv. There are also per-sample folders that may contain what you are asking for. data.csv contains 176 columns (i.e., one with no header, 135 columns named 1, 2, 3... and then another 39 columns named mapped/C1.sort.bam_cov and so on). data_split.csv has the same columns as before, but only the headers. data_cov.csv contains 40 columns (one with numbers plus 39 that are my samples, named the same as before, mapped/C1...).

psj1997 commented 4 months ago

Can you show the first five rows of data.csv, data_split.csv, data_csv.csv and cov_split.csv?

eperezv commented 4 months ago

I don't have exactly the files you indicate, but these are the ones I have (per sample):

data.csv image

data_split.csv image

data_cov.csv image

data_split_cov.csv image

psj1997 commented 4 months ago

Can you check the first column of data_split_cov.csv? Do the entries look like '1581622_1', '1581622_2'? Thanks!
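
A minimal sketch of that check, using only the Python standard library. `first_column_summary` is a hypothetical helper, not part of SemiBin. It assumes the file has one header row and that split contigs are labelled with a trailing `_1` or `_2`, as in the example above. The suffix test is only a heuristic, since an ordinary contig name could also end in `_1` or `_2`.

```python
import csv
import re

# Assumption: split contigs are labelled <contig>_1 and <contig>_2
SPLIT_SUFFIX = re.compile(r"_[12]$")


def first_column_summary(path, limit=5):
    """Return the first few row labels, the number of labels ending in
    _1 or _2, and the total number of labels in the first column."""
    labels = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            if row:
                labels.append(row[0])
    n_split = sum(1 for label in labels if SPLIT_SUFFIX.search(label))
    return labels[:limit], n_split, len(labels)
```

Calling `first_column_summary("data_split_cov.csv")` returns the first labels together with the two counts. If the second value is 0, none of the labels carry the `_1`/`_2` suffix.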

eperezv commented 4 months ago

There is no _1, _2... Only what's shown.