a-h-b / binny


Workflow running error #23

Closed · JaphethLin closed this 2 years ago

JaphethLin commented 2 years ago

Hi, I ran into a problem at the last step, binny. Could it be because the data set is too big? I have 30 million sequences.

The error message is shown below:

[Tue Feb 22 09:24:56 2022]
Job 0: binny: Running Python Binny.

Activating conda environment: /public4/home/sc56656/lim_workplace/binny-2.0.3/conda/e863c8786ccf5d967e99b330ab85ab84
Activating conda environment: /public4/home/sc56656/lim_workplace/binny-2.0.3/conda/e863c8786ccf5d967e99b330ab85ab84
Traceback (most recent call last):
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/binny_out/.snakemake/scripts/tmp87uohsem.binny_main.py", line 116, in <module>
    all_good_bins, contig_data_df_org = iterative_embedding(x_contigs, depth_dict, all_good_bins, starting_completeness,
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/workflow/scripts/binny_functions.py", line 1200, in iterative_embedding
    embedding1 = embedding.optimize(n_iter=tsne_early_exag_iterations, exaggeration=early_exag,
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/e863c8786ccf5d967e99b330ab85ab84/lib/python3.8/site-packages/openTSNE/tsne.py", line 670, in optimize
    error, embedding = embedding.optimizer(
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/e863c8786ccf5d967e99b330ab85ab84/lib/python3.8/site-packages/openTSNE/tsne.py", line 1666, in __call__
    error, gradient = objective_function(
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/e863c8786ccf5d967e99b330ab85ab84/lib/python3.8/site-packages/openTSNE/tsne.py", line 1441, in kl_divergence_fft
    sum_P, kl_divergence_ = _tsne.estimate_positive_gradient_nn(
  File "openTSNE/_tsne.pyx", line 104, in openTSNE._tsne.estimate_positive_gradient_nn
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
[Tue Feb 22 15:05:09 2022]
Error in rule binny:
    jobid: 0
    output: bins
    log: logs/binning_binny.log (check log file(s) for error message)
    conda-env: /public4/home/sc56656/lim_workplace/binny-2.0.3/conda/e863c8786ccf5d967e99b330ab85ab84

Traceback (most recent call last):
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 593, in _callback
    raise ex
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 579, in cached_or_run
    run_func(*args)
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2460, in run_wrapper
    raise ex
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2357, in run_wrapper
    run(
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/Snakefile", line 471, in __rule_binny
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/script.py", line 1369, in script
    executor.evaluate()
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/script.py", line 381, in evaluate
    self.execute_script(fd.name, edit=edit)
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/script.py", line 582, in execute_script
    self._execute_cmd("{py_exec} {fname:q}", py_exec=py_exec, fname=fname)
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/script.py", line 414, in _execute_cmd
    return shell(
  File "/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/lib/python3.10/site-packages/snakemake/shell.py", line 265, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'source /public4/home/sc56656/lim_workplace/binny-2.0.3/conda/snakemake_env/bin/activate '/public4/home/sc56656/lim_workplace/binny-2.0.3/conda/e863c8786ccf5d967e99b330ab85ab84'; set -euo pipefail;  python /public4/home/sc56656/lim_workplace/binny-2.0.3/binny_out/.snakemake/scripts/tmp87uohsem.binny_main.py' returned non-zero exit status 1.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Thanks in advance!

ohickl commented 2 years ago

Hi, are all of those 30 million contigs actually used by binny, i.e. is max_n_contigs in the config set to e.g. 30e6? I don't remember seeing that error before; it comes from openTSNE. Could you also post the binny log in path/to/your/output/logs/binning_binny.log so we can check what went on before the crash?
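
For context, one plausible (unconfirmed) source of that dtype error: the Cython kernel in openTSNE/_tsne.pyx is typed for C int, while SciPy promotes the index arrays of very large sparse matrices to int64 ('long'). A tiny stand-alone check along these lines, using made-up data rather than binny's actual inputs, would show which dtypes openTSNE ends up passing:

```python
# Illustrative check, not binny code: inspect the index dtypes of the sparse
# affinity matrix that openTSNE hands to its Cython kernel. With tens of
# millions of points SciPy may promote these to int64 ('long'), which would
# clash with a kernel buffer typed for C int.
import numpy as np
from openTSNE import affinity

x = np.random.rand(500, 40)  # tiny stand-in for the contig feature matrix
aff = affinity.PerplexityBasedNN(x, perplexity=30, metric="manhattan")
print(aff.P.indices.dtype, aff.P.indptr.dtype)  # int32 is what the kernel expects
```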

JaphethLin commented 2 years ago

This is my binny log:

(base) [sc56656@ln25%bscc-a binny_out]$ cat binny.log 
22/02/2022 09:27:29 AM - Starting Binny run for sample binny.
22/02/2022 09:43:37 AM - Looking for single contig bins.
22/02/2022 09:47:53 AM - Finished searching for single contig bins in 210s.
22/02/2022 09:47:56 AM - Found 0 single contig bins.
22/02/2022 09:48:56 AM - 26755842 contigs match length threshold of 1500bp or contain marker genes and have a size of at least 0bp
22/02/2022 09:54:00 AM - Created load balanced list in 3s.
22/02/2022 09:54:00 AM - k-mer sizes to count: 2, 3, 4.
22/02/2022 10:00:20 AM - Finished counting k-mer frequencies for size 2.
22/02/2022 10:09:49 AM - Finished counting k-mer frequencies for size 3.
22/02/2022 10:34:47 AM - Finished counting k-mer frequencies for size 4.
22/02/2022 10:35:00 AM - K-mer frequency matrix created in 2512s.
22/02/2022 11:05:41 AM - Running with 26755438 contigs. Filtered 0 contigs using a min contig size of 0 to stay below 30000000.0 contigs
22/02/2022 11:11:13 AM - Running manifold learning and dimension reduction.
22/02/2022 11:38:24 AM - PCA stats: Dimensions: 40; Amount of variation explained: 75%.
22/02/2022 11:38:44 AM - Running t-SNE dimensionality-reduction with perplexities 10 and 100.
(base) [sc56656@ln25%bscc-a binny_out]$ 

And this is my configuration:

mem:
  # If your HPC resource has a high memory capacity node you can set this to
  # TRUE and specify the amount of memory per core (e.g. if a node has 260 gb of
  # RAM and 10 cores it would be 26).
  big_mem_avail: TRUE
  big_mem_per_core_gb: 31
  # Memory per core of your computing resource.
  normal_mem_per_core_gb: 8
# Path to a temporary directory to write to.
tmp_dir: tmp
raws:
  # Path to an assembly fasta.
  assembly: "data/red.contigs.sorted.rnm.t1000.fasta"
  # Path to a bam file to calculate depth from (single sample mode only atm).
  # Leave empty if you have an average depth per contig file to supply to binny.
  metagenomics_alignment: ""
  # Path to an average depth per contig tsv file. Leave empty if you supply a
  # bam file for binny to calculate average contig depth from.
  contig_depth: "data/binny_depth.txt"
# Sample name
sample: "binny"
# Path to desired output dir binny should create and store results in.
outputdir: "binny_out"
# Path to binny dbs, leave at default if you performed the default install.
db_path: "database"
binning:
  binny:
    # Input a list, e.g. '2,3,4'.
    kmers: '2,3,4'
    # Minimum contig length.
    cutoff: 1000
    # Minimum length of contigs containing CheckM markers.
    cutoff_marker: 0
    # Maximum number of contigs binny uses. If the number of available
    # contigs after minimum size filtering exceeds this, binny will
    # increase the minimum size threshold until the maximum is reached.
    # Prevents use of excessive amounts of memory on large assemblies.
    # Default should ensure adequate performance, adjust e.g. according
    # to available memory.
    max_n_contigs: 3e7
    # Distance metric for openTSNE and HDBSCAN.
    distance_metric: 'manhattan'
    embedding:
      # Maximum number of binny iterations.
      max_iterations: 50
      # Number of iterations with perplexity: (n contigs) / 100.
      tsne_early_exag_iterations: 250
      # Number of iterations with perplexity 1.
      tsne_main_iterations: 750
    clustering:
      # Increasing the HDBSCAN cluster selection epsilon beyond 0.5
      # is not advised as it might massively increase run time, but it might
      # help recover fragmented genomes that would be missed with lower settings.
      hdbscan_epsilon: 0.25
      # Adapted from the HDBSCAN manual: 'Measure of how conservative the
      # clustering should be. With larger values, more points will be declared
      # as noise, and clusters will be restricted to progressively more dense areas.'.
      hdbscan_min_samples: 2
      # Use depth as additional dimension during the initial clustering.
      include_depth_initial: 'False'
      # Use depth as additional dimension during the main clusterings.
      include_depth_main: 'True'
    bin_quality:
      # Minimum value binny will lower completeness to while running. It will
      # start at 90.
      completeness: 80
      # Minimum purity for bins to be selected.
      purity: 85

ohickl commented 2 years ago

It might be that this kind of input is simply too large for openTSNE; I'll look into it. In the meantime, unless you need all contigs to be included, I would recommend setting cutoff_marker to e.g. 300-500 to hopefully eliminate a large number of small contigs. Or is the assembly already prefiltered to a certain contig length? Btw, your config says cutoff: 1000, but the log reports ...contigs match length threshold of 1500bp.... Are you sure you are running with this config file (although the rest seems to match)?
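
In case it helps to pick a value, here is a quick stand-alone scan (illustrative, not part of binny) that counts how many contigs would survive a few candidate length cutoffs; the FASTA path is the one from the config above:

```python
# Quick stand-alone scan: count contigs at or above candidate length cutoffs
# to gauge how much a given setting would shrink the input.
def contigs_above(fasta_path, cutoffs=(300, 500, 1000, 1500)):
    lengths, length = [], 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if length:
                    lengths.append(length)
                length = 0
            else:
                length += len(line.strip())
    if length:
        lengths.append(length)
    return {c: sum(l >= c for l in lengths) for c in cutoffs}

print(contigs_above("data/red.contigs.sorted.rnm.t1000.fasta"))
```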

JaphethLin commented 2 years ago

Oh, I ran it twice, changing only the cutoff (1000 and 1500 bp), and got this error both times. Could the tsne_early_exag_iterations parameter be the cause of this error? I will try again following your suggestion.

ohickl commented 2 years ago

Ok. tsne_early_exag_iterations just controls the number of iterations of the first (early exaggeration) phase of the embedding, but in your case the optimization seems to fail to run at all. Let us know if it worked!
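
For reference, the two phases map onto openTSNE roughly like this; a minimal sketch with made-up data and illustrative perplexity/momentum values, not binny's actual code:

```python
# Minimal sketch of the two-phase t-SNE optimization binny's config exposes.
import numpy as np
from openTSNE import TSNEEmbedding, affinity, initialization

x = np.random.rand(1000, 40).astype(np.float32)  # stand-in feature matrix

affinities = affinity.PerplexityBasedNN(x, perplexity=30, metric="manhattan")
init = initialization.pca(x)
embedding = TSNEEmbedding(init, affinities, negative_gradient_method="fft")

# Phase 1: early exaggeration -- the part tsne_early_exag_iterations controls.
embedding = embedding.optimize(n_iter=250, exaggeration=12, momentum=0.5)
# Phase 2: main optimization -- the part tsne_main_iterations controls.
embedding = embedding.optimize(n_iter=750, momentum=0.8)
```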