genomicsITER / NanoCLUST

NanoCLUST is an analysis pipeline for UMAP-based classification of amplicon-based full-length 16S rRNA nanopore reads
MIT License

read clustering error #24

Open ajinkyakhilari opened 3 years ago

ajinkyakhilari commented 3 years ago

executor > local (5)
[d8/6fd53d] process > QC (1) [100%] 1 of 1 ✔
[80/44cf34] process > fastqc (1) [100%] 1 of 1 ✔
[0c/7b75e3] process > kmer_freqs (1) [100%] 1 of 1 ✔
[cb/16e8b0] process > read_clustering (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > split_by_cluster -
[- ] process > read_correction -
[- ] process > draft_selection -
[- ] process > racon_pass -
[- ] process > medaka_pass -
[- ] process > consensus_classification -
[- ] process > join_results -
[- ] process > get_abundances -
[- ] process > plot_abundances -
[12/0efcc7] process > output_documentation [100%] 1 of 1 ✔
Error executing process > 'read_clustering (1)'

Caused by: Process read_clustering (1) terminated with an error exit status (1)

Command executed [/NanoporeTools/NanoCLUST/templates/umap_hdbscan.py]:

#!/usr/bin/env python

import numpy as np
import umap
import matplotlib.pyplot as plt
from sklearn import decomposition
import random
import pandas as pd
import hdbscan

df = pd.read_csv("freqs.txt", delimiter="\t")

#UMAP
motifs = [x for x in df.columns.values if x not in ["read", "length"]]
X = df.loc[:,motifs]
X_embedded = umap.UMAP(n_neighbors=15, min_dist=0.1, verbose=2).fit_transform(X)

df_umap = pd.DataFrame(X_embedded, columns=["D1", "D2"])
umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)

#HDBSCAN
X = umap_out.loc[:,["D1", "D2"]]
umap_out["bin_id"] = hdbscan.HDBSCAN(min_cluster_size=int(200), cluster_selection_epsilon=int(0.5)).fit_predict(X)

#PLOT
plt.figure(figsize=(20,20))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=umap_out["bin_id"], cmap='Spectral', s=1)
plt.xlabel("UMAP1", fontsize=18)
plt.ylabel("UMAP2", fontsize=18)
plt.gca().set_aspect('equal', 'datalim')
plt.title("Projecting " + str(len(umap_out['bin_id'])) + " reads. " + str(len(umap_out['bin_id'].unique())) + " clusters generated by HDBSCAN", fontsize=18)

for cluster in np.sort(umap_out['bin_id'].unique()):
    read = umap_out.loc[umap_out['bin_id'] == cluster].iloc[0]
    plt.annotate(str(cluster), (read['D1'], read['D2']), weight='bold', size=14)

plt.savefig('hdbscan.output.png')
umap_out.to_csv("hdbscan.output.tsv", sep="\t", index=False)

Command exit status: 1

Command output:
  UMAP(verbose=2)
  Construct fuzzy simplicial set
  Fri Jan 29 12:06:46 2021 Finding Nearest Neighbors
  Fri Jan 29 12:06:46 2021 Building RP forest with 21 trees
  Fri Jan 29 12:06:49 2021 NN descent for 17 iterations
      1 / 17
      2 / 17
      3 / 17
      4 / 17
      5 / 17
      6 / 17
      7 / 17
      8 / 17
  Stopping threshold met -- exiting after 8 iterations
  Fri Jan 29 12:07:08 2021 Finished Nearest Neighbor Search
  Fri Jan 29 12:07:10 2021 Construct embedding
      completed 0 / 200 epochs
      completed 20 / 200 epochs
      completed 40 / 200 epochs
      completed 60 / 200 epochs
      completed 80 / 200 epochs
      completed 100 / 200 epochs
      completed 120 / 200 epochs
      completed 140 / 200 epochs
      completed 160 / 200 epochs
      completed 180 / 200 epochs
  Fri Jan 29 12:08:08 2021 Finished embedding

Command error:
  Traceback (most recent call last):
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/site-packages/joblib/parallel.py", line 820, in dispatch_one_batch
      tasks = self._ready_batches.get(block=False)
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/queue.py", line 167, in get
      raise Empty
  _queue.Empty

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File ".command.sh", line 23, in <module>
      umap_out["bin_id"] = hdbscan.HDBSCAN(min_cluster_size=int(200), cluster_selection_epsilon=int(0.5)).fit_predict(X)
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
      self.fit(X)
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 919, in fit
      self._min_spanning_tree) = hdbscan(X, **kwargs)
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 610, in hdbscan
      (single_linkage_tree, result_min_span_tree) = memory.cache(
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
      return self.func(*args, **kwargs)
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 275, in _hdbscan_boruvka_kdtree
      alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
    File "hdbscan/_hdbscan_boruvka.pyx", line 375, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
    File "hdbscan/_hdbscan_boruvka.pyx", line 411, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/site-packages/joblib/parallel.py", line 1041, in __call__
      if self.dispatch_one_batch(iterator):
    File "/home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/conda/read_clustering-998d6264058a39a660addfff9962d1f9/lib/python3.8/site-packages/joblib/parallel.py", line 831, in dispatch_one_batch
      islice = list(itertools.islice(iterator, big_batch_size))
    File "hdbscan/_hdbscan_boruvka.pyx", line 412, in genexpr
  TypeError: delayed() got an unexpected keyword argument 'check_pickle'

Work dir: /home/administrator/Desktop/Bovine_Mastitis_Project/Mastitis_nanopore_data/Project_1/Project1/Project1/20190612_1214_MN26935_FAK72557_229de4aa/work/cb/16e8b0a8d65a824ccac0a1378149f9

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

bpenaud commented 3 years ago

Hello, I have exactly the same error. Did you resolve it? Regards, Benjamin Penaud

Thomieh73 commented 3 years ago

Hey, I tried the test run with conda and my run crashed at the same spot.

This is the error from the specific working directory:

UMAP(verbose=2)
Construct fuzzy simplicial set
Tue Feb 16 16:16:30 2021 Finding Nearest Neighbors
Tue Feb 16 16:16:33 2021 Finished Nearest Neighbor Search
Tue Feb 16 16:16:35 2021 Construct embedding
        completed  0  /  500 epochs
        completed  50  /  500 epochs
        completed  100  /  500 epochs
        completed  150  /  500 epochs
        completed  200  /  500 epochs
        completed  250  /  500 epochs
        completed  300  /  500 epochs
        completed  350  /  500 epochs
        completed  400  /  500 epochs
        completed  450  /  500 epochs
Tue Feb 16 16:16:42 2021 Finished embedding
Traceback (most recent call last):
  File "/cluster/work/users/thhaverk/nanoclust_tmp/fe/f4dc7167db2f6187bd0d5bf4ecc692/.command.sh", line 26, in <module>
    plt.figure(figsize=(20,20))
  File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-800e1e27475cbaa0538f834c4aacc420/lib/python3.8/site-packages/matplotlib/pyplot.py", line 671, in figure
    figManager = new_figure_manager(num, figsize=figsize,
  File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-800e1e27475cbaa0538f834c4aacc420/lib/python3.8/site-packages/matplotlib/pyplot.py", line 299, in new_figure_manager
    return _backend_mod.new_figure_manager(*args, **kwargs)
  File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-800e1e27475cbaa0538f834c4aacc420/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 3494, in new_figure_manager
    return cls.new_figure_manager_given_figure(num, fig)
  File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-800e1e27475cbaa0538f834c4aacc420/lib/python3.8/site-packages/matplotlib/backends/_backend_tk.py", line 868, in new_figure_manager_given_figure
    window = tk.Tk(className="matplotlib")
  File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-800e1e27475cbaa0538f834c4aacc420/lib/python3.8/tkinter/__init__.py", line 2261, in __init__
    self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: couldn't connect to display "158.36.42.36:25.0"

Any idea how to solve it?

Thomieh73 commented 3 years ago

Okay, that did not work for me. Can you explain why you found that that package was needed?

When I check the error, I see this:

Command error:
  Traceback (most recent call last):
    File ".command.sh", line 26, in <module>
      plt.figure(figsize=(20,20))
    File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-165c04fe82a861f4b9dc6382a66f5ed7/lib/python3.8/site-packages/matplotlib/pyplot.py", line 671, in figure
      figManager = new_figure_manager(num, figsize=figsize,
    File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-165c04fe82a861f4b9dc6382a66f5ed7/lib/python3.8/site-packages/matplotlib/pyplot.py", line 299, in new_figure_manager
      return _backend_mod.new_figure_manager(*args, **kwargs)
    File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-165c04fe82a861f4b9dc6382a66f5ed7/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 3494, in new_figure_manager
      return cls.new_figure_manager_given_figure(num, fig)
    File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-165c04fe82a861f4b9dc6382a66f5ed7/lib/python3.8/site-packages/matplotlib/backends/_backend_tk.py", line 868, in new_figure_manager_given_figure
      window = tk.Tk(className="matplotlib")
    File "/cluster/work/users/thhaverk/nanoclust_tmp/conda/read_clustering-165c04fe82a861f4b9dc6382a66f5ed7/lib/python3.8/tkinter/__init__.py", line 2261, in __init__
      self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
  _tkinter.TclError: couldn't connect to display "158.36.42.36:25.0"

Especially the last line, which refers to the IP address of a display. Why is that needed? I am working on an HPC cluster, so there is no display around other than my terminal.

I will check the docker option
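
The traceback above passes through matplotlib's `_backend_tk.py`, so pyplot was trying to open a Tk window on the forwarded X display even though the script only saves a PNG. One possible workaround, sketched below under the assumption that figures only need to be written to files, is to select the non-interactive Agg backend before pyplot is imported. `MPLBACKEND` is matplotlib's own environment variable; the helper name `force_headless_backend` is made up for this illustration and is not part of NanoCLUST.

```python
import io
import os


def force_headless_backend(environ=os.environ):
    """Ask matplotlib for the file-only Agg backend (hypothetical helper)."""
    # matplotlib reads MPLBACKEND when it is first imported, so this has to
    # run before `import matplotlib.pyplot`.
    environ["MPLBACKEND"] = "Agg"
    return environ["MPLBACKEND"]


force_headless_backend()

try:
    import matplotlib.pyplot as plt
except ImportError:  # matplotlib is not installed, so there is nothing to draw
    plt = None

if plt is not None:
    # With Agg no window is created, so no X display is contacted.
    fig = plt.figure(figsize=(2, 2))
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
    plt.close(fig)
```

Inside a script, calling `matplotlib.use("Agg")` before importing pyplot should have the same effect.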

Thomieh73 commented 3 years ago

Okay, I solved my issue by modifying the nextflow.config file to use Singularity instead of Docker. I added a singularity profile to the profiles block (see below). I work on an HPC cluster where we are not allowed to use Docker, but I can use Singularity with Docker images.

This is the modified nextflow.config file for me:

profiles {
  test { includeConfig 'conf/test.config' }
  conda {
    process {
      withName: demultiplex { conda = "$baseDir/conda_envs/demultiplex/environment.yml" }
      withName: demultiplex_porechop { conda = "$baseDir/conda_envs/demultiplex_porechop/environment.yml" }
      withName: QC { conda = "$baseDir/conda_envs/qc_fastp/environment.yml" }
      withName: fastqc { conda = "$baseDir/conda_envs/fastqc/environment.yml" }
      withName: multiqc { conda = "$baseDir/conda_envs/fastqc/environment.yml" }
      withName: kmer_freqs { conda = "$baseDir/conda_envs/kmer_freqs/environment.yml" }
      withName: read_clustering { conda = "$baseDir/conda_envs/read_clustering/environment.yml" }
      withName: split_by_cluster { conda = "$baseDir/conda_envs/split_by_cluster/environment.yml" }
      withName: read_correction { conda = "$baseDir/conda_envs/read_correction/environment.yml" }
      withName: draft_selection { conda = "$baseDir/conda_envs/draft_selection/environment.yml" }
      withName: racon_pass { conda = "$baseDir/conda_envs/racon_pass/environment.yml" }
      withName: medaka_pass { conda = "$baseDir/conda_envs/medaka_pass/environment.yml" }
      withName: consensus_classification { conda = "$baseDir/conda_envs/consensus_classification/environment.yml" }
      withName: get_abundances { conda = "$baseDir/conda_envs/cluster_plot_pool/environment.yml" }
      withName: plot_abundances { conda = "$baseDir/conda_envs/cluster_plot_pool/environment.yml" }
      withName: output_documentation { conda = "$baseDir/conda_envs/output_documentation/environment.yml" }
    }
  }
  docker {
    docker.enabled = true
    //process.container = 'nf-core/nanoclust:latest'
    process {
      withName: demultiplex { container = 'hecrp/nanoclust-demultiplex' }
      withName: demultiplex_porechop { container = 'hecrp/nanoclust-demultiplex_porechop' }
      withName: QC { container = 'hecrp/nanoclust-qc' }
      withName: fastqc { container = 'hecrp/nanoclust-fastqc' }
      withName: multiqc { container = 'hecrp/nanoclust-fastqc' }
      withName: kmer_freqs { container = 'hecrp/nanoclust-kmer_freqs' }
      withName: read_clustering { container = 'hecrp/nanoclust-read_clustering' }
      withName: split_by_cluster { container = 'hecrp/nanoclust-split_by_cluster' }
      withName: read_correction { container = 'hecrp/nanoclust-read_correction' }
      withName: draft_selection { container = 'hecrp/nanoclust-draft_selection' }
      withName: racon_pass { container = 'hecrp/nanoclust-racon_pass' }
      withName: medaka_pass { container = 'hecrp/nanoclust-medaka_pass' }
      withName: consensus_classification { container = 'hecrp/nanoclust-consensus_classification'
                                           docker.temp = "$baseDir/" }
      withName: get_abundances { container = 'hecrp/nanoclust-plot_abundances' }
      withName: plot_abundances { container = 'hecrp/nanoclust-plot_abundances' }
      withName: output_documentation { container = 'hecrp/nanoclust-output_documentation' }
    }
  }
  singularity {
    singularity.enabled = true
    singularity.autoMounts = true
    //process.container = 'nf-core/nanoclust:latest'
    process {
      withName: demultiplex { container = 'docker://hecrp/nanoclust-demultiplex' }
      withName: demultiplex_porechop { container = 'docker://hecrp/nanoclust-demultiplex_porechop' }
      withName: QC { container = 'docker://hecrp/nanoclust-qc' }
      withName: fastqc { container = 'docker://hecrp/nanoclust-fastqc' }
      withName: multiqc { container = 'docker://hecrp/nanoclust-fastqc' }
      withName: kmer_freqs { container = 'docker://hecrp/nanoclust-kmer_freqs' }
      withName: read_clustering { container = 'docker://hecrp/nanoclust-read_clustering' }
      withName: split_by_cluster { container = 'docker://hecrp/nanoclust-split_by_cluster' }
      withName: read_correction { container = 'docker://hecrp/nanoclust-read_correction' }
      withName: draft_selection { container = 'docker://hecrp/nanoclust-draft_selection' }
      withName: racon_pass { container = 'docker://hecrp/nanoclust-racon_pass' }
      withName: medaka_pass { container = 'docker://hecrp/nanoclust-medaka_pass' }
      withName: consensus_classification { container = 'docker://hecrp/nanoclust-consensus_classification'
                                           singularity.temp = "$baseDir/" }
      withName: get_abundances { container = 'docker://hecrp/nanoclust-plot_abundances' }
      withName: plot_abundances { container = 'docker://hecrp/nanoclust-plot_abundances' }
      withName: output_documentation { container = 'docker://hecrp/nanoclust-output_documentation' }
    }
  }
}

hoohugokim commented 2 years ago

I had the same issue under a conda environment. In my case it seems to have stemmed from a version discrepancy between the ./conda_envs/read_clustering/environment.yml file and the repository. I changed the versions of hdbscan and umap-learn to the newest ones found with conda search <package>, and it is working fine now.
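
The `TypeError` in the first traceback says that `delayed()` was called with a `check_pickle` keyword it does not accept, which is consistent with the hdbscan and joblib versions in the environment being out of step. Below is a small sketch of how one might check for this kind of mismatch with the standard `inspect` module. The two `delayed_*` functions are hypothetical stand-ins written for this example, not the real joblib code; in an actual environment you would pass `joblib.delayed` to the helper instead.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        kind = params[name].kind
        return kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an older and a newer `delayed` signature.
def delayed_old(function, check_pickle=None):
    return function


def delayed_new(function):
    return function


print(accepts_keyword(delayed_old, "check_pickle"))  # True
print(accepts_keyword(delayed_new, "check_pickle"))  # False
```

If the helper returns `False` for the installed `joblib.delayed`, a caller that still passes `check_pickle` would be expected to fail in the way shown in the traceback.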

timyerg commented 7 months ago

@hoohugokim Thank you for your comment. In my case, removing all of the specified package versions worked. I finally got through that step.
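
Removing the pins by hand works, but it could also be scripted. The sketch below is one way to do that with a regular expression, assuming the usual `- name=version` layout of a conda environment file. The function name `strip_version_pins` and the sample file content (including the version numbers) are invented for illustration and are not copied from the NanoCLUST repository.

```python
import re

# A list item such as "  - hdbscan=0.8.26" or "    - umap-learn==0.4.6":
# group 1 is the indentation and dash, group 2 the package name.
PIN = re.compile(r"^(\s*-\s*)([A-Za-z0-9_.\-]+)\s*[=<>!~].*$")


def strip_version_pins(text):
    """Drop version constraints from dependency lines, keep everything else."""
    cleaned = [PIN.sub(r"\1\2", line) for line in text.splitlines()]
    return "\n".join(cleaned) + "\n"


# Hypothetical environment file; the package versions are made up.
SAMPLE = """\
name: read_clustering
channels:
  - conda-forge
dependencies:
  - python=3.8
  - hdbscan=0.8.26
  - pip:
    - umap-learn==0.4.6
"""

print(strip_version_pins(SAMPLE))
```

Lines without a version constraint, such as channel names or the `- pip:` header, do not match the pattern and pass through unchanged. Unpinned environments resolve to whatever versions are current, so results may differ between installs.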