genomicsITER / NanoCLUST

NanoCLUST is an analysis pipeline for UMAP-based classification of amplicon-based full-length 16S rRNA nanopore reads
MIT License

Error read_clustering (ValueError: could not convert string to float: 'TTTTG') #84

Open SieglindeCoppens opened 1 year ago

Hi!

I was getting the following error for both the test data and my own data:

executor >  local (5)
[98/2d4484] process > QC (1)                   [100%] 1 of 1 ✔
[87/209bff] process > fastqc (1)               [100%] 1 of 1 ✔
[9b/223d33] process > kmer_freqs (1)           [100%] 1 of 1 ✔
[e6/82bfaa] process > read_clustering (1)      [100%] 1 of 1, failed: 1 ✘
[-        ] process > split_by_cluster         -
[-        ] process > read_correction          -
[-        ] process > draft_selection          -
[-        ] process > racon_pass               -
[-        ] process > medaka_pass              -
[-        ] process > consensus_classification -
[-        ] process > join_results             -
[-        ] process > get_abundances           -
[-        ] process > plot_abundances          -
[80/872ff6] process > output_documentation     [100%] 1 of 1 ✔
Error executing process > 'read_clustering (1)'

Caused by:
  Process `read_clustering (1)` terminated with an error exit status (1)

Command executed [/home/idun/1_Software/NanoCLUST/templates/umap_hdbscan.py]:

  #!/usr/bin/env python

  import numpy as np
  import umap
  import matplotlib.pyplot as plt
  from sklearn import decomposition
  import random
  import pandas as pd
  import hdbscan

  df = pd.read_csv("freqs.txt", delimiter="\t")

  #UMAP
  motifs = [x for x in df.columns.values if x not in ["read", "length"]]
  X = df.loc[:,motifs]
  X_embedded = umap.UMAP(n_neighbors=15, min_dist=0.1, verbose=2).fit_transform(X)

  df_umap = pd.DataFrame(X_embedded, columns=["D1", "D2"])
  umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)

  #HDBSCAN
  X = umap_out.loc[:,["D1", "D2"]]
  umap_out["bin_id"] = hdbscan.HDBSCAN(min_cluster_size=int(50), cluster_selection_epsilon=int(0.5)).fit_predict(X)

  #PLOT
  plt.figure(figsize=(20,20))
  plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=umap_out["bin_id"], cmap='Spectral', s=1)
  plt.xlabel("UMAP1", fontsize=18)
  plt.ylabel("UMAP2", fontsize=18)
  plt.gca().set_aspect('equal', 'datalim')
  plt.title("Projecting " + str(len(umap_out['bin_id'])) + " reads. " + str(len(umap_out['bin_id'].unique())) + " clusters generated by HDBSCAN", fontsize=18)

  for cluster in np.sort(umap_out['bin_id'].unique()):
      read = umap_out.loc[umap_out['bin_id'] == cluster].iloc[0]
      plt.annotate(str(cluster), (read['D1'], read['D2']), weight='bold', size=14)

  plt.savefig('hdbscan.output.png')
  umap_out.to_csv("hdbscan.output.tsv", sep="\t", index=False)

Command exit status:
  1

Command output:
  (empty)

Command error:
  Matplotlib created a temporary config/cache directory at /tmp/matplotlib-dyrbsl_v because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
  sys:1: DtypeWarning: Columns (1,2,3,...,512,513) have mixed types.Specify dtype option on import or set low_memory=False.
  Traceback (most recent call last):
    File ".command.sh", line 16, in <module>
      X_embedded = umap.UMAP(n_neighbors=15, min_dist=0.1, verbose=2).fit_transform(X)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/umap/umap_.py", line 2014, in fit_transform
      self.fit(X, y)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/umap/umap_.py", line 1613, in fit
      X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C")
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/sklearn/utils/validation.py", line 72, in inner_f
      return f(**kwargs)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/sklearn/utils/validation.py", line 598, in check_array
      array = np.asarray(array, order=order, dtype=dtype)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
      return array(a, dtype, copy=False, order=order)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/generic.py", line 1778, in __array__
      return np.asarray(self._values, dtype=dtype)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
      return array(a, dtype, copy=False, order=order)
  ValueError: could not convert string to float: 'TTTTG'

Work dir:
  /home/idun/1_Software/NanoCLUST/work/e6/82bfaa94d00dc318b1037dc0f4851f

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

It seemed to be caused by an extra first line in freqs.txt that was not being skipped, so the dataframe in the umap_hdbscan.py script was not loaded correctly: the real header row (containing k-mer names such as 'TTTTG') was read as data.

I changed line 11 of umap_hdbscan.py to skip the first line.

From:
  df = pd.read_csv("$kmer_freqs", delimiter="\t")
To:
  df = pd.read_csv("$kmer_freqs", delimiter="\t", skiprows=[0])

And now it works fine for me.

I just wanted to note this issue in case anyone else encounters it!
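
For anyone hitting the same error, a slightly more defensive variant of the fix could look like the sketch below. It is not part of NanoCLUST: the helper name `load_kmer_freqs` and its parameters are hypothetical, and it assumes the real header line of freqs.txt is tab-separated and starts with a `read` field (as the script's use of the "read" and "length" columns suggests). Instead of always skipping exactly one line, it looks for the header and skips whatever precedes it, then checks that the k-mer columns are numeric before they are passed to UMAP.

```python
import pandas as pd


def load_kmer_freqs(path, sep="\t", first_column="read"):
    """Load a k-mer frequency table, skipping any lines before the real header.

    Hypothetical helper: assumes the header's first field equals `first_column`.
    Returns the dataframe and the list of k-mer (motif) column names.
    """
    # First pass: find the index of the line that looks like the real header.
    with open(path) as handle:
        for header_idx, line in enumerate(handle):
            if line.rstrip("\n").split(sep)[0] == first_column:
                break
        else:
            raise ValueError(f"no header line starting with '{first_column}' in {path}")

    # Second pass: let pandas parse the table, skipping the lines before the header.
    df = pd.read_csv(path, sep=sep, skiprows=header_idx)

    # Every column except the read ID and read length is treated as a k-mer frequency.
    motifs = [c for c in df.columns if c not in ("read", "length")]

    # Fail early with a readable message instead of a ValueError deep inside UMAP.
    non_numeric = [c for c in motifs if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"non-numeric k-mer columns, e.g. {non_numeric[:5]}")

    return df, motifs
```

With this in place, `df, motifs = load_kmer_freqs("freqs.txt")` followed by `X = df.loc[:, motifs]` would take the place of the `pd.read_csv` call and the motif list comprehension in the script. If the file has no extra leading line, the header is found at index 0 and nothing is skipped.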