chris-krohn opened 3 months ago
I have also gotten this same error message multiple times and am not sure what's causing it. I have analysed a similar dataset successfully before, so I'm not sure why this error is being thrown for the current one. Any advice would be greatly appreciated!
Cheers, Lucas
Hi
I worked out a solution with the aid of this post:
https://github.com/genomicsITER/NanoCLUST/issues/84
The error may be occurring due to how the freqs.txt file is built. In my case, the long names of my reads were breaking the construction of the freqs.txt file (see attached).
When I changed the names of my reads to be shorter and to stop at the first space/tab, the freqs.txt file was constructed properly and the pipeline proceeded.
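For example, here is a rough sketch of the renaming step I mean (untested on your data; the file names are placeholders): it keeps only the part of each FASTQ header before the first whitespace.

```python
# Sketch: truncate FASTQ read names at the first space/tab so that
# downstream steps can build freqs.txt correctly.
# "reads.fastq" / "reads_renamed.fastq" are placeholder file names.
with open("reads.fastq") as fin, open("reads_renamed.fastq", "w") as fout:
    for i, line in enumerate(fin):
        if i % 4 == 0:  # header line of each 4-line FASTQ record
            fout.write(line.split()[0] + "\n")
        else:
            fout.write(line)
```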
Cheers
Thank you! I will try that and report back with the results once done.
Hi!
I doubt it is the read names in my case. I will keep trying, but looking at the reads, they come up fine in the `freqs.txt` file.
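If anyone wants to reproduce this check, something along these lines (a sketch, assuming pandas and the `freqs.txt` from the failing work dir) shows whether the file parsed with a flat index:

```python
import pandas as pd

# Sketch: parse freqs.txt the same way umap_hdbscan.py does and check
# the index type; a MultiIndex here would explain the pd.concat error.
df = pd.read_csv("freqs.txt", delimiter=" ")
print(type(df.index))  # expect a plain RangeIndex if the file is well formed
print(df.shape, list(df.columns[:5]))
```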
However, I did notice that filtering was not done as expected.
I ran:

```
nextflow run main.nf -profile docker --reads 'pathto/barcode01/S561_AD1r2.fastq' --db "db/16S_ribosomal_RNA" --tax "db/taxdb/" --min_cluster_size 50 --umap_set_size 100000 --min_read_length 1000 --max_read_length 5000
```
However, instead of min 1000 / max 5000, fastp filtered with the defaults (min 1400, max 1700 bp). It somehow executed this command:
```bash
#!/bin/bash -euo pipefail
barcode=S561_AD1r2
fastp -i S561_AD1r2.fastq -q 8 -l 1400 --length_limit 1700 -o $barcode\_qced_reads.fastq
#perl prinseq-lite.pl -fastq S561_AD1r2.fastq -out_good qced_reads -min_len 1400 -max_len 1700 -log qc_log -min_qual_mean 8
head -n$(( 100000*4 )) $barcode\_qced_reads.fastq > $barcode\_qced_reads_set.fastq
# capture process environment
set +u
echo barcode=${barcode[@]} > .command.en
```
...resulting in most of the reads being removed (see below). Perhaps that was the reason for UMAP not proceeding properly?
I don't know if this is a bug, nor how to help correct the code if it is.
If anyone knows, kindly let me know.
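For anyone wanting to verify, this is the kind of quick check I would use (a sketch; the file name comes from the fastp command above) to see which length window was actually applied:

```python
# Sketch: summarise read lengths in the fastp output to see which
# length window was actually applied.
lengths = []
with open("S561_AD1r2_qced_reads.fastq") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:  # the sequence line of each 4-line FASTQ record
            lengths.append(len(line.strip()))
print(f"{len(lengths)} reads, min={min(lengths)}, max={max(lengths)}")
```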
Thanks! Chris
Hi all, the `read_clustering` step failed for whole operons (16S-ITS-23S, ~4.1 kb mean read length).
The whole Nextflow pipeline ran through without errors (on a Mac M1) with the test fastq file `mock4_run3bc08_5000.fastq` and with other full-length 16S sequences. However, whole operons (~4.1 kb mean read length, 25,000 reads per sample) failed `read_clustering`; see below. These fastqs are the result of SUP basecalling and demultiplexing with dorado. Library prep was with the Native Barcoding Kit 96 V14. Full protocol here. This is a trial testing the utility of whole-operon microbial profiling. One of the reads here: S561_AD1r2.txt.
It seems that it has something to do with UMAP not producing a suitable DataFrame for the function `pd.concat` to do its thing correctly in the line `umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)` of the `umap_hdbscan.py` script. It would be awesome if someone has ideas for troubleshooting. Cheers, Chris
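PS: one guess (just an assumption based on the traceback below and the earlier read-name comment): if some rows of freqs.txt contain more fields than the header, for example because read names contain spaces, pandas turns the extra leading fields into a MultiIndex row index, which then cannot be unioned with `df_umap`'s plain RangeIndex during `pd.concat`. A minimal sketch of that idea; it flattens the index as a workaround but would not fix the underlying freqs.txt formatting:

```python
import pandas as pd

# Sketch (assumption, not the pipeline's actual fix): flatten any
# MultiIndex that read_csv built from malformed freqs.txt rows so the
# row labels align with the UMAP DataFrame before pd.concat.
df = pd.read_csv("freqs.txt", delimiter=" ")
if isinstance(df.index, pd.MultiIndex):
    df = df.reset_index(drop=True)

# Stand-in for the real UMAP embedding, just to make the sketch runnable.
df_umap = pd.DataFrame({"D1": [0.0] * len(df), "D2": [0.0] * len(df)})
umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)
```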
Workflow execution completed unsuccessfully!
The exit status of the task that caused the workflow execution to fail was: 1.
The full error message was:

```
Error executing process > 'read_clustering (1)'

Caused by:
  Process `read_clustering (1)` terminated with an error exit status (1)

Command executed [/Users/christiankrohn/Applications/NanoCLUST-master/templates/umap_hdbscan.py]:

  #!/usr/bin/env python

  import numpy as np
  import umap
  import matplotlib.pyplot as plt
  from sklearn import decomposition
  import random
  import pandas as pd
  import hdbscan

  df = pd.read_csv("freqs.txt", delimiter=" ")

  # UMAP
  motifs = [x for x in df.columns.values if x not in ["read", "length"]]
  X = df.loc[:,motifs]
  X_embedded = umap.UMAP(n_neighbors=15, min_dist=0.1, verbose=2).fit_transform(X)
  df_umap = pd.DataFrame(X_embedded, columns=["D1", "D2"])
  umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)

  # HDBSCAN
  X = umap_out.loc[:,["D1", "D2"]]
  umap_out["bin_id"] = hdbscan.HDBSCAN(min_cluster_size=int(50), cluster_selection_epsilon=int(0.5)).fit_predict(X)

  # PLOT
  plt.figure(figsize=(20,20))
  plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=umap_out["bin_id"], cmap='Spectral', s=1)
  plt.xlabel("UMAP1", fontsize=18)
  plt.ylabel("UMAP2", fontsize=18)
  plt.gca().set_aspect('equal', 'datalim')
  plt.title("Projecting " + str(len(umap_out['bin_id'])) + " reads. " + str(len(umap_out['bin_id'].unique())) + " clusters generated by HDBSCAN", fontsize=18)

  for cluster in np.sort(umap_out['bin_id'].unique()):
      read = umap_out.loc[umap_out['bin_id'] == cluster].iloc[0]
      plt.annotate(str(cluster), (read['D1'], read['D2']), weight='bold', size=14)

  plt.savefig('hdbscan.output.png')
  umap_out.to_csv("hdbscan.output.tsv", sep=" ", index=False)

Command exit status:
  1

Command output:
  UMAP(verbose=2)
  Construct fuzzy simplicial set
  Sun Jul 28 01:14:21 2024 Finding Nearest Neighbors
  Sun Jul 28 01:14:23 2024 Finished Nearest Neighbor Search
  Sun Jul 28 01:14:24 2024 Construct embedding
  completed 0 / 500 epochs
  completed 50 / 500 epochs
  completed 100 / 500 epochs
  completed 150 / 500 epochs
  completed 200 / 500 epochs
  completed 250 / 500 epochs
  completed 300 / 500 epochs
  completed 350 / 500 epochs
  completed 400 / 500 epochs
  completed 450 / 500 epochs
  Sun Jul 28 01:14:25 2024 Finished embedding

Command error:
  WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
  Matplotlib created a temporary config/cache directory at /tmp/matplotlib-0xgxd5ov because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
  UMAP(verbose=2)
  Construct fuzzy simplicial set
  Sun Jul 28 01:14:21 2024 Finding Nearest Neighbors
  Sun Jul 28 01:14:23 2024 Finished Nearest Neighbor Search
  Sun Jul 28 01:14:24 2024 Construct embedding
  completed 0 / 500 epochs
  completed 50 / 500 epochs
  completed 100 / 500 epochs
  completed 150 / 500 epochs
  completed 200 / 500 epochs
  completed 250 / 500 epochs
  completed 300 / 500 epochs
  completed 350 / 500 epochs
  completed 400 / 500 epochs
  completed 450 / 500 epochs
  Sun Jul 28 01:14:25 2024 Finished embedding
  Traceback (most recent call last):
    File ".command.sh", line 19, in <module>
      umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 274, in concat
      op = _Concatenator(
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 454, in __init__
      self.new_axes = self._get_new_axes()
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 519, in _get_new_axes
      return [
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 520, in <listcomp>
      self._get_concat_axis() if i == self.bm_axis else self._get_comb_axis(i)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 526, in _get_comb_axis
      return get_objs_combined_axis(
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 92, in get_objs_combined_axis
      return _get_combined_index(obs_idxes, intersect=intersect, sort=sort, copy=copy)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 145, in _get_combined_index
      index = union_indexes(indexes, sort=sort)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 217, in union_indexes
      result = result.union(other)
    File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/indexes/multi.py", line 3360, in union
      raise NotImplementedError(
  NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.

Work dir:
  /Users/username/Applications/NanoCLUST-master/work/02/e9194c824e17a756ca11d4a63eb26e

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
```