genomicsITER / NanoCLUST

NanoCLUST is an analysis pipeline for UMAP-based classification of amplicon-based full-length 16S rRNA nanopore reads
MIT License

amplicons of whole operon (16S-ITS-23S, ~4.1k mean read length) failed at read_clustering step #93

Open chris-krohn opened 3 months ago

chris-krohn commented 3 months ago

Hi all, The read_clustering step failed for whole operons (16S-ITS-23S, ~4.1k mean read length).

The whole nextflow pipeline ran through without errors (on a Mac M1) with test fastq files mock4_run3bc08_5000.fastq and with other full-length 16S sequences.

However, whole operons (~4.1k mean read length, 25,000 reads per sample) failed at read_clustering - see below. These fastqs are the result of SUP basecalling and demultiplexing with dorado. Library prep was done with the Native Barcoding Kit 96 V14. Full protocol here. This is a trial to test the utility of whole-operon microbial profiling. One of the reads here: S561_AD1r2.txt.

It seems that it has something to do with UMAP not producing a suitable DataFrame for `pd.concat` to work on correctly in the line `umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)` of the umap_hdbscan.py script.
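For what it's worth, a minimal sketch of a defensive rewrite of that line (not the pipeline's actual fix, just an illustration): since `pd.concat` aligns on the index along `axis=1`, resetting each input's index first makes the concatenation positional and sidesteps any index-union problem.

```python
import pandas as pd

# Toy stand-ins for the pipeline's data frames: df carries a MultiIndex
# (as can happen when freqs.txt is mis-parsed), df_umap a plain RangeIndex.
df = pd.DataFrame(
    {"read": ["r1", "r2"], "length": [4100, 4050]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("b", 2)]),
)
df_umap = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], columns=["D1", "D2"])

# Dropping the indexes first makes the row-wise concatenation positional.
umap_out = pd.concat(
    [df["read"].reset_index(drop=True),
     df["length"].reset_index(drop=True),
     df_umap],
    axis=1,
)
print(list(umap_out.columns))  # ['read', 'length', 'D1', 'D2']
```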

It would be awesome if someone has ideas for troubleshooting. Cheers, Chris

Workflow execution completed unsuccessfully!

The exit status of the task that caused the workflow execution to fail was: 1.

The full error message was: Error executing process > 'read_clustering (1)'. Caused by: Process `read_clustering (1)` terminated with an error exit status (1). Command executed [/Users/christiankrohn/Applications/NanoCLUST-master/templates/umap_hdbscan.py]:

```python
#!/usr/bin/env python

import numpy as np
import umap
import matplotlib.pyplot as plt
from sklearn import decomposition
import random
import pandas as pd
import hdbscan

df = pd.read_csv("freqs.txt", delimiter=" ")

# UMAP
motifs = [x for x in df.columns.values if x not in ["read", "length"]]
X = df.loc[:, motifs]
X_embedded = umap.UMAP(n_neighbors=15, min_dist=0.1, verbose=2).fit_transform(X)

df_umap = pd.DataFrame(X_embedded, columns=["D1", "D2"])
umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)

# HDBSCAN
X = umap_out.loc[:, ["D1", "D2"]]
umap_out["bin_id"] = hdbscan.HDBSCAN(min_cluster_size=int(50), cluster_selection_epsilon=int(0.5)).fit_predict(X)

# PLOT
plt.figure(figsize=(20, 20))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=umap_out["bin_id"], cmap='Spectral', s=1)
plt.xlabel("UMAP1", fontsize=18)
plt.ylabel("UMAP2", fontsize=18)
plt.gca().set_aspect('equal', 'datalim')
plt.title("Projecting " + str(len(umap_out['bin_id'])) + " reads. "
          + str(len(umap_out['bin_id'].unique())) + " clusters generated by HDBSCAN", fontsize=18)

for cluster in np.sort(umap_out['bin_id'].unique()):
    read = umap_out.loc[umap_out['bin_id'] == cluster].iloc[0]
    plt.annotate(str(cluster), (read['D1'], read['D2']), weight='bold', size=14)

plt.savefig('hdbscan.output.png')
umap_out.to_csv("hdbscan.output.tsv", sep=" ", index=False)
```

Command exit status: 1

Command output:

```
UMAP(verbose=2)
Construct fuzzy simplicial set
Sun Jul 28 01:14:21 2024 Finding Nearest Neighbors
Sun Jul 28 01:14:23 2024 Finished Nearest Neighbor Search
Sun Jul 28 01:14:24 2024 Construct embedding
completed 0 / 500 epochs
completed 50 / 500 epochs
completed 100 / 500 epochs
completed 150 / 500 epochs
completed 200 / 500 epochs
completed 250 / 500 epochs
completed 300 / 500 epochs
completed 350 / 500 epochs
completed 400 / 500 epochs
completed 450 / 500 epochs
Sun Jul 28 01:14:25 2024 Finished embedding
```

Command error:

```
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-0xgxd5ov because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
UMAP(verbose=2)
Construct fuzzy simplicial set
Sun Jul 28 01:14:21 2024 Finding Nearest Neighbors
Sun Jul 28 01:14:23 2024 Finished Nearest Neighbor Search
Sun Jul 28 01:14:24 2024 Construct embedding
completed 0 / 500 epochs
completed 50 / 500 epochs
completed 100 / 500 epochs
completed 150 / 500 epochs
completed 200 / 500 epochs
completed 250 / 500 epochs
completed 300 / 500 epochs
completed 350 / 500 epochs
completed 400 / 500 epochs
completed 450 / 500 epochs
Sun Jul 28 01:14:25 2024 Finished embedding
Traceback (most recent call last):
  File ".command.sh", line 19, in <module>
    umap_out = pd.concat([df["read"], df["length"], df_umap], axis=1)
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 274, in concat
    op = _Concatenator(
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 454, in __init__
    self.new_axes = self._get_new_axes()
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 519, in _get_new_axes
    return [
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 520, in <listcomp>
    self._get_concat_axis() if i == self.bm_axis else self._get_comb_axis(i)
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 526, in _get_comb_axis
    return get_objs_combined_axis(
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 92, in get_objs_combined_axis
    return _get_combined_index(obs_idxes, intersect=intersect, sort=sort, copy=copy)
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 145, in _get_combined_index
    index = union_indexes(indexes, sort=sort)
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/indexes/api.py", line 217, in union_indexes
    result = result.union(other)
  File "/opt/conda/envs/read_clustering/lib/python3.8/site-packages/pandas/core/indexes/multi.py", line 3360, in union
    raise NotImplementedError(
NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
```

Work dir: /Users/username/Applications/NanoCLUST-master/work/02/e9194c824e17a756ca11d4a63eb26e

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

lucas-g-huggins commented 3 months ago

I have also gotten this same error message multiple times and am not sure what's causing it. I have analysed a similar dataset successfully before, so I'm not sure why this error is being thrown up with the current one. Any advice would be greatly appreciated!

Cheers, Lucas

lucas-g-huggins commented 3 months ago

Hi

I worked out a solution with the aid of this post:

https://github.com/genomicsITER/NanoCLUST/issues/84

The error may be occurring due to how the freqs.txt file is built. In my case, the long names of my reads were throwing off the construction of this freqs.txt file - see attached Picture1.

When I shortened the names of my reads so that they stop at the first space/tab, the freqs.txt file was constructed properly and the pipeline proceeded.
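A minimal reconstruction of that failure mode, assuming freqs.txt is space-delimited and parsed with pandas (the token counts and read names here are made up for illustration): extra space-separated tokens in a read name leave the data rows with more fields than the header, so pandas promotes the surplus leading columns to a (Multi)Index - which is exactly what the later `pd.concat` then fails to union with a plain RangeIndex.

```python
import io
import pandas as pd

header = "read length A C G T\n"
clean_row = "read1 1500 0.25 0.25 0.25 0.25\n"
# A nanopore-style header kept beyond the first space adds two extra tokens:
dirty_row = "read1 runid=abc ch=42 1500 0.25 0.25 0.25 0.25\n"

df_clean = pd.read_csv(io.StringIO(header + clean_row), delimiter=" ")
df_dirty = pd.read_csv(io.StringIO(header + dirty_row), delimiter=" ")

# With more data fields than header names, pandas uses the surplus leading
# columns as the index - here a two-level MultiIndex.
print(type(df_clean.index).__name__)  # RangeIndex
print(type(df_dirty.index).__name__)  # MultiIndex
```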

Cheers

chris-krohn commented 3 months ago

Thank you! I will try that and update on results once done.

chris-krohn commented 2 months ago

Hi!

I doubt it is the read names in my case. I will keep trying, but looking at the reads, they come up fine in the freq.txt file.

However, I did notice that filtering was not done as expected.

I ran `nextflow run main.nf -profile docker --reads 'pathto/barcode01/S561_AD1r2.fastq' --db "db/16S_ribosomal_RNA" --tax "db/taxdb/" --min_cluster_size 50 --umap_set_size 100000 --min_read_length 1000 --max_read_length 5000`

However, instead of min 1000 / max 5000, fastp filtered with the defaults (min 1400, max 1700 bp). It somehow executed this command:

```bash
#!/bin/bash -euo pipefail
barcode=S561_AD1r2
fastp -i S561_AD1r2.fastq -q 8 -l 1400 --length_limit 1700 -o $barcode\_qced_reads.fastq
#perl prinseq-lite.pl -fastq S561_AD1r2.fastq -out_good qced_reads -min_len 1400 -max_len 1700 -log qc_log -min_qual_mean 8
head -n$(( 100000*4 )) $barcode\_qced_reads.fastq > $barcode\_qced_reads_set.fastq
# capture process environment
set +u
echo barcode=${barcode[@]} > .command.env
```

...resulting in most of the reads being removed (see below). Perhaps that was the reason for UMAP not proceeding properly?
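One quick way to check that (a sketch, not part of the pipeline - the toy read lengths below are made up): count how many reads fall inside the length window fastp actually applied versus the one requested.

```python
def read_lengths(fastq_lines):
    """Lengths of the sequence line (2nd of each 4-line FASTQ record)."""
    return [len(line.strip())
            for i, line in enumerate(fastq_lines) if i % 4 == 1]

# Toy records: two ~4.1 kb operon reads and one 1.5 kb read.
toy_fastq = []
for seq in ["A" * 4100, "A" * 4050, "A" * 1500]:
    toy_fastq += ["@read", seq, "+", "I" * len(seq)]

lengths = read_lengths(toy_fastq)
print(sum(1400 <= n <= 1700 for n in lengths))  # applied 1400-1700 bp window -> 1
print(sum(1000 <= n <= 5000 for n in lengths))  # requested 1000-5000 bp window -> 3
```

On the real data one would pass an open file handle instead of the toy list, e.g. `read_lengths(open("S561_AD1r2.fastq"))`.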

I don't know if this is a bug, nor how to help correct the code if it is.

If anyone knows - kindly let me know if possible.

Thanks! Chris

Screenshot 2024-09-16 at 2 10 58 PM