genomicsITER / NanoCLUST

NanoCLUST is an analysis pipeline for UMAP-based classification of amplicon-based full-length 16S rRNA nanopore reads
MIT License

Performing differential OTU using nanoclust data #37

Open kcmtest opened 3 years ago

kcmtest commented 3 years ago
The pipeline ran after I removed the samples that were causing the error:
executor >  local (6046)
[a2/6ffd4c] process > QC (69)                         [100%] 73 of 73 ✔
[c6/b1e07e] process > fastqc (73)                     [100%] 73 of 73 ✔
[0b/655df7] process > kmer_freqs (67)                 [100%] 73 of 73 ✔
[1c/1ca964] process > read_clustering (71)            [100%] 73 of 73 ✔
[79/e62849] process > split_by_cluster (73)           [100%] 73 of 73 ✔
[a9/c29c19] process > read_correction (1046)          [100%] 1048 of 1048 ✔
[27/40b6e0] process > draft_selection (1048)          [100%] 1048 of 1048 ✔
[3a/e5d1e2] process > racon_pass (1048)               [100%] 1048 of 1048 ✔
[8b/936ff4] process > medaka_pass (1048)              [100%] 1048 of 1048 ✔
[a2/d78685] process > consensus_classification (1048) [100%] 1050 of 1050, failed: 2, retries: 2 ✔
[42/532a74] process > join_results (73)               [100%] 73 of 73 ✔
[b3/746fed] process > get_abundances (73)             [100%] 73 of 73 ✔
[db/dd86da] process > plot_abundances (292)           [100%] 292 of 292 ✔
[84/796ce6] process > output_documentation            [100%] 1 of 1 ✔
[nf-core/nanoclust] Pipeline completed successfully
WARN: [nf-core/nanoclust] Could not attach MultiQC report to summary email
Completed at: 20-May-2021 20:14:01
Duration    : 1h 57m 23s
CPU hours   : 97.6 (0% failed)
Succeeded   : 6'044
Failed      : 2

Previously I ran Kraken2, generated an OTU table at various taxonomic ranks, and then tested for differentially abundant OTUs with DESeq2, since the table contained raw counts.

How can I do the same with the NanoCLUST output? It gives relative abundances.

Any suggestions on how to go about this?

mansi-aai commented 1 year ago

@kcmtest Did you find a way to get an OTU count table? I also need that table to calculate alpha and beta diversity. Thank you!

timyerg commented 8 months ago

If anyone is still wondering: in each *_nanoclust_out.txt file, the "reads_in_cluster" column holds the read counts that are used to calculate relative abundances at the species level. These counts can be used for alpha/beta diversity and for differential abundance tests that require raw/absolute counts (LEfSe is fine with relative abundances). To get counts for ranks above species (genus, family, and so on), add the full taxonomy as columns and collapse the counts.
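As a rough sketch of how the per-sample "reads_in_cluster" values could be combined into one taxid-by-sample count table (the kind of raw-count matrix DESeq2 expects), something like the following might work. It uses only the Python standard library. The function names are made up for illustration, and the comma delimiter is an assumption about how the final *_nanoclust_out.txt files are written; adjust `sep` if your files differ.

```python
import csv
from collections import defaultdict

def read_cluster_counts(path, sep=','):
    """Sum reads_in_cluster per taxid for one *_nanoclust_out.txt file.

    Assumes the file has 'taxid' and 'reads_in_cluster' columns and is
    delimited by `sep` (comma is an assumption, not confirmed here).
    """
    counts = defaultdict(int)
    with open(path, newline='') as handle:
        for row in csv.DictReader(handle, delimiter=sep):
            # Several clusters can be assigned the same taxid; sum their reads.
            counts[row['taxid']] += int(float(row['reads_in_cluster']))
    return dict(counts)

def build_count_table(sample_counts):
    """Turn {sample: {taxid: reads}} into (taxids, samples, rows).

    rows[i][j] is the read count of taxids[i] in samples[j];
    taxa missing from a sample get a count of 0.
    """
    samples = sorted(sample_counts)
    taxids = sorted({t for c in sample_counts.values() for t in c})
    rows = [[sample_counts[s].get(t, 0) for s in samples] for t in taxids]
    return taxids, samples, rows

def write_count_table(path, taxids, samples, rows):
    """Write the table as CSV: one row per taxid, one column per sample."""
    with open(path, 'w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['taxid'] + samples)
        for taxid, row in zip(taxids, rows):
            writer.writerow([taxid] + row)
```

One would call `read_cluster_counts` once per sample, collect the results in a dict keyed by sample name, and pass that dict to `build_count_table`. The resulting CSV can then be loaded into R or any other tool that takes a raw count matrix.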

Here is the function that was used by developers to calculate relative abundance:

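import pandas as pd

def get_abundance_values(names,paths):
    dfs = []
    for name,path in zip(names,paths):
        # Read the per-sample table (';'-separated) and drop its first column
        data = pd.read_csv(path, index_col=False, sep=';').iloc[:,1:]

        # Total number of reads across all clusters in this sample
        total = sum(data['reads_in_cluster'])
        rel_abundance=[]

        # Relative abundance = reads in the cluster / total reads in the sample
        for index,row in data.iterrows():
            rel_abundance.append(row['reads_in_cluster'] / total)

        data['rel_abundance'] = rel_abundance
        dfs.append(pd.DataFrame({'taxid': data['taxid'], 'rel_abundance': rel_abundance}))
        # Write the table, now including rel_abundance, to <name>_nanoclust_out.txt
        data.to_csv(name + "_nanoclust_out.txt")

    return dfs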
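
For the rank-collapsing step mentioned above, a minimal sketch could look like this. It assumes you already have a taxid-to-rank lookup (for example, built from the NCBI taxonomy); NanoCLUST is not assumed to provide one. The function name `collapse_counts` and the example mapping are purely illustrative.

```python
from collections import defaultdict

def collapse_counts(counts, taxid_to_rank, unassigned='unassigned'):
    """Sum species-level read counts into a higher rank.

    counts:        {taxid: reads}, e.g. summed reads_in_cluster per taxid
    taxid_to_rank: {taxid: name at the target rank}, supplied by the user
    Taxids missing from the mapping are pooled under `unassigned`.
    """
    collapsed = defaultdict(int)
    for taxid, reads in counts.items():
        collapsed[taxid_to_rank.get(taxid, unassigned)] += reads
    return dict(collapsed)

# Illustrative input only: made-up counts for three species-level taxids.
example_counts = {'1280': 7, '1282': 3, '562': 15}
example_genus = {
    '1280': 'Staphylococcus',  # Staphylococcus aureus
    '1282': 'Staphylococcus',  # Staphylococcus epidermidis
    '562': 'Escherichia',      # Escherichia coli
}
genus_counts = collapse_counts(example_counts, example_genus)
```

Collapsing raw counts rather than relative abundances keeps the result usable for tools that need integer counts.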