epi2me-labs / wf-16s

minimap_pipeline:makeReport (1) error with custom database #26

Closed jadeaver closed 2 months ago

jadeaver commented 3 months ago

Operating System

macOS

Other Linux

No response

Workflow Version

v1.1.2

Workflow Execution

EPI2ME Desktop (Local)

Other workflow execution

No response

EPI2ME Version

v5.1.14

CLI command run

No response

Workflow Execution - CLI Execution Profile

None

What happened?

I created a custom database following the documentation provided in this tutorial. I successfully created the taxdump files, the minimap2 database, and the ref2taxid file. The wf-16s pipeline runs as expected until the makeReport step, where it fails with "Process minimap_pipeline:makeReport (1) terminated with an error exit status (1)" and "KeyError: The following 'id_vars' are not present in the DataFrame: ['species']" (please see the attached nextflow log). I do get the abundance table output, so perhaps I don't really need the full report. However, what might be causing this error, and is there a file in my custom database that I need to fix to get the full report output?

nextflow.log

The first few lines of my output abundance file are below.

tax barcode01   barcode02   barcode03   barcode04   barcode05   barcode06   barcode07   barcode08   total
k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Bacillus; s__midas_s_59078    2   5   4   2   6   5   106052  205601  311677
k__Bacteria; p__Firmicutes; c__Bacilli; o__Staphylococcales; f__Staphylococcaceae; g__Staphylococcus; s__midas_s_9536   2   2   0   2   6   3   99534   192059  291608
k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__midas_s_9538  23  22  10  9   3   14  98911   160855  259847
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Limosilactobacillus; s__Lactobacillus_fermentum 8   1   1   5   1   6   92746   138742  231510

Relevant log output

Workflow logs

No logs yet...

Invocation logs

2024-08-07T12:59Z │ Initialising
2024-08-07T12:59Z │ Reading launch data
2024-08-07T12:59Z │ Acquiring database record
2024-08-07T12:59Z │ Starting workflow invocation
2024-08-07T12:59Z │ Connecting to app via IPC
2024-08-07T12:59Z │ Awaiting weblog via HTTP on 52273
2024-08-07T12:59Z │ Launching nextflow subprocess
2024-08-07T12:59Z │ Uplink established to app
2024-08-07T13:16Z │ Subprocess closed
2024-08-07T13:16Z │ Exiting
2024-08-07T13:19Z │ Initialising
2024-08-07T13:19Z │ Reading launch data
2024-08-07T13:19Z │ Acquiring database record
2024-08-07T13:19Z │ Starting workflow invocation
2024-08-07T13:19Z │ Connecting to app via IPC
2024-08-07T13:19Z │ Awaiting weblog via HTTP on 49497
2024-08-07T13:19Z │ Launching nextflow subprocess
2024-08-07T13:19Z │ Uplink established to app
2024-08-08T03:19Z │ Subprocess closed
2024-08-08T03:19Z │ Exiting
2024-08-08T10:55Z │ Initialising
2024-08-08T10:55Z │ Reading launch data
2024-08-08T10:55Z │ Acquiring database record
2024-08-08T10:55Z │ Starting workflow invocation
2024-08-08T10:55Z │ Connecting to app via IPC
2024-08-08T10:55Z │ Awaiting weblog via HTTP on 62339
2024-08-08T10:55Z │ Launching nextflow subprocess
2024-08-08T10:55Z │ Uplink established to app
2024-08-08T10:55Z │ Subprocess closed
2024-08-08T10:55Z │ Exiting
2024-08-08T11:41Z │ Initialising
2024-08-08T11:41Z │ Reading launch data
2024-08-08T11:41Z │ Acquiring database record
2024-08-08T11:41Z │ Starting workflow invocation
2024-08-08T11:41Z │ Connecting to app via IPC
2024-08-08T11:41Z │ Awaiting weblog via HTTP on 62459
2024-08-08T11:41Z │ Launching nextflow subprocess
2024-08-08T11:41Z │ Uplink established to app
2024-08-08T11:42Z │ Subprocess closed
2024-08-08T11:42Z │ Exiting

Application activity log entry

No items listed in red

Were you able to successfully run the latest version of the workflow with the demo data?

other (please describe below)

Other demo data information

I ran my data (same data as used above with my custom database) using the default settings with minimap2 and the ncbi_16s_18s database. The entire pipeline ran with no issues.

I realized in writing this that I was not using the most up-to-date workflow version (1.2.0) and I am currently re-trying this analysis with the updated workflow.

jadeaver commented 3 months ago

This error did resolve by updating to version 1.2.0.

jadeaver commented 3 months ago

I am actually re-opening because the output report is not as expected. The sunburst plot reports the expected taxa; however, the taxonomy, abundances and diversity figures classify all the sequences as "unknown". Please see both the abundance table and the sunburst plot below.

abundance_table_species.xlsx

Sunburst_plot

nggvs commented 2 months ago

Hi @jadeaver ,

Sorry for the delay, I'll take a look at it. I take it that the sunburst and the sankey plots are working and the problem is in the rest of the plots. Could you paste the first few lines of the two database files: the ref2taxid and the FASTA?

jadeaver commented 2 months ago

Thanks for taking a look into this. Yes, the sunburst and sankey plots are working. The plots/tables under taxonomy, abundances, and alpha diversity are showing all as "unknown".

The first few lines of the ref2taxid are:

FLASV1.1417 895459642
FLASV2.1445 893084087
FLASV3.1527 60446185

The first few entries of the fasta are:

>FLASV1.1417
GATGAACGCTGGCGGCGTGCTTAACACATGCAAGTTGAACGGTCTGCTTAGGTAGACAGTGGCGCACGGGTGAGTAACGC
GTAGGTGACCTATCCTTTAGTGGGGGATAACTCAGGGAAACTTGAGCTAATACCGCATGAGCTTGTGGTTGTTAGAGGGC
CACAAGGAAAGCAGCAATGCGCTGAGGGAGGGGCCTGCGTCCGATTAGCTAGTTGGCAAGGTAACGGCTTACCAAGGCGA
TGATCGGTAGCTGGTCTGAGAGGACGATCAGCCACATTGGCACTGAGACACGGGCCAAACTCCTACGGGAGGCAGCAGTG
AGGAATATTGGGCAATGGCCGAAAGGCTGACCCAGCAACGCCGCGTGGAGGACGAAGGCTTTCGGGTTGTAAACTCCTTT
TCCGGGGGACGAGGAAGGACGGTACCCTGGGAATAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAAAACGTAGGTG
GCGAGCGTTATCCGGATTTACTGGGCGTAAAGAGCGCGTAGGTGGTTGAGTAAGTTGGATGTAAAATCTCTTGGCTTAAC
TGGGAGGAGACGTTCAAGACTGCTTGGCTTGAGGGCGAGAGAGGGGTGCAGAATTCCCGGTGTAGTGGTGGAATGCGTAG
ATATCGGGAGGAATACCAGTGGCGAAAGCGGCGCCCTGGCTCGCAACTGACACTGAGGCGCGAAAGCGTGGGTAGCGAAC
GGGATTAGATACCCCGGTAGTCCACGCTGTAAACGATGTGAACTGGGTGTTGGCGGTATGAATTCCGTCGGTGCCGTAGC
AAACGCGATAAGTTCACCGCCTGGGGAGTACGGTCGCAAGGCTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCAG
CGGAGCGTGTGGTTTAATTCGATGCAACGCGAAAAACCTTACCTGGGTTTGACATGGGCGTAGTAGTGAACCGAAAGGGG
AACGAGCCTTCGGGCAGCGTCCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCA
ACGAGCGCAACCCCTGTTGCCAGTTATAAGTGTCTGGCGAGACTGCCGGTATCAAGCCGGAGGAAGGTGGGGATGACGTC
AAGTCAGCATGGCCTTTATATCCAGGGCTACACACACGCTACAATGGTCGGTACAGAGGGTTGCAAAGCCGCGAGGTAGA
GCTAATCTCACAAAGCCGGCCTCAGTTCAGATTGGAGGCTGCAACTCGCCTCCATGAAGTCGGAGTTGCTAGTAATCGCC
GGTCAGCAATACGGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACGTCATGGGAGCTGGTAACACCTGAA
GTCGGTGAGCTAACCGCGAGGAGGCAGCCGCCGAGGGTGGGACTAGTGACTGGGACG
>FLASV2.1445
GACGAACGCTGGCGGCATGCCTAATACATGCAAGTCGAACGCGACCAGCCGGTGCTTGCACTGGCGAAGTCGAGTGGCGA
ACGGGTGAGTAACACGTGAGAAACCTACCCTGGAGTGGGGAATAACTCGAAGAAATTCGAGCTAATACCGCATACCTTCT
TACCGTCGAATGGTGGTTTGAAGAAAGATTTATCGCTCTGGGAGGGTCTCGCGGCCTATCAGCTAGTTGGTGAGGTAACG
GCTCACCAAGGCATCGACGGGTAGCTGGTCTGAGAGGACGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTAC
GGGAGGCAGCAGTAGGGAATCTTGCGCAATGGGCGAAAGCCTGACGCAGCAATGCCGCGTGCGGGACGAAGGCCCTAGGG
TCGTAAACCGCTTTCAGTAGGGACGAAAATGACGGTACCTGCAGAAGAAGCTCCGGCCAACTACGTGCCAGCAGCCGCGG
TGATACGTAGGGAGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCTCGTAGGTGGTTGAGTAAGTCAGATGTGAAAT
CTCAGGGCCCAACCCTGAGCGTGCATTTGATACTGCTCTGACTAGAGTCCGGTAGGGGAGTGCGGAATTCCTGGTGTAGC
GGTGAAATGCGCAGATATCAGGAGGAACACCGACAGCGAAGGCAGCACTCTGGGCCGGTACTGACACTGAGGAGCGAAAG
CATGGGTAGCAAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGTTGGGCACTAGGTGTGGGGAGAACTCAACTC
TCTCCGCGCCGTAGCTAACGCATTAAGTGCCCCGCCTGGGGAGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGG
GGCCCGCACAAGCGGCGGAGCATGTTGCTTAATTCGAGGCAACGCGAAGAACCTTACCTGGGTTGAACTACGTGGGAAAA
GCCGCAGAGATGCGGTGTCCTTCGGGGTCCACGATAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGT
TAAGTCCCGCAACGAGCGCAACCCTTGTCCTATGTTGCCAGCGGGTAAAGCCGGGGACTCGTAGGAGACTGCCGGGGTCA
ACTCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGCCCCTTATGTCCAGGGCTGCAAACATGCTACAATGGCCGGTAC
AACGGGCAGCTAAACCGCGAGGTCAAGCGAATCCCACAAAGCCGGTCTCAGTTCGGATTGAAGTCTGCAACTCGACTTCA
TGAAGCTGGAGTCGCTAGTAATCCCGGATCAGCAACGCCGGGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCA
CACGCCGAAAGTCGGCAACACCCGAAGTCAGTGGCCCAACCCCTAGGGGAGGGAGCTGCCGAAGGTGGGGCTGGCGATTG
GGGTG
>FLASV3.1527
CTTCGACGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCATGCCTAATACATGCAAGTCGAACGCGGCCATCCG
GTGCTTGCACTGGTGAAGCCGAGTGGCGAACGGGTGAGTAACACGTGAGAAACCTGCCCTGGAGTGGGGAATAACTCGAA
GAAATTCGAGCTAATACCGCATACCTTCTCTTCACCGCATGGTGAGTTGAAGAAAGATTTATCGCTCTAGGAGGGTCTCG
CGGCCTATCAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCATCGACGGGTAGCTGGTCTGAGAGGACGATCAGCCACAC
TGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTGCGCAATGGGCGAAAGCCTGACGCAGCA
ATGCCGCGTGCGGGACGAAGGCCCTAGGGTCGTAAACCGCTTTCAGTAGGGACGAAAATGACGGTACCTGCAGAAGAAGC
TCCGGCCAACTACGTGCCAGCAGCCGCGGTGATACGTAGGGAGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGCTCG
TAGGTGGTTGAGTAAGTCAGATGTGAAATCTCAGGGCCCAACCCTGAGCCTGCATTTGATACTGCTCTGACTAGAGTCCG
GTAGGGGAGTGCGGAACTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAAGAACACCGACAGCGAAGGCAGCACTCT
GGGCCGGTACTGACACTGAGGAGCGAAAGCATGGGTAGCAAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGTT
GGGCACTAGGTGTGGGGAGAACTCAACTCTCTCCGCGCCGTAGCTAACGCATTAAGTGCCCCGCCTGGGGAGTACGGCCG
CAAGGCTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTTGCTTAATTCGAGGCAACGCGAAGAA
CCTTACCTGGGTTGAACTACGTGGGAAAAGCCGCAGAGATGCGGTGTCCTTCGGGGTCCACGATAGGTGGTGCATGGCTG
TCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCTATGTTGCCAGCGGGTAAAGC
CGGGGACTCGTAGGAGACTGCCGGGGTCAACTCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGCCCCTTATGTCCAG
GGCTGCAAACATGCTACAATGGCCGGTACAAAGGGCAGCTAAACCGCGAGGTCAAGCGAATCCCAAAAAGCCGGTCTCAG
TTCGGATTGAAGTCTGCAACTCGACTTCATGAAGCTGGAGTCGCTAGTAATCCCGGATCAGCAACGCCGGGGTGAATACG
TTCCCGGGCCTTGTACACACCGCCCGTCACACGCCGAAAGTCGATAACACCCGAAGTCAGTGGCCCAACCCTTTAGGGAG
GGAGCTGCCGAAGGTGGGATTGGCGATTGGGGTGAAGTCGTAACAAGGTAGCCGTACCGGAAGGTGCGGCTGGATCACCT
CCTTTCT

I have a detailed document with the steps I took to create the custom database as well if that would be helpful.
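For anyone comparing their own files with the excerpts above, a minimal sketch of a consistency check follows. It assumes the ref2taxid file holds the sequence ID in its first whitespace-separated column and that each FASTA header starts with that same ID. The function name `missing_refs` is hypothetical and not part of wf-16s.

```shell
# missing_refs FASTA REF2TAXID   (hypothetical helper, not part of wf-16s)
# Prints every FASTA record ID that has no row in the ref2taxid file.
# Assumes a non-empty ref2taxid file with the sequence ID in column 1.
missing_refs() {
    awk '
        NR == FNR { seen[$1] = 1; next }   # first file: remember each ref2taxid ID
        /^>/ {                             # second file: inspect FASTA headers only
            id = substr($1, 2)             # drop the leading ">"
            if (!(id in seen)) print id
        }
    ' "$2" "$1"
}
```

An empty result means every reference sequence has a taxid assigned. Any IDs that are printed are sequences the workflow could not link to the taxonomy.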

nggvs commented 2 months ago

Thank you very much! Where do these taxids come from (e.g. FLASV1.1417, taxid 895459642)? Are they from NCBI?

jadeaver commented 2 months ago

No, they are from the MiDAS database (https://www.midasfieldguide.org/guide/downloads), which is a curated 16S reference database for wastewater microbiomes. I downloaded the Qiime.fa and QIIME.txt files. I reformatted the Qiime.txt file to include the column headers id, kingdom, phylum, etc., and used taxonkit to create the taxdump files. I used the Qiime.fa file to make the minimap database, and I used the taxid.map that was created alongside the taxdump files as the ref2taxid file.

nggvs commented 2 months ago

Hi! I've been investigating this issue, and for me it looks normal (although I'm using a different dataset). The only time I see the same problem as you (all sequences "unknown" in the table, but the sunburst populated) is when I don't remove the ';' from the taxonomy names. Could you please try a quick check to see whether something else is going on?

csvtk space2tab QIIME.txt_MiDAS_5.3.txt > QIIME.txt_MiDAS_5.3.tsv # convert spaces to tabs
sed -i -r 's/[;]+//g' QIIME.txt_MiDAS_5.3.tsv # remove every ';' from the taxonomy names (GNU sed)

sed -i '1i id\tsuperkingdom\tphylum\tclass\torder\tfamily\tgenus\tspecies' QIIME.txt_MiDAS_5.3.tsv # add the header row (GNU sed)

taxonkit create-taxdump -A 1 QIIME.txt_MiDAS_5.3.tsv -O MiDAS_5.3.taxdump # build the taxdump, taking the accession from column 1

Then run the workflow with:

--reference ~/databases/MIDAS/QIIME.fa_MiDAS_5.3.fa --ref2taxid ~/databases/MIDAS/MiDAS_5.3.taxdump/taxid.map --taxonomy ~/databases/MIDAS/MiDAS_5.3.taxdump/

and check whether the results make more sense to you.

Thank you very much in advance

jadeaver commented 2 months ago

Thank you for looking into this and for these suggestions! I remade the custom database using your suggested commands (modified slightly because I am using macOS, not Linux). The workflow ran successfully and output the abundance tables in addition to the other figures. Whoo!
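The exact macOS modifications are not given in this thread. The suggested commands rely on GNU sed behaviour: BSD sed on macOS requires a backup-suffix argument after `-i`, and the one-line `1i text` form is a GNU extension. One portable sketch, assuming the input is already tab-separated by the csvtk step, writes to a new file instead of editing in place. The function name `prepare_taxonomy_tsv` is hypothetical.

```shell
# prepare_taxonomy_tsv INPUT OUTPUT   (hypothetical helper)
# Assumes INPUT is already tab-separated: sequence ID followed by seven ranks.
# Writes the header row, then INPUT with every ';' removed, to OUTPUT.
prepare_taxonomy_tsv() {
    {
        # printf interprets \t and \n itself, so this works in any POSIX shell
        printf 'id\tsuperkingdom\tphylum\tclass\torder\tfamily\tgenus\tspecies\n'
        # no -i and no -r: plain substitution behaves the same in GNU and BSD sed
        sed 's/;//g' "$1"
    } > "$2"
}

# Example, using the file names from the commands above:
# prepare_taxonomy_tsv QIIME.txt_MiDAS_5.3.tsv MiDAS_5.3.clean.tsv
```

Writing to a separate output file avoids the differences in `-i` handling and leaves the original file untouched for comparison.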

A note in case anyone else runs into this issue - I don't think it was the ; in my case. I had removed them during my initial attempt at creating my custom database. My best guess is that something went wrong when I converted the txt file to a tsv file and the formatting was off.
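If the txt-to-tsv conversion is the suspect, a quick check is to count the tab-separated fields on every line before running taxonkit. The sketch below assumes eight columns (the sequence ID plus seven ranks, matching the header used earlier in this thread). The function name `check_tsv_columns` is hypothetical.

```shell
# check_tsv_columns FILE EXPECTED   (hypothetical helper)
# Prints each line whose tab-separated field count differs from EXPECTED
# and returns a non-zero status if any such line exists.
check_tsv_columns() {
    awk -F '\t' -v n="$2" '
        NF != n { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Example, using the file name from the commands above:
# check_tsv_columns QIIME.txt_MiDAS_5.3.tsv 8
```

A line that reports a single field usually means it is still space-separated. More than the expected number suggests that a name containing spaces was split across columns.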

Thank you again for your time @nggvs!