dbeisser / Natrix2

Open-source bioinformatics pipeline for the preprocessing of raw amplicon sequencing / metabarcoding data.
MIT License

Read correction error. Nanopore #4

Closed: timyerg closed this issue 9 months ago

timyerg commented 9 months ago

It's me again, unfortunately, with a new issue.

I am running Nanopore 16S data. My reads are already assembled and the primers removed. As advised here, I switched to the dev repo and disabled pychopper.

After running for a while, an error was encountered. I thought it might be caused by insufficient RAM and increased it from 120 to 256, but after relaunching the same project the same error arose.

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   all
    167 cd_hit
    171 cdhit
    1   clust_merge_results_vsearch
    171 cluster_sorting
    171 counts_minimap
    167 fastqc
    1   filtering
    1   generate_otu_fasta
    171 medaka_polishing
    1   merge_mumu_mothur_output
    1   merge_output
    170 minimap_align
    170 minimap_align_2
    171 minimap_align_3
    171 minimap_align_4
    171 minimap_medaka
    1   mothur_classify
    1   multiqc
    170 racon_polishing
    171 racon_polishing_2
    171 racon_polishing_3
    171 racon_polishing_4
    1   run_mumu
    1   unfiltered_table
    171 vsearch_chim
    1   vsearch_clust
    1   vsearch_header
    1   vsearch_otu
    1   write_clusters_uc
    1   write_fasta
    2740
Select jobs to execute...

[Mon Feb 26 17:41:15 2024]
rule racon_polishing_2:
    input: Nanopore_results/fasta/C03F3_A_R1.fasta, Nanopore_results/read_correction/minimap/C03F3_A_R1_align_2.sam, Nanopore_results/read_correction/racon/C03F3_A_R1_racon_1.fasta
    output: Nanopore_results/read_correction/racon/C03F3_A_R1_racon_2.tmp, Nanopore_results/read_correction/racon/C03F3_A_R1_racon_2.fasta
    jobid: 385
    wildcards: sample=C03F3, unit=A
    threads: 12

Activating conda environment: /Natrix2/.snakemake/conda/09de290b
[racon::Polisher::initialize] loaded target sequences 0.002148 s
[racon::Polisher::initialize] error: duplicate sequence 50b42c65-75ef-48f3-928f-6e5bb26b4741 with unequal data
[Mon Feb 26 17:41:19 2024]
Error in rule racon_polishing_2:
    jobid: 385
    output: Nanopore_results/read_correction/racon/C03F3_A_R1_racon_2.tmp, Nanopore_results/read_correction/racon/C03F3_A_R1_racon_2.fasta
    conda-env: /Natrix2/.snakemake/conda/09de290b
    shell:

                    racon  -m 8 -x -6 -g -8 -w 500 -t 12 Nanopore_results/fasta/C03F3_A_R1.fasta Nanopore_results/read_correction/minimap/C03F3_A_R1_align_2.sam Nanopore_results/read_correction/racon/C03F3_A_R1_racon_1.fasta > Nanopore_results/read_correction/racon/C03F3_A_R1_racon_2.tmp;
                    sed "s/[>].*[^|]|/>/" Nanopore_results/read_correction/racon/C03F3_A_R1_racon_2.tmp > Nanopore_results/read_correction/racon/C03F3_A_R1_racon_2.fasta

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job racon_polishing_2 since they might be corrupted:
Nanopore_results/read_correction/racon/C03F3_A_R1_racon_2.tmp
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Natrix2/.snakemake/log/2024-02-26T174111.290794.snakemake.log

As discussed here, the issue is related to duplicate IDs, but I don't know how to fix it in this case.
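
For reference, one way to confirm the duplication outside the pipeline (a hedged sketch, not part of Natrix2; the paths are the ones from the log above) is to look for read IDs that occur more than once across the two FASTA files racon receives:

    # List IDs that appear in both the per-sample FASTA and the previous racon
    # output; racon aborts when the same ID maps to different sequence data
    # ("duplicate sequence ... with unequal data").
    cat Nanopore_results/fasta/C03F3_A_R1.fasta \
        Nanopore_results/read_correction/racon/C03F3_A_R1_racon_1.fasta \
        | grep '^>' | cut -d' ' -f1 | sort | uniq -d | head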

I hope that it is OK that I created two issues in one day.

timyerg commented 9 months ago

I think I fixed the issue by modifying the rules. I will wait and see how far I get this time. Can I fork the repo and submit the changes as a pull request? Or I can post the modified file here once I make sure that all the fixes work as intended.

adeep619 commented 9 months ago

Hi there,

We tested the changes on our dataset and they are working well. Could you please provide us with some additional information about your headers and base caller data? This will help us determine if the issue is specific to your dataset. Additionally, could you please share the details of the changes you have made?

Thank you, Aman

timyerg commented 9 months ago

That means I am lucky to have a dataset that can cause some issues! As always =). I added one line to each iteration that renames the IDs in the racon output, since the error was caused by the same IDs appearing in the fasta files and in the racon outputs during the next minimap run.

            shell:
                """
                    racon  -m 8 -x -6 -g -8 -w 500 -t {threads} {input[0]} {input[1]} {input[2]} > {output.tmp};
                    sed -i -e 's/>/>polished3/g' {output.tmp};
                    sed 's/[>].*[^|]|/>/' {output.tmp} > {output.final}
                """

After fixing it like this, some samples have already progressed to medaka with the following IDs:

>polished4polished3polished2polished18fc9f2bb-eb3f-4eef-954b-4b25ba6535bb

From the person whose dataset I am working with, I got the following info:

primers and adapters are already trimmed, and the only quality-based filtering performed was in the MinION during basecalling in HAC (high-accuracy) mode (kept Nanopore reads with Q > 8).

After receiving the files, I ran chopper:

gunzip -c $gz | chopper -q 10 -l 1200 --threads {THREADS} | gzip > $CR/$file
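
For context, the same call with the flags spelled out (the file names and thread count here are placeholders, not the exact values used):

    # chopper (from the NanoPack suite): -q sets the minimum average read quality,
    # -l the minimum read length; reads failing either cutoff are dropped.
    gunzip -c sample.fastq.gz \
        | chopper -q 10 -l 1200 --threads 4 \
        | gzip > corrected_reads/sample.fastq.gz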

Here is an example from the original fastq file I got:

@db01deda-dc12-4612-af5e-c2a00ac9193d runid=d16a83d0f3e4f48e46f5ca136bf647f57d0f5927 sampleid=no_sample read=26672 ch=425 start_time=2023-06-07T11:21:32Z model_version_id=2021-05-17_dna_r9.4.1_minion_384_d37a2ab9 barcode=barcode06
GATGAACGCCGGCGGTGTGCTAATACATGCAAGTCGAACGCGTTGGCCCAATTGATTGATGGTGCTTTACCTGATTGATTTTTGGTCGCCAGCGAAGGTGGCGGACGGGTGGTAACACGTAGATGGCCCTGCCCAGAAGCGGGGGACAACATTTGGAAACAGATGCTAATACCGCATAACAGCGTTGTTCGCATGAACAACGCTTCTGCTTCTCGCTATCACTTCTGGATGGACCTGCGGTCATTAGCTTGTTGGTGGGGTAACGGCCTACCAAGGCGATGATGCATAGCCGAGTTGAAAACTGATCGGCCACAATGGGACTGAGACACGGCCCATACTCCTACGGGAGGCAGCAGTAGGGAATCTTCTACAATGGGCGCAAGCCTGATGAACAACACCGCGTGAGTGAAGAAGGTTTTCGGCTCGTAAAGCTCTGTTGTTAAAGAAGAACACGTATAGAGTAACTGTTCATACGTTGACGGTATTTAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCTTACTTATCCGGATTTATTGGGCGTAAAGAGAGTGCAGGCGGTTTTCTAAGTCTGATGAAAAGCCTTCGGTAACCGGAGAAGTGCATCGGAAACTGGATAACTTGAGTGCAGAAGGGTAAATTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATATGGAAGAACACCAGTGGCGAAGGCAACGGCTACCTGGTCTGCAACTGACGCTGAGACTCAACATGGAGTAGCGAACAGGATTAGATACCCCTGTCCATGCCGTAAACGATGAGTGCTAGGTGTTGGAGGTTTCCGCCCTTCAGTGCCGGGCTAACGCATTCGCCTCCGCCTGGGGGTACGACTGCAAGTTCAAACTCCAAAGAAGAGATTCGACGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCTACGCGAAAAAGTGCCCGCCTGACATCTTGCGCAGCAGAGATGGGGCGTTTCAGGAACGCGCAATGACGGTGGTGCATGGTCGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAGCAGCACAACCCTTGTTACTAGTTGCCAGCATTAAGTTGGGCACTCTAGAGAGACCGCCGGTGACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGACCTGGGCTACACACGTGCTGCAATGGGCAGTACAACGAGAAGCGAACCCGCGAGGGTAGCGGATCTCTTAAAGCTGTTCTCAGTTCGGACTGCAGGCTGCAACTCGCCTGCACGAAGCTGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGCGCGCACACCGCCCCGTCACACCATGGAAGTCTGCAATGCCCAAAGTCGGTGGGATAACCTTTGCCGTCAATAGCCGCCTAAGGTAAGAATTGACTGGGGTG
+
-.@;:::8:97551111940+++)))*-255568:55667<;;=>>==<==;>@A@<)(&'(*+++/.144554435<<=9644434(''&(+((****''*,,13.)()**.012--,1/*&&&%*,(,--,),--000***078::-,,-<=>0-154447>?@<?>>?;;9;:9::>6,,,,<968:11//02?@<100/.'&&'))*,,-0/../37789<5598;;54435/600/(''40//,(&')*156>BA=>A=A??=:(((();>:;::::?/...99=<<9999:60-((((8;?;411++,.8;??A<5454676:569:767668==:<;259:854400--//.-20:9;987,***-<=>9966:54464720)''''23578<?<=<65579,/0+*))-17899:79--.../&&'+/122:C@B:</0+.,**,,9<<%$$'%(*+34>?>8888?<<=EA;::5443333?++*+///1(''.5443588;8843,,,,,,-002=7370/..(&%%&()(''''&&,-.70+-,-44@@A::9:>>><88/.-+,.9;;;<<?>;@;?32<.7<21//($$$$$%&')+'%$%(.1.01--)))))>>@//688679?A97.-*+,,-9:<;988**((+)'())*2456;7&&&%+-12433459;?<<@((&&%&)..1{3:=A<69854785429::7841.+(%%%&,-565585563347;??>?>==;9)((+36'&&''7:.-++888778/..-123::;0///5669=:6943304447:,++,,56:<;:869:99988:9);====<:2,+*))*/695500((/9<=222'&&%&'5.89:;654*%%()+-656/('%%'%%&(*+&')-10,)*('&'')*+-,-23:21324;956698:::8854***+<==<;:=<43532226''''-..++*028,.0-+,/24455A=<11../'&&*')*,,34546..-+,,:97726886899(&&&'13677<=>><;:338456790.-+*('()+12=@??=:::<42423,,--**(''*../01***76967+)))*&&%%%%&56641444><<---0;444{858114::78800.((.48235646978:;<98///>0/58<787,,++)(()))*0212'322397768{:34;;:+**,8::=>=<>>?;333..-,/4+*))*06>=>>:0//''225;<9;{D4434525677..,..6:><<;;;9:9:<<;<A:::9;<<<>>::22--/-----,6<>:999=<=@46867;><6667;=?3?<:998520001336500/0*+-)'&))/+632048=33-,,((((-232:0((('9><><??,.:><66669=>==79775*))$$$&()))-05:;:<0,,..*'&$%&'),+,//100

I am not sure that the pipeline will be able to finish properly after renaming the IDs, so I am waiting for my samples to finish to see whether I get OTU counts. Probably additional renaming back to the original IDs will be needed after medaka.
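
If that renaming does turn out to be necessary, something along these lines might do it (a sketch only, not part of the attached file; it assumes the prefixes always have the form "polished" plus a single digit, and medaka_consensus.fasta is a placeholder name):

    # Strip the stacked polishing prefixes (e.g. polished4polished3polished2polished1)
    # from the FASTA headers to restore the original read IDs.
    sed -E 's/^>(polished[0-9])+/>/' medaka_consensus.fasta > medaka_consensus_renamed.fasta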

I attached the modified file.

read_correction.zip

BTW, should this warning be taken care of? WARNING: Output Nanopore_results/read_correction/medaka/C34F3_A_R1 already exists, may use old results.

timyerg commented 9 months ago

Thank you again for the tool. I like the output after the Mumu reclustering. The run completed successfully, and renaming the IDs to solve this issue did not affect the other steps. The only thing is that the taxonomy (Silva with Mothur) was not added to the final outputs, although it is present in the corresponding folder. I checked whether the renamed IDs are involved, but as far as I can see they are no longer in play at this step, so I am not sure what the reason is. It is not important for me since I planned to reclassify outside Natrix2 anyway. Closing the issue.

Best,

adeep619 commented 9 months ago

Hi there, thank you so much for your comments. We have resolved the issue with the taxonomy in the final output and will push the changes to the main pipeline soon. Your feedback helps us improve the pipeline, so thank you for sharing it with us.

timyerg commented 9 months ago

Glad to help! One more question: I read the paper but did not find the answer. Do you think species-level annotations are possible for long 16S Nanopore reads, or is it better to work at the genus level only (even if species were annotated)?

DanaBlu commented 9 months ago

Hello,

If you're considering taxonomic classification with mothur, I would advise against working at the species level. Mothur itself discourages this practice, as outlined on its official webpage (https://mothur.org/wiki/classify.seqs/#common-questions).

"How do you recommend classifying to the species level? Unfortunately I do not. You will never get species level classification if you are using the RDP or Silva references. They only go to the genus level. Even the greengenes database only has 10% or so of sequences with species level names (greengenes hasn’t been updated in quite a few years). I and many others would contend that using 16S and especially a fragment to get a species name is asking too much - you need a culture and genome sequencing to do that. If someone wanted to give it a shot, they would need to add the species level names to the taxonomy strings. Also, they would need to add many more sequences that represent each species. Outside of a few groups of bacteria where the researchers have carefully selected the region (e.g. Lactobacillus or Staphylococcus), I really think this would be a lot of work for little/no benefit."

If your objective is to work at the species level, an alternative approach could be to employ taxonomic classification via BLAST, setting the 'ident' parameter to your desired minimal identity overlap between the target and query sequences. This threshold can be customized in the options of https://github.com/dbeisser/Natrix2/blob/dev/Nanopore.yaml.
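
As an illustration of what that identity threshold corresponds to (this is not the exact Natrix2 invocation; within the pipeline the value is set through the options in Nanopore.yaml, and the file and database names below are placeholders), a standalone blastn call would express it with -perc_identity:

    # Report only hits with at least 97% identity between query and subject,
    # in tabular format.
    blastn -query otu_representatives.fasta -db silva_blast_db \
        -perc_identity 97 -outfmt 6 -out otu_taxonomy_hits.tsv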

Feel free to reach out if you have any further questions!

Best, Dana

timyerg commented 9 months ago

Thank you for the detailed reply. I expected as much, but decided to ask here just in case of a miracle.

Best, Timur