dbeisser / Natrix2

Open-source bioinformatics pipeline for the preprocessing of raw amplicon sequencing / metabarcoding data.
MIT License

example data folder #5

Closed. katfiishn closed this issue 6 months ago.

katfiishn commented 7 months ago

I am attempting to manually run Natrix - just wondering where I can access the example_data folder, i.e. the example_data.csv, example_data.yaml and the example amplicon dataset.

Thanks!

adeep619 commented 7 months ago

Hello,

Thank you for using our workflow :). Please use the Illumina data and either the Illumina_swarm.yaml or Illumina_vsearch.yaml along with Illumina.csv. If you are working with nanopore, use files and folders with the prefix "nanopore". Additionally, please use the dev branch as it contains all the latest changes.

We will update the README as soon as possible. Feel free to reach out with any further queries. Best, Aman

katfiishn commented 7 months ago

Ahh that makes total sense! Thanks Aman :)

katfiishn commented 7 months ago

Hi Aman,

I was also wondering if you have an example Nanopore.yaml for data that has not yet been demultiplexed? e.g. the yaml file corresponding to the Nanopore 18S V9.csv?

From my understanding of your paper, the Natrix2 workflow for Nanopore data is Pychopper (demultiplexing) > CD-HIT (clustering) > Medaka + Racon (error correction) > Minimap2 > Vsearch (chimera removal) > taxonomic classification (e.g. BLAST). However, interpreting the .yaml file is a bit confusing, given that many of the listed parameters are (I think) for Illumina processing. I am unsure, for example, where the Pychopper parameters are in the .yaml file.

Thanks

DanaBlu commented 7 months ago

Hi @katfiishn,

Unfortunately, our pipeline does not currently support demultiplexing for Nanopore data. I recommend employing Nanopore-specific demultiplexing tools, such as Guppy, before proceeding with our amplicon workflow. If you want to make use of Pychopper for reorientation of your sequences, it's essential to conduct demultiplexing without adapter trimming prior to integrating it with our pipeline. If you don't want to integrate Pychopper, please trim your sequences during demultiplexing.

In Natrix2, Pychopper serves the sole purpose of identifying the forward and reverse primers, ensuring a uniform orientation for all reads.
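To illustrate what "uniform orientation" means here, a toy sketch in Python (note this is only illustrative: Pychopper locates primers with probabilistic alignment, not the exact string matching used below):

```python
# Translation table for reverse-complementing a DNA read.
COMP = str.maketrans("ACGTacgt", "TGCAtgca")

def orient(read, fwd_primer):
    """Return the read in forward orientation: if the forward primer is
    found only in the reverse complement, flip the read."""
    if fwd_primer in read:
        return read
    rc = read.translate(COMP)[::-1]
    if fwd_primer in rc:
        return rc
    return read  # primer not found; leave the read untouched
```

After such a step, every read carries the forward primer on the same strand, which is what the downstream clustering assumes.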

For an enhanced Nanopore experience, I suggest utilizing the "dev" branch of our pipeline. This version includes updates specifically tailored for Nanopore data, optimizing the workflow. Additionally, the configuration files have been updated to provide clearer and more intuitive Nanopore- and Illumina-specific options.

Feel free to reach out if you have any further questions or require additional clarification.

Best, Dana

DanaBlu commented 7 months ago

Please have a look at the Nanopore.yaml configuration file within the 'dev' branch for a more detailed understanding of the parameters specific to Nanopore or Illumina. Please refer to the following link: Nanopore.yaml - dev branch

katfiishn commented 6 months ago

Hi Dana, I ran the pipeline successfully on my Nanopore dataset using the dev version. However, I noticed a critical error in the clustering output:

The vsearch_uc.txt file does not match the vsearch_clusters_names.txt, nor the subsequent output in vsearch_table.csv. From my investigation, the vsearch clustering (vsearch_uc.txt) is correct in how it has clustered the sequences (I specified 96% clustering); however, I think there is something wrong with the write_clusters_uc script, as it has not clustered the correct sequences from the vsearch_uc.txt file.

I have attached both txt files. For example, on Line 5 of vsearch_uc.txt, vsearch indicates that 2879;size=8 and 5022;size=91 should be clustered together (at 98.6% similarity), with 5022;size=91 being the centroid. However, on Line 4 of the vsearch_clusters_names.txt file, 2879;size=8 has been incorrectly grouped with 5;size=26. As such, their count data has been incorrectly merged in the subsequent vsearch_table.csv. I found this has occurred throughout the dataset, incorrectly merging both the sequences and count data from different species that were never meant to be clustered together according to the vsearch_uc.txt file.

vsearch_uc.txt vsearch_clusters_names.txt
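For anyone wanting to sanity-check the cluster assignments independently of the pipeline's scripts: the .uc format written by vsearch is tab-separated with ten fields, where the record type is in column 1, the query label in column 9, and the centroid label in column 10. A minimal Python sketch to extract the member-to-centroid mapping (labels below reuse the examples from this comment):

```python
def parse_uc(path):
    """Map each sequence label to its cluster centroid from a vsearch .uc file."""
    centroid_of = {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            rec_type, query, target = fields[0], fields[8], fields[9]
            if rec_type == "S":    # seed record: the sequence is its own centroid
                centroid_of[query] = query
            elif rec_type == "H":  # hit record: query belongs to target's cluster
                centroid_of[query] = target
            # "C" cluster-summary records carry no extra assignments
    return centroid_of
```

Comparing this mapping against vsearch_clusters_names.txt would show exactly which members were mis-assigned, e.g. `centroid_of["2879;size=8"]` should be `"5022;size=91"`.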

DanaBlu commented 6 months ago

Hi @katfiishn,

Thank you for your feedback! We always appreciate it when people bring issues to our attention.

I've made the necessary changes to the corresponding scripts (merge_clust_results.py and vsearch_uc.py) in the dev branch. Please replace these scripts accordingly. To avoid rerunning the entire pipeline, it's sufficient to delete the "finalData," "mothur," and "clustering" folders in the output directory. This way, only the clustering and subsequent steps will be repeated.
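For convenience, the cleanup step can be scripted; the helper below is a sketch (the output directory path is whatever you configured, not necessarily "out"):

```python
import pathlib
import shutil

def clean_clustering_outputs(outdir):
    """Delete the folders whose absence makes the pipeline repeat only the
    clustering and downstream steps on the next run."""
    for sub in ("finalData", "mothur", "clustering"):
        shutil.rmtree(pathlib.Path(outdir) / sub, ignore_errors=True)
```

Everything else in the output directory is left in place, so earlier preprocessing results are reused.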

Best, Dana

katfiishn commented 6 months ago

Hi Dana,

I re-ran it with your new scripts. The vsearch_clusters_names.txt seems to have created duplicates (see line 4 + 5 of attached file) and is still incorrectly clustering based on the results of the vsearch_uc.txt. The pipeline was not able to produce the vsearch_table.csv with the errors in the vsearch_clusters_names.txt.

vsearch_clusters_names.txt vsearch_uc.txt

Thanks for your help!

DanaBlu commented 6 months ago

Hi @katfiishn,

Thank you for your patience and your feedback! I've made some changes to the scripts. Now, the vsearch_all_otus_tab.txt file is utilized for creating both vsearch_clusters_names.txt and vsearch_table.csv, replacing the previous use of the vsearch_uc.txt file.

Here's a brief overview of the changes made to each script:

vsearch_uc.py: Updated to correctly handle cluster assignments using the vsearch_all_otus_tab.txt file.

merge_clust_results.py: Adjusted to incorporate changes from vsearch_uc.py and ensure compatibility with the updated workflow.

vsearch_clust.smk: Updated input file from vsearch_uc.txt to vsearch_all_otus_tab.txt.
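Assuming vsearch_all_otus_tab.txt follows vsearch's OTU-table layout (a "#OTU ID" header line followed by one row of per-sample counts per OTU; the exact columns Natrix2 writes may differ), a minimal reader sketch looks like:

```python
import csv

def read_otu_table(path):
    """Read a tab-separated OTU table into {otu_id: {sample: count}}."""
    with open(path) as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)
        samples = header[1:]  # first column holds the OTU identifier
        table = {}
        for row in reader:
            otu_id, counts = row[0], [int(c) for c in row[1:]]
            table[otu_id] = dict(zip(samples, counts))
    return table
```

Deriving the cluster names and the final table from this single file avoids the earlier mismatch between vsearch_uc.txt and the downstream outputs.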

Please test these changes with your dataset and provide any feedback or report any issues you encounter. The changes have been tested with our test dataset, so I don't expect any issues with your data.

Looking forward to hearing from you!

Best regards, Dana

katfiishn commented 6 months ago

Hi Dana,

This seems to have worked! Thanks very much!

Just wondering if there is a fasta file that matches the final vsearch_table.csv?

I see there is the vsearch_all_otus.fasta to match the vsearch_all_otus_tab.txt, but that obviously contains all sequences, not the final consensus cluster sequences that are in vsearch_table.csv.

Would I have to enable the mothur component to get that final fasta file?

Thanks for all your help :)

DanaBlu commented 6 months ago

Hi @katfiishn,

In our pipeline, we utilize the sequence of the representative (centroid) OTU for subsequent analysis rather than the consensus sequence of the OTU.

If you intend to work with the consensus sequences outside of the pipeline, you have the option to include the --consout flag in the vsearch cluster rule. While we don't have plans to utilize the consensus sequences in our pipeline, I've created an example version of the rule for your convenience.

vsearch_clust.txt

Please rename the file extension from .txt to .smk after downloading it, as .smk files cannot be attached directly here. Also ensure it still meets the requirements for a Snakemake file.
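For reference, a clustering rule with --consout could be sketched roughly as follows (the rule name, file paths, and identity threshold here are illustrative only, not the actual Natrix2 rule; the vsearch flags themselves are real):

```
rule vsearch_cluster:
    input:
        "results/clustering/vsearch_all_otus.fasta"
    output:
        centroids = "results/clustering/vsearch_centroids.fasta",
        consensus = "results/clustering/vsearch_consensus.fasta",
        uc        = "results/clustering/vsearch_uc.txt"
    params:
        identity = 0.96
    shell:
        "vsearch --cluster_size {input} --id {params.identity} "
        "--centroids {output.centroids} --consout {output.consensus} "
        "--uc {output.uc}"
```

The --consout output then contains one consensus sequence per cluster, alongside the centroid sequences the pipeline itself uses.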

I hope I was able to help you! :)

Jorn-Bethke commented 6 months ago

Hi @katfiishn, I've just started using Natrix, and it seems you were able to use your Nanopore data successfully.

I have pod5 data from a single barcode and I did the basecalling with dorado:

$ dorado basecaller --no-trim --barcode-both-ends --kit-name SQK-16S024 ../dorado_0.5.3/models/dna_r9.4.1_e8_fast@v3.4/ ./barcode04/ | samtools fastq -T '*' - > ./reads/BC04_A_R1.fastq

Could you please guide me through how to successfully perform a run of the Natrix pipeline using those reads as input?

Hope you can help. Best regards, Jorn.

katfiishn commented 6 months ago

Hi Jorn,

As Aman and Dana advised me, you need demultiplexed, un-trimmed fastq files as the input. I can't really advise on the installation process, as I ran it on a supercomputer and we had to make some changes. However, once you have it installed, you place the folder containing your demultiplexed files under the Natrix2 folder and then run the pipeline with the Nanopore.yaml configuration file (you will need to edit its parameters). It took a good day to install the conda environment containing all the dependent software, and the run itself (once I had fixed the problems that came up) took a good 1-2 days with the data size I had. Overall it took 2-3 weeks to get the finalised OTU table and fasta file I was after (thanks Aman and Dana!!).

Cheers!

Jorn-Bethke commented 6 months ago

Thanks, @katfiishn and @DanaBlu for the bits of advice and goodwill. I hope I can get it to run properly.

best, Jorn