MMARINeDNA / metabarcoding_QAQC_pipeline

Pipeline scripts for initial quality control, ASV assignment, taxonomic classification, and preliminary data visualization

Preliminary analysis fails #26

Open nvpatin opened 9 months ago

nvpatin commented 9 months ago

Hi team! I think I'm at a point where I can start posting issues here rather than individually emailing people, particularly because I think some problems will be widespread once others try to use the pipeline.

Although my "final_data" folder contains all the output files described in the wiki, the "analysis_output" folder is always empty. The wiki says for the preliminary analysis: "This is a quarto file that will take in the output files from DADA2 and create plots and statistics regarding read retention, read lengths, quality, and more." I think this is supposed to be an HTML output? In any case it would be great to get that final step working.

I think two issues might be preventing the preliminary analysis: 1) read file names that are different from the required format and 2) a metadata sheet that is different from the required format.

1) Although I always try to rename the fastq files according to the formula, it's possible something is off. Here is an example of a read pair that I've renamed: MFU-FISH_001-d1-1_S1_L001_R1_001.fastq.gz and MFU-FISH_001-d1-1_S1_L001_R2_001.fastq.gz. Simplifying the requirements for raw read file names would be a huge help; I'm always nervous about renaming any raw data.

2) The sample sheets we get from our sequencing center are quite different from the example sheet for the pipeline. I've uploaded an example raw file ("SampleSheet.csv") as well as a modified sheet that I made manually to try to match the example sheet ("SampleSheetUsed.csv"). In the long run, this is a big pain in the butt! Maybe we can simplify the metadata sheet requirements? I suspect something about the sample sheet is preventing the preliminary analysis but I'm not sure. In the last step of the pipeline I get an error that says "Run name not found."

Attachments: SampleSheet.csv https://github.com/MMARINeDNA/metabarcoding_QAQC_pipeline/files/12875927/SampleSheet.csv and SampleSheetUsed.csv https://github.com/MMARINeDNA/metabarcoding_QAQC_pipeline/files/12875930/SampleSheetUsed.csv
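One quick way to narrow down a sample sheet mismatch is to compare the column headers of the modified sheet against the pipeline's example sheet. The following is only a minimal sketch: the function name and the column names in the demo are hypothetical, not the pipeline's actual required fields, and it assumes the first row of each file is the column header (any preamble lines in a raw sheet would need to be skipped first).

```python
import csv
import io


def compare_headers(expected_csv: str, actual_csv: str) -> dict:
    """Compare the header rows (first rows) of two CSV texts.

    Returns the column names that are missing from, or extra in,
    the actual sheet relative to the expected one.
    """
    expected = next(csv.reader(io.StringIO(expected_csv)))
    actual = next(csv.reader(io.StringIO(actual_csv)))
    expected_set = {name.strip() for name in expected}
    actual_set = {name.strip() for name in actual}
    return {
        "missing": sorted(expected_set - actual_set),
        "extra": sorted(actual_set - expected_set),
    }


# Hypothetical headers, for illustration only; substitute the real
# contents of the example sheet and of SampleSheetUsed.csv.
example_sheet = "Sample_ID,Run_Name,Primer\n"
my_sheet = "Sample_ID,Primer,Index\n"
report = compare_headers(example_sheet, my_sheet)
print(report)
```

A non-empty "missing" list would point to a column the report expects but cannot find, which is one plausible (unconfirmed) source of an error like "Run name not found."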

avancise commented 9 months ago

Hey Nastassia,

Nice job getting to this point, that's awesome! The requirements for the html report are fairly specific to the MURI Mod 3 naming conventions and sample sheet, you're right. I think your two options from this point are:

1) Adjust your sample sheet and sample names to match the format used by Mod 3. If you want to go this route and are having trouble matching these formats, we can have a chat about how to do that. As long as you're using the fields in the sample naming conventions section of the wiki (https://github.com/MMARINeDNA/metabarcoding_QAQC_pipeline/wiki/Preparing-Your-Data#sample-naming-conventions), with no extra or missing fields, you should be good. It looks like you're using "MFU-FISH" as the primer name, which won't match our metadata sheets that call MiFish "MFU", so you'll likely need to change that to "MFU".

2) Download the .Rmd file that makes the html report and adjust the code there to fit the format of your sample names and sample sheet. That might require a bit more up-front work on your part right now, but down the line it would likely require less adjustment of sample names and sample sheets.
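The primer-name mismatch described in option 1 can be checked before running the pipeline. This is a rough sketch, assuming the primer name is the text before the first underscore in the fastq file name (as in the example file from this thread). The helper names and the set of accepted primers are hypothetical; the wiki's naming conventions remain the authority on the full format.

```python
def primer_field(fastq_name: str) -> str:
    """Return the text before the first underscore in a fastq file name."""
    return fastq_name.split("_", 1)[0]


def primer_matches(fastq_name: str, known_primers: set) -> bool:
    """Check whether the leading field is one of the accepted primer names."""
    return primer_field(fastq_name) in known_primers


# Assumption: "MFU" is the only primer name taken from this thread;
# a real check would list every primer used in the metadata sheets.
known = {"MFU"}

# File name as posted in the issue: the leading field is "MFU-FISH".
print(primer_matches("MFU-FISH_001-d1-1_S1_L001_R1_001.fastq.gz", known))

# Hypothetical renamed file with "MFU" as the leading field.
print(primer_matches("MFU_001-d1-1_S1_L001_R1_001.fastq.gz", known))
```

The first call reports a mismatch and the second a match, which is the adjustment suggested above.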

Cheers, Amy

<)))>< <)))>< <)))>< <)))>< <)))>< <)))>< <)))>< <)))>< Amy M. Van Cise, Ph.D. (she/her/hers)

Assistant Professor Whale and Dolphin Ecology Lab http://amyvancise.com University of Washington | School of Aquatic and Fisheries Sciences 1122 NE Boat St, Box 355020 Seattle, WA 98105 Office: SAFS 216B 206-221-6118



nvpatin commented 8 months ago

Thanks Amy! I can definitely work on Option 2 to edit the .Rmd file. For what it's worth, I tried changing the file names again to match the formula 100%, but that didn't help so it must be about the metadata sheet. I also think there might be a missing R module in the Docker image; see below for the full error message from my most recent analysis.

In the long run, if this is a tool we want to disseminate to other labs or scientists, I think it will be important to incorporate more flexibility in the file names and formats. I may be able to help with some of that with my .Rmd edits. Will keep everyone posted.

pipeline error message:

[1] "Starting Taxonomy Assignment at 2023-10-12 21:21:23.264674"
Finished processing reference fasta.
[1] "Finished Taxonomy Assignment at 2023-10-12 21:22:17.561918 ."
Warning messages:
1: In grSoftVersion() : unable to load shared object '/usr/local/lib/R/modules//R_X11.so': libXt.so.6: cannot open shared object file: No such file or directory
2: In min(which(window_values < primer.data$F_qual[i])) : no non-missing arguments to min; returning Inf
3: In min(which(window_values < primer.data$F_qual[i])) : no non-missing arguments to min; returning Inf
4: In min(which(window_values < primer.data$F_qual[i])) : no non-missing arguments to min; returning Inf
5: In min(which(window_values < primer.data$F_qual[i])) : no non-missing arguments to min; returning Inf
6: In min(which(window_values < primer.data$R_qual[i])) : no non-missing arguments to min; returning Inf
7: Using all_of() outside of a selecting function was deprecated in tidyselect 1.2.0. ℹ See details at https://tidyselect.r-lib.org/reference/faq-selection-context.html
finished step 1. 21:22:17
starting step 3: making the stats file... 21:22:20
finished step 3. 21:22:23
metabarcoding pipeline complete! 21:22:25

avancise commented 8 months ago

Thanks Nastassia! Keep us posted on how it goes.

I don't see an error message in what you copied, just a couple warnings that seem to be referring to missing values in your window_values vector, which is generated early in the dada2 pipeline. Warning messages generally do not cause code to stop running, so there may be something else happening.
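For context on warnings 2 through 6: in R, which() returns an empty vector when no element meets the condition, and min() of an empty vector returns Inf with the "no non-missing arguments to min" warning. The Python sketch below mimics that behaviour for a sliding-window quality check. The function name and the quality values are made up for illustration, and how the pipeline handles an Inf result downstream is not shown in this thread.

```python
import math


def first_window_below(window_values, threshold):
    """Return the 1-based index of the first window whose value is below
    the threshold, or math.inf when no window falls below it.

    This mirrors R's min(which(window_values < threshold)), which yields
    Inf (with a warning) when which() finds nothing.
    """
    hits = [i for i, v in enumerate(window_values, start=1) if v < threshold]
    return min(hits, default=math.inf)


# Hypothetical mean quality scores per window.
print(first_window_below([38, 36, 29, 25], 30))  # third window drops below 30
print(first_window_below([38, 37, 36, 35], 30))  # no window drops below 30
```

Under this reading, the warnings would mean that read quality never dropped below the configured threshold for those primers, which is consistent with them being warnings rather than the cause of a failed step.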

Just to clarify on future plans - as of now, we don't have plans to formally disseminate this to other labs. That could change in the future, but the decision would be up to Ryan. Rather, this pipeline has been developed for use by MURI, but we make it publicly available so that folks who would like to use it can (at their own risk, and their own responsibility). The pipeline is fully based on previously published metabarcoding QAQC pipelines, and I think it's in the best interest of other labs to build their own pipelines using the existing resources, so that they can be sure that they understand the various steps in their pipelines and that they're optimized to fit the specific needs of that lab/project.


nvpatin commented 8 months ago

Right, sorry, it's not really an error message but I was wondering if there might be a link between the warnings and the failure to run the final step. Most likely it's due to the metadata sheet though so I'll try solving that first.

Re: pipeline availability, got it, thanks for clarifying. I'll just see what I can do to get it working for our sequence data files.

nvpatin commented 8 months ago

Just to confirm, is it the "Report_MURI_Module3.qmd" file that generates the preliminary analyses? I can't find a .Rmd file in the file system.

avancise commented 8 months ago

Yup! qmd and Rmd files are interchangeable - qmd just refers to quarto, the newer, spiffier version of Rmarkdown. But both read both.
