claraqin / neonMicrobe

Processing NEON soil microbe marker gene sequence data into ASV tables.
GNU Lesser General Public License v3.0

Instructions for pipeline testing #27

Open claraqin opened 3 years ago

claraqin commented 3 years ago

Migrating this from an email thread for easier viewing.

Scope of Testing

For now, we will be testing the steps in the figure below between "NEON Data Products" and "ASV tables with taxonomy," except that we will not be generating the taxonomy tables because this takes too much processing time.

[Figure: pipeline diagram (screenshot, 2020-10-26)]

Our Technical Working Group has suggested that this testing occur in two phases. In Phase 1, we check that the pipeline runs from start to finish on a variety of operating systems. In Phase 2, we will ask volunteers to read through the docs and suggest ways to make the package more flexible and user-friendly. For now, we are only asking you to conduct Phase 1 testing.

Instructions for Phase 1 Testing

  1. Start by cloning the codebase from https://github.com/claraqin/NEON_soil_microbe_processing.
  2. Install cutadapt if you have not previously done so.
  3. Update the parameters in the "params.R" file, which can be found in the "code" subdirectory.
  4. Download the sequence metadata for testing at this Google Drive link, decompress the zipfile, and drop the contents (two files) into the project directory (the base directory of the repository that you just cloned).
  5. The code for testing can be found in the "testing" subdirectory. This subdirectory contains temporary versions of our vignettes that I made for testing only. Start with the download-neon-data-metadataworkaround.Rmd vignette.
  6. Then move to either the process-its-sequence-to-seqtabs.Rmd or the process-16s-sequence-to-seqtabs.Rmd vignette, depending on which subset of the data you selected.
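Before opening the vignettes, a quick pre-flight check of the setup steps above can save a failed run. The following is only a minimal sketch in standalone Python, not part of the repository: the function name `preflight` is hypothetical, and it assumes that cutadapt is invoked as `cutadapt` from your PATH and that the project layout matches the instructions ("code/params.R" and a "testing" subdirectory).

```python
import shutil
from pathlib import Path


def preflight(project_dir):
    """Report whether the pieces named in the testing instructions are in place.

    Hypothetical helper: it only checks for presence, not for correct contents.
    """
    root = Path(project_dir)
    return {
        # Assumption: the cutadapt executable is called "cutadapt" and is on PATH.
        "cutadapt on PATH": shutil.which("cutadapt") is not None,
        # Paths below are taken from the instructions in this thread.
        "code/params.R present": (root / "code" / "params.R").is_file(),
        "testing/ present": (root / "testing").is_dir(),
    }
```

Calling `preflight(".")` from the base directory of the clone returns a dictionary of True/False values. A False for cutadapt only means it was not found on PATH; it may still be installed elsewhere.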

Reporting Back

If any issues or fatal errors arise, please let me know by replying to me individually (unless of course it seems obvious that it would affect all testers).

Whether you run into a fatal error or are able to complete the pipeline error-free, please report back on this thread and include in your post the output of devtools::session_info().

Current Volunteer Assignments

claraqin commented 3 years ago

Dan Liptzin found another error (or two errors) at the end of the DADA2 portion of "Process 16S Sequences":

column names ‘y.x’, ‘y.y’ are duplicated in the result

Finished processing reads in runB69RF at 2020-11-05 01:55:42

Sequencing run-specific sequence tables can be found in /Users/danielliptzin/Dropbox/NEON/Pipeline_Testing/NEON_soil_microbe_processing-master/NEON/raw_sequence/16S/3_seqtabs

Began processing runB69RN at 2020-11-05 01:55:42

Error in filterAndTrim(fnFs, fnFs.cut, fnRs, fnRs.cut, multithread = multithread,  :

  Paired forward and reverse input files must correspond.

Update 11/10/2020: The first of the above two errors, which was actually a non-fatal warning, has been fixed. I have been unable to reproduce the second error.
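Since the second error could not be reproduced, one thing a tester can check locally is whether every forward read file in the run has a reverse counterpart. The sketch below is a guess at a useful diagnostic, not a statement of what filterAndTrim checks internally. It is standalone Python, the function name `unpaired_reads` is hypothetical, and it assumes that forward and reverse files differ only by an `_R1` / `_R2` tag in the file name, which may not match the naming used in your download.

```python
def unpaired_reads(filenames, fwd_tag="_R1", rev_tag="_R2"):
    """Return (forward-only, reverse-only) sample keys.

    A sample key is the file name with the read tag removed, so a correctly
    paired sample yields the same key from its forward and reverse files.
    Assumption: the tags "_R1" and "_R2" mark forward and reverse reads.
    """
    fwd, rev = set(), set()
    for name in filenames:
        if fwd_tag in name:
            fwd.add(name.replace(fwd_tag, "", 1))
        elif rev_tag in name:
            rev.add(name.replace(rev_tag, "", 1))
    # Keys present on one side only point to a missing partner file.
    return sorted(fwd - rev), sorted(rev - fwd)
```

Passing the file names from the run's raw sequence directory (for example via `os.listdir`) gives two lists; if both are empty, every file has a partner under this naming assumption, and the cause lies elsewhere.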

claraqin commented 3 years ago

EDIT 11/10/2020:

On second thought, I may be getting ahead of myself. @lstanish is currently working on a set of improvements to the first vignette (Download NEON Data) that involve changes to the functions and to the ways we work with the metadata. I think it will make more sense to move on to this second phase of testing after she has completed these. @lstanish, if you think that makes sense, can you please note here when you've finished? I'll try to respond to any questions or concerns as soon as possible.


Hi everyone (especially new testers @kjnaithani and @yyachung),

I had hoped to create a more structured set of instructions for "Phase 2" testing before going on my two-week leave, which begins tomorrow, but I wasn't able to complete them. In short, though, I am asking volunteers to read through the docs and provide suggestions on how to make the package more flexible and user-friendly.

When I say "docs", I am referring to the function documentation and the vignettes, since these will be downloaded as part of the R package.

  1. You can preview some of the documentation that I've written by viewing the generated .Rd files in the folder called "man". You can also simulate looking up the help pages for the functions using the ? command by first pulling the recent changes, running devtools::document() in the project directory, and running something like ?downloadSequenceMetadata. The documentation should appear in the RStudio Help pane as usual. You can also view the raw documentation directly by reading the "roxygen" comments above each function in ./R/utils.R.
  2. To view the vignettes, go to ./testing/, which holds versions of the vignettes with slight modifications for the testing pipeline. These are .Rmd files, so they do not read as nicely as a rendered document. To render them, Knit the .Rmd files into HTML documents or PDFs.

Please note any confusing wording, errors, or suggestions you come across in this Issue thread, unless it is sufficiently confusing that it deserves its own thread.

claraqin commented 3 years ago

@yyachung encountered this error while running the 16S processing pipeline:

I got the download data vignette to work. I’m trying the “process-16s-sequences-to-seqtabs.Rmd” with run C38TW 16s data. I hit an error running the “Process reads” code chunk: “Error in add(bin) : record does not start with '@'”
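The message "record does not start with '@'" suggests that some read in one of the input files is not where a FASTQ header is expected, which could happen with a corrupted, truncated, or non-FASTQ file. To narrow down which file and which line, a minimal standalone sketch in Python follows. The function name `first_bad_record` is hypothetical, and the check assumes plain four-line FASTQ records (no wrapped sequence lines), optionally gzip-compressed.

```python
import gzip


def first_bad_record(path):
    """Return the 1-based line number of the first record header that does
    not start with '@', or None if every header looks valid.

    Assumption: each record is exactly four lines (header, sequence, '+',
    quality), so headers fall on lines 1, 5, 9, ...
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle, start=1):
            if line_no % 4 == 1 and not line.startswith("@"):
                return line_no
    return None
```

Running this over each file in the run's raw sequence directory points to the first offending line, if any. A gzip file that was cut off mid-download may instead raise an exception while being read, which is itself a sign that the file should be downloaded again.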