Scope of Testing

For now, we will be testing the steps in the figure below between "NEON Data Products" and "ASV tables with taxonomy," except that we will not be generating the taxonomy tables because this takes too much processing time.

[Figure: screenshot of the processing pipeline (Screen Shot 2020-10-26 at 12 42 36 PM)]

Our Technical Working Group has suggested that this testing should occur in two phases. In Phase 1, we test whether the pipeline can simply run from start to finish on a variety of operating systems. In Phase 2, we will ask volunteers to read through the docs and provide suggestions on how to make the package more flexible and user-friendly. For now, we are only asking you to conduct Phase 1 testing.

Instructions for Phase 1 Testing

Start by pulling the codebase from https://github.com/claraqin/NEON_soil_microbe_processing.

I recommend using git clone, e.g. "git clone https://github.com/claraqin/NEON_soil_microbe_processing.git"

Install cutadapt if you have not previously done so.

Installation instructions can be found here: https://cutadapt.readthedocs.io/en/stable/installation.html
This is where many people run into issues because of Python dependencies. If you cannot install cutadapt, then ignore the ITS pipeline and test the 16S processing pipeline only. (The 16S pipeline does not require cutadapt.)
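Before editing any parameters, it can help to confirm from R whether cutadapt is reachable at all. The following is a minimal sketch, assuming the executable is named "cutadapt" and is on your system PATH; the helper find_cutadapt() is hypothetical and not part of this repository.

```r
# Hypothetical helper (not part of the repository): look up an executable on the PATH.
# Sys.which() returns an empty string when nothing is found.
find_cutadapt <- function(exe = "cutadapt") {
  path <- unname(Sys.which(exe))
  if (is.na(path) || !nzchar(path)) {
    return(NA_character_)
  }
  path
}

cutadapt_path <- find_cutadapt()

if (is.na(cutadapt_path)) {
  message("cutadapt was not found on the PATH; consider testing the 16S pipeline only.")
} else {
  message("cutadapt found at: ", cutadapt_path)
  # Print the installed version as a quick sanity check.
  print(system2(cutadapt_path, "--version", stdout = TRUE))
}
```

If cutadapt is installed somewhere that is not on the PATH (for example inside a Python virtual environment), this check reports it as missing even though the full path would still work.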

Update the parameters in the "params.R" file, which can be found in the "code" subdirectory.

Most of the parameters will not need to be updated because they are either adaptable or will not be referenced in this scope of testing.
However, you may need to update the CUTADAPT_PATH parameter if you are testing the ITS pipeline.
If you are using a Mac, you may also wish to update the MULTITHREAD parameter. By default, multithreading is turned off for Windows computers.
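As a rough illustration of the two parameters mentioned above, an edited excerpt of params.R might look like the sketch below. The path is a placeholder and the platform test is an assumption about how the Windows default could be expressed; neither is copied from the repository, and the real file defines more parameters.

```r
# Hypothetical excerpt of code/params.R, for illustration only.

# Placeholder path; replace it with the location of cutadapt on your machine.
CUTADAPT_PATH <- "/usr/local/bin/cutadapt"

# Assumed logic: multithreading off on Windows, on elsewhere.
# DADA2 functions such as filterAndTrim() also accept an integer number of threads here.
MULTITHREAD <- .Platform$OS.type != "windows"

message("CUTADAPT_PATH = ", CUTADAPT_PATH)
message("MULTITHREAD   = ", MULTITHREAD)
```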

Download the sequence metadata for testing at this Google Drive link, decompress the zipfile, and drop the contents (two files) into the project directory (the base directory of the repository that you just cloned).

In the future, this step will be replaced by a function made specifically for downloading sequence metadata from NEON. But for now, we need to use a workaround because of compatibility issues on NEON's end which will be resolved later this year.

The code for testing can be found in the "testing" subdirectory. This subdirectory contains temporary versions of our vignettes that I made for testing only. Start with the download-neon-data-metadataworkaround.Rmd vignette.

You will probably have to update the "root.dir" RMarkdown parameter at the top of the script. It should refer to the absolute filepath of the project root directory (e.g. .../neonSoilMicrobeProcessing).
Note that the R package dependencies, specified in lines 32-36, must be installed before this vignette will run properly.
In lines 81-82, you will have the option to download either the metadata for ITS sequences or the metadata for 16S sequences (or both). Please respond to this Issue thread to let the other testers know which target gene(s) you will test.
In lines 89-101, different options of subsetting parameters are provided. You could attempt to download and process the entire dataset if you'd like, but I do not even have an estimate of the full download size because these metadata tables include both published and pre-published NEON data. If you do subset the data, please respond to this Issue thread to let the other testers know which subset(s) you will test.
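To make the idea of subsetting by sequencing run concrete, here is a minimal sketch on a made-up metadata table. The column name sequencerRunID and the sample IDs are assumptions for illustration; the vignette's own subsetting parameters should be used for real testing.

```r
# Toy metadata table; the column names and sample IDs are made up for illustration.
meta <- data.frame(
  sequencerRunID = c("B69PP", "B69PP", "B69RF", "BDNB6"),
  sampleID = c("sample1", "sample2", "sample3", "sample4"),
  stringsAsFactors = FALSE
)

# Sequencing runs that this tester has chosen to process.
runs_to_test <- c("B69PP", "B69RF")

# Keep only the rows that belong to the chosen runs.
meta_subset <- meta[meta$sequencerRunID %in% runs_to_test, , drop = FALSE]

# Count the samples per run to confirm that the subset looks as expected.
print(table(meta_subset$sequencerRunID))
```

Posting the runs_to_test vector in the Issue thread lets other testers pick different runs.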

Then move to either the process-its-sequence-to-seqtabs.Rmd or the process-16s-sequence-to-seqtabs.Rmd vignette, depending on which target gene(s) you selected.

You will probably have to update the "root.dir" RMarkdown parameter at the top of the script. It should refer to the absolute filepath of the project root directory (e.g. .../neonSoilMicrobeProcessing).
Note that the R package dependencies, specified in lines 30-34, must be installed before this vignette will run properly.
Both vignettes contain a header which says "All code below is NOT run in this version of the vignette." Please run only the code above this header.
Note that each sequencing run (the unit by which we are subsetting) takes anywhere between 1 and 4 hours to process, depending on the size of the run and the speed of your processor. I've found that 8 GB of RAM is usually sufficient for running this pipeline, but occasionally more RAM is needed.
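Since both vignettes fail when their package dependencies are missing, a quick check before starting a multi-hour run can save time. This is a minimal sketch; the helper missing_packages() is hypothetical, and the package vector is an assumed example rather than the actual list from the vignettes.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Assumed example list; replace it with the packages named at the top of the vignette.
required <- c("dada2", "devtools")

not_installed <- missing_packages(required)
if (length(not_installed) > 0) {
  message("Install before running: ", paste(not_installed, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```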

Reporting Back

If any issues or fatal errors arise, please let me know by replying to me individually, unless it seems obvious that the issue would affect all testers.

Whether you run into a fatal error or are able to complete the pipeline error-free, please report back on this thread and include in your post the output of devtools::session_info().
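One way to capture the session information as plain text for pasting into a post is sketched below. The fallback to sessionInfo() and the output file name are assumptions made for illustration, not part of the request above.

```r
# Capture the session details as lines of text.
# Falls back to base R's sessionInfo() if devtools is not installed (an assumption).
info <- if (requireNamespace("devtools", quietly = TRUE)) {
  capture.output(devtools::session_info())
} else {
  capture.output(sessionInfo())
}

# Write to a temporary file (hypothetical name) and report where it is.
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(info, out_file)
message("Session info written to: ", out_file)
```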

Current Volunteer Assignments

Kabir has tested the ITS pipeline on a Mac for the following subset of data: c("B69PP", "B69RF", "B69RN", "B9994", "BDR3T", "BF8M2", "BF8W6", "BFDG8", "BMCBD", "BMCC4", "BNBWL").
Dan is currently testing the 16S pipeline on a Mac for the following subset of data: c("B69PP", "B69RF", "B69RN", "B9994", "BDNB6", "BF462", "BF8M2", "BFDG8", "BJ8RK", "BMC64", "BMCBJ").
Kai is currently testing the 16S pipeline on a Windows VM and printing the results in this Issue thread: #26

claraqin commented 3 years ago

Dan Liptzin found another error (or two errors) at the end of the DADA2 portion of "Process 16S Sequences":

column names ‘y.x’, ‘y.y’ are duplicated in the result

Finished processing reads in runB69RF at 2020-11-05 01:55:42

Sequencing run-specific sequence tables can be found in /Users/danielliptzin/Dropbox/NEON/Pipeline_Testing/NEON_soil_microbe_processing-master/NEON/raw_sequence/16S/3_seqtabs

Began processing runB69RN at 2020-11-05 01:55:42

Error in filterAndTrim(fnFs, fnFs.cut, fnRs, fnRs.cut, multithread = multithread,  :

  Paired forward and reverse input files must correspond.

Update 11/10/2020: The first of the above two errors, which was actually a non-fatal warning, has been fixed. I have been unable to reproduce the second error.
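For testers who hit the filterAndTrim error, one thing that may be worth checking is whether every forward read file has a matching reverse read file. The sketch below is a hypothetical diagnostic, not a confirmed cause of the error; it assumes file names that differ only by an "_R1" or "_R2" tag, and the example file names are made up.

```r
# Hypothetical diagnostic: list read files that have no partner in the other direction.
# Assumes forward and reverse file names differ only by the tags given below.
unpaired_reads <- function(fnFs, fnRs, fwd_tag = "_R1", rev_tag = "_R2") {
  # Remove the direction tag so that matching pairs share the same key.
  keyF <- sub(fwd_tag, "", basename(fnFs), fixed = TRUE)
  keyR <- sub(rev_tag, "", basename(fnRs), fixed = TRUE)
  list(
    forward_only = fnFs[!keyF %in% keyR],
    reverse_only = fnRs[!keyR %in% keyF]
  )
}

# Made-up file names: sampleB has no reverse read file.
fnFs <- c("sampleA_R1.fastq.gz", "sampleB_R1.fastq.gz")
fnRs <- c("sampleA_R2.fastq.gz")

res <- unpaired_reads(fnFs, fnRs)
print(res)
```

An incomplete download of one sequencing run is one plausible way to end up with unpaired files, so re-downloading the affected run may be worth trying.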

claraqin commented 3 years ago

EDIT 11/10/2020:

On second thought, I may be getting ahead of myself. @lstanish is currently working on a set of improvements to the first vignette (Download NEON Data) that involves changes to the functions and to the ways we work with the metadata. I think it will make more sense to move on to this second phase of testing after she has completed these. @lstanish, if you think that makes sense, can you please note here when you've finished? I'll try to respond to any questions or concerns as soon as possible.

Hi everyone, (especially new testers @kjnaithani and @yyachung)

I had hoped to create a more structured set of instructions for "Phase 2" testing before going on my two-week leave which begins tomorrow, but I wasn't able to complete them. In short, though, I am asking volunteers to read through the docs and provide suggestions on how to make the package more flexible and user-friendly.

When I say "docs", I am referring to the function documentation and the vignettes, since these will be downloaded as part of the R package.

You can preview some of the documentation that I've written by viewing the raw .Rd files (generated by roxygen2) in the folder called "man". You can also simulate looking up the help pages for the functions using the ? command by first pulling the recent changes, running devtools::document() in the project directory, and running something like ?downloadSequenceMetadata. The documentation should pop up in the RStudio Help panel as usual. You can also view the raw documentation directly by reading the "roxygen" comments above each function in ./R/utils.R.
To view the vignettes, go to ./testing/, which holds versions of the vignettes with slight modifications for the testing pipeline. These are .Rmd files, however, so they are not formatted as nicely as a rendered document. To read them in rendered form, you will have to Knit the .Rmd files, turning them into HTML documents or PDFs.
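For anyone who prefers the console to the Knit button in RStudio, the sketch below lists the vignettes and shows one way to knit one of them with rmarkdown::render(). The two helper functions are hypothetical, and the default folder name assumes that the working directory is the project root.

```r
# Hypothetical helper: list the .Rmd vignettes in a folder ("testing" by default).
list_vignettes <- function(dir = "testing") {
  list.files(dir, pattern = "\\.Rmd$", full.names = TRUE)
}

# Hypothetical helper: knit one vignette to HTML in a temporary folder.
# render() returns the path of the output file.
knit_vignette <- function(rmd, out_dir = tempdir()) {
  rmarkdown::render(rmd, output_format = "html_document", output_dir = out_dir)
}

vignettes <- list_vignettes()
print(vignettes)

# Knitting executes the code chunks in the vignette, so it is left commented out here.
# knit_vignette(vignettes[1])
```

Because knitting runs the code, the "root.dir" parameter and the package dependencies described in the Phase 1 instructions need to be in place first.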

Please note any confusing wording, errors, or suggestions you come across in this Issue thread, unless it is sufficiently confusing that it deserves its own thread.

claraqin commented 3 years ago

@yyachung encountered this error while running the 16S processing pipeline:

I got the download data vignette to work. I’m trying the “process-16s-sequences-to-seqtabs.Rmd” with run C38TW 16s data. I hit an error running the “Process reads” code chunk: “Error in add(bin) : record does not start with '@'”
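A FASTQ record begins with a header line that starts with "@", so this message suggests that at least one input file is not valid FASTQ at some point. The sketch below is a hypothetical first check that looks only at the first line of each file; a file that is damaged further in would still pass. The demo files are created on the fly for illustration.

```r
# Hypothetical check: does the first line of a (possibly gzipped) file start with "@"?
# gzfile() also reads uncompressed files when opened for reading.
starts_with_at <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  first <- readLines(con, n = 1, warn = FALSE)
  length(first) == 1 && startsWith(first, "@")
}

# Demo files (made up): one minimal FASTQ record and one file that is not FASTQ.
good <- tempfile(fileext = ".fastq")
writeLines(c("@read1", "ACGT", "+", "IIII"), good)
bad <- tempfile(fileext = ".fastq")
writeLines("<html>not a FASTQ file</html>", bad)

files <- c(good, bad)
ok <- vapply(files, starts_with_at, logical(1))
print(unname(ok))
```

Pointing files at the raw sequence folder for the affected run would show which downloads, if any, fail this first-line check.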

claraqin / neonMicrobe

Instructions for pipeline testing #27
