claraqin / neonMicrobe

Processing NEON soil microbe marker gene sequence data into ASV tables.
GNU Lesser General Public License v3.0

Fastq files must be in fastq.gz format for sequence processing pipeline #21

Closed claraqin closed 3 years ago

claraqin commented 4 years ago

Per a conversation between @zoey-rw and me, the .fastq files that are reorganized into the 0_raw subfolders at the end of the download-neon-data.Rmd vignette need to be turned into .fastq.gz files, or else this error is reached during the process-[16s/its]-sequences.Rmd vignette:

dnaio.exceptions.FastqFormatError: Error in FASTQ file at line 1: Line expected to start with '@', but found '\x1f'

The .fastq files are actually already compressed, so this can be addressed by including a line at the end of the download-neon-data.Rmd vignette that just appends ".gz" to the end of each filename.
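For illustration, the '\x1f' in the error above is the first byte of the gzip magic number (0x1f 0x8b), which is why the parser does not find the '@' it expects. A minimal Python sketch of the rename idea follows. It is not code from this package; the function names `is_gzipped` and `append_gz_suffix` are hypothetical, and it assumes the files sit directly in one folder.

```python
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file content starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def append_gz_suffix(directory):
    """Rename *.fastq files that are really gzip-compressed to *.fastq.gz.

    Hypothetical helper: files that are genuinely uncompressed are left alone.
    Returns the list of new paths.
    """
    renamed = []
    for path in sorted(Path(directory).glob("*.fastq")):
        if is_gzipped(path):
            target = path.with_name(path.name + ".gz")
            path.rename(target)
            renamed.append(target)
    return renamed
```

Checking the content before renaming avoids putting a ".gz" suffix on a file that is actually plain text.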

claraqin commented 3 years ago

I got to the root of this issue. The filterAndTrim() function, which our qualityFilterITS() and qualityFilter16S() functions use, has an argument called compress that is TRUE by default. This compresses the files output by the quality filter, but it doesn't update the filenames to end with ".gz". After filtering, the ITS sequences undergo primer trimming (trimPrimersITS()), and this function expects files ending with ".gz" to be compressed and files without ".gz" to be uncompressed, so it gets confused by the mismatch between compression and filename.
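To make the mismatch concrete: a reader that decides by filename will treat a compressed file without ".gz" as plain text. One alternative, sketched below in Python, is to decide from the file content instead. This is only an illustration of the idea, not how trimPrimersITS() or the tools it calls behave; `open_fastq_text` is a hypothetical name.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_fastq_text(path):
    """Open a FASTQ file for text reading, whatever its filename says.

    Hypothetical helper: the first two bytes decide whether the file is
    read through gzip or as plain text.
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")
```

With this approach a compressed file named "sample.fastq" and a plain one with the same name would both yield a first line starting with '@'.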

Still need to fix this issue. It's being made tricky by the fact that the DADA pipelines currently rely on having constant filenames throughout the pipeline, and appending ".gz" would change those filenames.

claraqin commented 3 years ago

Fixed in latest commit by gzipping all files as a final step in organizeRawSequenceData().
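As a rough Python sketch of what "gzipping all files as a final step" can look like (the actual fix lives in the R function organizeRawSequenceData() and may differ), the hypothetical helper below compresses plain .fastq files and only renames ones that are already gzip-compressed, so that every file ends up as a real .fastq.gz.

```python
import gzip
import shutil
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def gzip_all_fastq(directory):
    """Ensure every *.fastq file in `directory` becomes a valid *.fastq.gz.

    Hypothetical helper. Already-compressed files are renamed rather than
    compressed a second time. Returns the list of resulting paths.
    """
    results = []
    for path in sorted(Path(directory).glob("*.fastq")):
        target = path.with_name(path.name + ".gz")
        with open(path, "rb") as fh:
            already_compressed = fh.read(2) == GZIP_MAGIC
        if already_compressed:
            path.rename(target)
        else:
            # Stream the plain file into a new gzip file, then drop the original.
            with open(path, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()
        results.append(target)
    return results
```

Doing this once, before any filtering, means the filenames can stay constant for the rest of the pipeline.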