claraqin / neonMicrobe

Processing NEON soil microbe marker gene sequence data into ASV tables.
GNU Lesser General Public License v3.0

Make selective downloading more robust #8

Closed claraqin closed 4 years ago

claraqin commented 4 years ago

The sequence data downloading functions in utils.R should be made more robust. They currently assume a standard file naming structure, so even minor changes in naming conventions could break them.

Suggestions from @lstanish :

In my experience the best way to ensure you are filtering for the correct sequence data is to start by downloading the metadata and doing some table joining and filtering. Here’s some example code using neonUtilities that combines the respective sequencing data table (16S or ITS) with the raw data files table. What you end up with is a data.frame containing the rawDataFileName values for just the 16S or ITS data.

library(neonUtilities)
library(plyr)
library(dplyr)

mmgL1 <- loadByProduct('DP1.10108.001', package = 'expanded', check.size = FALSE) # output is a list with one element per metadata table

# extract list elements into data.frames
seq16S <- mmgL1$mmg_soilMarkerGeneSequencing_16S
seqITS <- mmgL1$mmg_soilMarkerGeneSequencing_ITS
raw <- mmgL1$mmg_soilRawDataFiles

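# double check that the join keys are of class character and not factor/logical;
# coercing them up front avoids type mismatches in the joins below
join_keys <- c('dnaSampleID', 'sequencerRunID', 'internalLabID')
raw[join_keys] <- lapply(raw[join_keys], as.character)
seq16S[join_keys] <- lapply(seq16S[join_keys], as.character)
seqITS[join_keys] <- lapply(seqITS[join_keys], as.character)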

# Join 16S metadata
# use !grepl() rather than -grep(): -grep() drops every row when there is no match
raw16S <- raw[!grepl("ITS", raw$rawDataFileName), ]
joined16S <- left_join(raw16S, seq16S, by=c('dnaSampleID', 'sequencerRunID', 'internalLabID'))
joined16S <- joined16S[!is.na(joined16S$uid.y), ]

# Join ITS metadata
# use !grepl() rather than -grep(): -grep() drops every row when there is no match
rawITS <- raw[!grepl("16S", raw$rawDataFileName), ]
joinedITS <- left_join(rawITS, seqITS, by=c('dnaSampleID', 'sequencerRunID', 'internalLabID'))
joinedITS <- joinedITS[!is.na(joinedITS$uid.y), ]

From here you can subset the data to include only the runs or samples of interest.
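
As a rough sketch of that subsetting step, the snippet below filters a joined table by sequencing run or by sample. The toy data frame stands in for joined16S, and every value in it (sample IDs, run IDs, file names) is made up for illustration; only the column names come from the code above.

```r
# Toy stand-in for joined16S; all values are hypothetical.
joined16S <- data.frame(
  dnaSampleID     = c("S1", "S2", "S3", "S4"),
  sequencerRunID  = c("RUN_A", "RUN_A", "RUN_B", "RUN_B"),
  rawDataFileName = c("S1_R1.fastq.gz", "S2_R1.fastq.gz",
                      "S3_R1.fastq.gz", "S4_R1.fastq.gz"),
  stringsAsFactors = FALSE
)

# Hypothetical selections
runs_of_interest    <- c("RUN_B")
samples_of_interest <- c("S1")

# %in% treats NA as "no match", so rows with missing IDs are dropped
# instead of turning into all-NA rows as they would with ==
by_run    <- joined16S[joined16S$sequencerRunID %in% runs_of_interest, ]
by_sample <- joined16S[joined16S$dnaSampleID %in% samples_of_interest, ]

# File names that a download step would then need to fetch
files_to_download <- unique(by_run$rawDataFileName)
```

The resulting vector of file names can then be matched against whatever the download function expects, without relying on the structure of the names themselves.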

lstanish commented 4 years ago

@claraqin Questions on modifying the metadata downloading code:

claraqin commented 4 years ago

Sorry for the slow response.

lstanish commented 4 years ago

@claraqin I created a new version of the metadata downloading function, called downloadSequenceMetadataRev(). Did a bit of testing, but would be great for you and/or others in the group to also test and provide feedback.

Will shift to working on a revised sequence data downloading function that uses neonUtilities.

lstanish commented 4 years ago

@claraqin Thanks for the code additions, the new functionality is great! Here's a list of the most recent updates:

lstanish commented 4 years ago

@claraqin one other question: at some point do you want to clean up the older versions of functions in utils.R? Would be good to update the name of this function so that it isn't a 'rev' anymore. No rush, just wanted to write it down so we don't forget!

claraqin commented 4 years ago

Hi Lee,

Thanks for bringing this up! I'll clean up the older versions in my next commit. And sorry for the slow response – I don't know how to receive notifications for Issue thread replies. I'm making a note to figure that out too.

Clara

claraqin commented 4 years ago

I think this has been resolved as of the most recent commit, which gives downloadSequenceMetadataRev the ability to handle tarballs.
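
For anyone reading later, here is a minimal sketch of one way raw data file names could be split into tarballs and individual files. It is not the actual code in downloadSequenceMetadataRev; the file names and the extension-based rule are assumptions made for illustration.

```r
# Hypothetical file names, made up for illustration.
raw_files <- c("runA_sample1_R1.fastq.gz",
               "runB_fastq.tar.gz",
               "runC_sample2_R2.fastq")

# TRUE where the name ends in a tar archive extension (.tar, .tar.gz, or .tgz).
# Assumption: tarballs can be recognized by extension alone.
is_tarball <- grepl("\\.(tar|tar\\.gz|tgz)$", raw_files, ignore.case = TRUE)

tarballs     <- raw_files[is_tarball]
single_files <- raw_files[!is_tarball]
```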