claraqin / neonMicrobe

Processing NEON soil microbe marker gene sequence data into ASV tables.
GNU Lesser General Public License v3.0

Soil database construction cannot handle missing data #13

Closed zoey-rw closed 4 years ago

zoey-rw commented 4 years ago

The soil data download functions (downloadRawSoilData() and downloadAllRawSoilData()) throw an error if there is nothing to retrieve from soil chemistry (DP1.10107.001) or soil physical properties (DP1.10108.001). Because the output of the download script is passed into the SQLite database in create_soil_database.R, missing columns in the function output will cause downstream issues. The error occurs, for example, with these preset parameters, which have no soil chemistry data:

PRESET_SITES = c("OSBS", "CPER", "CLBJ")
PRESET_START_YR_MO = "2018-03"
PRESET_END_YR_MO = "2018-07"
TARGET_GENE = "16S"
SEQUENCING_RUNS = c("C5B2R")
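One way to guard against this, shown here only as a minimal sketch and not as the package's actual code: wrap the download call so that an error or an empty result yields a zero-row table that still has the expected columns. The names `EXPECTED_CHEM_COLS`, `empty_table()`, and `safe_download()` are hypothetical, and the column list is illustrative rather than the full soil chemistry schema.

```r
# Illustrative column set (an assumption); the real soil tables have many more fields.
EXPECTED_CHEM_COLS <- c("siteID", "sampleID", "collectDate",
                        "nitrogenPercent", "organicCPercent")

# Build a zero-row data frame that still carries the expected column names.
empty_table <- function(cols) {
  as.data.frame(
    setNames(replicate(length(cols), character(0), simplify = FALSE), cols),
    stringsAsFactors = FALSE
  )
}

# Call any download function; on error or empty result, return an empty but
# well-formed table instead of failing. fetch_fn is a hypothetical stand-in
# for whatever function actually retrieves the data product.
safe_download <- function(fetch_fn, cols, ...) {
  result <- tryCatch(fetch_fn(...), error = function(e) {
    message("No data retrieved: ", conditionMessage(e))
    NULL
  })
  if (is.null(result) || nrow(result) == 0) {
    return(empty_table(cols))
  }
  # Add any expected columns that the download did not return, filled with NA.
  for (col in setdiff(cols, names(result))) {
    result[[col]] <- NA
  }
  result
}

# Usage: a fetch that fails still produces a table with the expected columns.
chem <- safe_download(function() stop("nothing to retrieve"), EXPECTED_CHEM_COLS)
```

With this shape, whatever consumes the output can rely on a fixed set of columns even when a site and date range has no soil chemistry records.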

The name downloadAllRawSoilData() can also be misleading, since the function relies on the preset parameters rather than downloading all of NEON's available soil data. One option would be for downloadAllRawSoilData() to download, by default, all the data for the preset sites, or for all sites (I'm not sure how long this would take, though).

Another approach may be to create a full SQLite database and store it either 1) remotely or 2) within the R package, so that it can be queried to match the sequencing files. I'm not sure how much memory this full database would take up at this point in time, but it might be a manageable size. One implementation of the remote approach is within the metScanR package:

"The DB is updated frequently and hosted externally of the metScanR package. Upon loading the metScanR package via library(metScanR), the DB is accessed via internet connection and installed locally to the user's computer via metScanR's updateDatabase() function."

claraqin commented 4 years ago

Thanks Zoey! I think we should probably remove the SQLite component and rely on regular joined tables. I also meant to delete the downloadAllRawSoilData() function, since its usefulness is limited. I can take care of this before our meeting in 1.5 weeks.
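As a rough illustration of what "regular joined tables" could look like, here is a sketch using base R's merge(). The table contents, sample IDs, and column names are made up for the example and are not the package's real schema. The left joins (all.x = TRUE) keep every sequenced sample and fill unmatched soil fields with NA, so an empty soil chemistry table no longer causes an error.

```r
# Illustrative tables; all values and column names here are assumptions.
seq_meta <- data.frame(
  sampleID = c("CPER_001", "OSBS_002", "CLBJ_003"),
  sequencerRunID = "C5B2R",
  stringsAsFactors = FALSE
)
soil_phys <- data.frame(
  sampleID = c("CPER_001", "OSBS_002"),
  soilTemp = c(18.2, 24.5),
  stringsAsFactors = FALSE
)
# Zero-row table standing in for a site/date range with no soil chemistry data.
soil_chem <- data.frame(
  sampleID = character(0),
  nitrogenPercent = numeric(0),
  stringsAsFactors = FALSE
)

# Left joins: keep all rows of the sequence metadata, add soil fields where available.
joined <- merge(seq_meta, soil_phys, by = "sampleID", all.x = TRUE)
joined <- merge(joined, soil_chem, by = "sampleID", all.x = TRUE)
```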

zoey-rw commented 4 years ago

I actually am about to push a new script that does just this (removes the SQLite component), which makes the dataframe lightweight enough to include with the package if we want to do that. There are still some file path things to work out, but I think it should work fine!

claraqin commented 4 years ago

Fixed in the latest commit! The new downloadRawSoilData function behaves a lot like the downloadSequenceMetadataRev function that Lee designed.