NEONScience / NEON-utilities

Utilities and scripts for working with NEON data. Currently: an R package with functions to join (stack) the month-by-site files in downloaded NEON data, to convert data to geoCSV format, and to download data from the API.
GNU Affero General Public License v3.0

stackEddy fails when folder contains .gz files #104

Closed · rfiorella closed this issue 3 years ago

rfiorella commented 3 years ago

Function: stackEddy

Describe the bug: NEON has started gzipping the EC files available through the API. If a user downloads a zip file from the API (using zipsByProduct, for example), stackEddy will crash if the unzipped archive contains .h5.gz files instead of .h5 files.

To reproduce:

```r
zipsByProduct("DP4.00200.001", site = "ONAQ", startdate = "2019-01-01",
              enddate = "2019-12-31", savepath = "~/Desktop/NEON/",
              check.size = FALSE)

lhflux <- stackEddy("~/Desktop/NEON/filesToStack00200", avg = 30)
```

Expected behavior: stackEddy should look through the files and bind all of the dp04 data into a list.


Additional context: I've already developed a candidate fix, available on my fork of NEON-utilities under the gzip-stackeddy-fix branch; feel free to open a PR from it. I tried to open one myself, but since the feature branch doesn't exist in the NEONScience repo, I could only open a PR against the master branch, which I'd guess is not the preferred approach initially.

The fix I've uploaded will likely require moving R.utils from Suggests to Imports in DESCRIPTION, with corresponding updates to NAMESPACE.
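For illustration, a minimal sketch of the kind of .gz handling involved, assuming R.utils is installed (this is not the actual code on the fork):

```r
# Minimal sketch (not the actual fix on the fork): decompress any
# .h5.gz files under the download directory before stacking.
# Assumes R.utils is installed.
gz <- list.files("~/Desktop/NEON/filesToStack00200",
                 pattern = "\\.h5\\.gz$", recursive = TRUE, full.names = TRUE)
for (f in gz) {
  R.utils::gunzip(f, destname = sub("\\.gz$", "", f),
                  overwrite = TRUE, remove = FALSE)
}
```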

cklunch commented 3 years ago

@rfiorella Thanks for the alert about this! It should be an easy fix; thanks for sharing the code on your fork. There are some other pending changes to stackEddy() on the stack-from-store branch (which isn't fully functional yet), so I just need to decide on the best way to integrate this fix. I'll let you know.

rfiorella commented 3 years ago

No problem @cklunch! There's an additional issue that's a little more difficult to fix from the user side: current zip files often contain multiple versions of EC data for a given site and month (e.g., HARV in 2019 has 3 versions of the monthly basic file, with different processing dates, wrapped up in the zip file).

I think it would be pretty straightforward to look for 'duplicates' here, (optionally?) have stackEddy gzip the old ones, and then stack the most recent version of each site-month; a sketch of what that might look like is below. I'm happy to take a crack at this, but it might be a little while before I can get to it.
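A hypothetical sketch of that deduplication step (not code from the fork or the package); it assumes NEON's standard EC file naming, where the final timestamp in the name is the processing (generation) time:

```r
# Sketch: keep only the newest file per site-month, assuming names like
# NEON.D01.HARV.DP4.00200.001.nsae.2019-01.basic.20201020T201032Z.h5
keep_newest <- function(files) {
  fn    <- basename(files)
  site  <- sub("^NEON\\.D[0-9]{2}\\.([A-Z]{4})\\..*$", "\\1", fn)
  month <- sub("^.*\\.([0-9]{4}-[0-9]{2})\\..*$", "\\1", fn)
  gen   <- sub("^.*\\.([0-9]{8}T[0-9]{6}Z)\\.h5(\\.gz)?$", "\\1", fn)
  # ISO-style timestamps sort correctly as strings, so take the
  # lexicographic maximum within each site-month group
  keep <- vapply(split(seq_along(files), paste(site, month)),
                 function(i) i[order(gen[i], decreasing = TRUE)[1]],
                 integer(1))
  files[sort(keep)]
}
```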

cklunch commented 3 years ago

@rfiorella Yeah, the duplicate files are a known issue; the cyberinfrastructure team is working on fixing it. So in theory it will be fixed and become a non-issue soon, but you make a good point that it could also be handled pretty easily in the stacking code. I'll take a look at doing that along with the other code changes.

libonancaesar commented 3 years ago

I ran into a similar error while using the stackEddy() function. (A screenshot was attached but failed to upload.)

cklunch commented 3 years ago

@rfiorella @libonancaesar The fix is in on the stack-from-store branch. The other functions on that branch aren't ready to go yet, but the stackEddy() function there handles both the .gz files and the duplicate files. I'm hoping to be able to merge the branch in later this week.

cc @cflorian1

rfiorella commented 3 years ago

@cklunch Great, thanks! I'll check out the stack-from-store branch and test it out.
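For anyone following along, one way to install the branch for testing (a sketch assuming devtools is available; the R package source lives in the neonUtilities subdirectory of the NEON-utilities repo):

```r
# Sketch: install the package from the stack-from-store branch.
# Assumes the devtools package is installed.
devtools::install_github("NEONScience/NEON-utilities/neonUtilities",
                         ref = "stack-from-store")
```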

cklunch commented 3 years ago

@rfiorella There was a bug in the duplicate file handling, now fixed.

rfiorella commented 3 years ago

@cklunch I can confirm that stackEddy now runs in the cases where it previously did not, and only stacks one file per month. However, it seems to be picking the oldest file instead of the newest where there are duplicates. Based on your comments yesterday, though, I'm not sure this is worth fixing if CI is already working on a fix for the duplicates.

EDIT, 10/30/20: My error! I'm not entirely sure how, but the new code hadn't been pulled into my clone of NEON-utilities. After re-fetching, merging, and rebuilding, it works just fine now.

cflorian1 commented 3 years ago

@rfiorella @cklunch

I'll know what the CI plan is for fixing the duplicate files after a 1 pm meeting today; I can report back in a couple of hours.

cflorian1 commented 3 years ago

@rfiorella @cklunch

Sounds like this is a bug; CI had a working solution when it was tested at the start of reprocessing. I hope it will be an easy fix, but I don't have any information about the timeline yet.

cklunch commented 3 years ago

@rfiorella @cflorian1 These changes are now merged to master. I have 3 more changes to make to the package before I send it to CRAN, but they're fairly simple; I'm hoping to knock those out in the next day or two.

cflorian1 commented 3 years ago

@cklunch @rfiorella Great, thanks! FYI this should be resolved on the data portal side pretty soon.

It turns out the .h5.gz extension broke the script that removes old versions of a file from the portal, because it changed the position of the timestamp in the file name; see the illustration below.
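A hypothetical illustration of that failure mode (not the portal's actual script): a fixed-position parse of the file name loses the timestamp once .gz is appended, while a pattern match does not.

```r
f1 <- "NEON.D01.HARV.DP4.00200.001.nsae.2019-01.basic.20201020T201032Z.h5"
f2 <- paste0(f1, ".gz")

# Position-based: take the second-to-last dot-separated field
pos_ts <- function(f) {
  p <- strsplit(f, ".", fixed = TRUE)[[1]]
  p[length(p) - 1]
}
pos_ts(f1)  # "20201020T201032Z" -- correct
pos_ts(f2)  # "h5" -- wrong once .gz is appended

# Pattern-based: match the timestamp wherever it sits in the name
pat_ts <- function(f) regmatches(f, regexpr("[0-9]{8}T[0-9]{6}Z", f))
pat_ts(f2)  # "20201020T201032Z" -- still correct
```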

rfiorella commented 3 years ago

Thanks @cflorian1!

cklunch commented 3 years ago

@rfiorella @cflorian1 v1.3.8 is at CRAN now.