UCHIC / iUTAHData

The iUTAH Modeling and Data Federation website - data.iutahepscor.org.
BSD 3-Clause "New" or "Revised" License

Duplicate files at several sites #83

Closed AmberSJones closed 7 years ago

AmberSJones commented 7 years ago

I have noticed that duplicate files have been uploaded to both CKAN and HydroShare for several sites. The files are typically for 2014 or 2015. When I open the files, they appear to be the same, but some of them don't have the same number of values. I haven't had a chance to look closely to figure out what is going on. The sites that I have noticed are all in Logan River: LR_FB_C, LR_GC_C, LR_MainStreet_BA, LR_TG_C, and LR_Mendon_AA.

(I started removing the apparent duplicates from a few of the landing pages before I realized how widespread it was, so they might not all show up.)
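One way to tell exact duplicates apart from files that only look alike is to compare a checksum and a line count for each pair. The following is a minimal sketch, not part of the iUTAH tooling; the function names are hypothetical, and it assumes both files have already been downloaded locally as text files with one value per line.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def line_count(path):
    """Count the lines in a file (header lines included)."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def compare_files(path_a, path_b):
    """Report whether two files are byte-identical and how their line counts differ."""
    lines_a, lines_b = line_count(path_a), line_count(path_b)
    return {
        "identical": sha256_of(path_a) == sha256_of(path_b),
        "lines_a": lines_a,
        "lines_b": lines_b,
        "line_difference": lines_b - lines_a,
    }
```

A pair with `identical` set to `False` and a nonzero `line_difference` would be a candidate for a closer look, since one copy may come from an earlier or partial export.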

fryarludwig commented 7 years ago

Looking into this. The issue may be with an older version of the script being run on someone else's computer, probably as part of testing. The logs on the 'official' version of the uploading script show that no extra files were created on this latest upload, and there are no duplicate files that resemble what we're seeing on CKAN/HydroShare.

As a side note, I can have the script remove the duplicates from HS automatically, but that functionality isn't built into the CKAN side of the tool.

I'll let you know when I find out more.

fryarludwig commented 7 years ago

Duplicate files on HydroShare were removed. HydroShare's duplicate-file naming scheme is FileName_random_chars.Extension, which accounted for about half the duplicates. The other half were Windows-style (Mac too?), with a sequential number in parentheses. I've also updated the code in iUtahUtilities and merged it into master, so we should make sure we only use the newest version of that when mass-uploading iUtah GAMUT's raw data.
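The two naming schemes described above can be matched with regular expressions. The sketch below is illustrative only and is not the code in iUtahUtilities. It assumes the random suffix is seven alphanumeric characters (the actual length and alphabet are not stated here), and the file names in the usage example are made up. A name is only treated as a duplicate when the file it would be a copy of is also present, which avoids flagging originals whose names happen to end in a similar pattern.

```python
import os
import re
from collections import defaultdict

# Assumption: the random suffix is 7 alphanumeric characters.
# Adjust the quantifier to match what HydroShare actually produces.
RANDOM_SUFFIX = re.compile(r"^(?P<stem>.+)_[A-Za-z0-9]{7}$")
# Windows-style copies: "FileName (1).Extension", "FileName (2).Extension", ...
NUMBERED_SUFFIX = re.compile(r"^(?P<stem>.+?) ?\(\d+\)$")


def original_name(filename, known_names):
    """Return the name this file appears to duplicate, or the name itself."""
    stem, ext = os.path.splitext(filename)
    for pattern in (NUMBERED_SUFFIX, RANDOM_SUFFIX):
        match = pattern.match(stem)
        if match:
            candidate = match.group("stem") + ext
            # Only count it as a duplicate if the original is present too.
            if candidate in known_names:
                return candidate
    return filename


def find_duplicates(filenames):
    """Map each original file name to the list of its apparent duplicates."""
    known = set(filenames)
    groups = defaultdict(list)
    for name in filenames:
        original = original_name(name, known)
        if original != name:
            groups[original].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = ["LR_FB_C.csv", "LR_FB_C_aB3dE9x.csv", "LR_FB_C (1).csv", "LR_GC_C.csv"]
    print(find_duplicates(files))
```

Matching on names alone says nothing about file contents, so a name match is best treated as a candidate for removal rather than proof of duplication.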

CKAN has not been cleaned up yet, and its duplicates make a bit less sense to me. The duplicate files show they were last updated in Jan. 2015. I can create an automated way to remove the duplicates if you want, or they can be removed manually.

AmberSJones commented 7 years ago

I'm not too worried about CKAN, but I want to make sure that the HydroShare side of things is clean and correct. Do you have any sort of check enabled to make sure that this doesn't crop up again?

I'll close the issue.

fryarludwig commented 7 years ago

I pushed a version to master that removes duplicates of each file it uploads to HydroShare by default. This won't completely fix the issue if it crops up again, but it'll make sure the latest versions of our data never have duplicates.