Open kcondon opened 5 years ago
I'll add one more...
@dlowenberg @mfenner it would be good to get some thoughts from you and the team on how other groups have handled this. Thanks in advance for any guidance or for pointing us to any docs!
Hi there, if you would like to look at or copy the code that we wrote for Dryad in processing the last ten years of downloads, here is some info that may be useful:
The main reporting code is here: https://github.com/datadryad/dryad-repo/blob/dryad-master/dspace/modules/api/src/main/java/org/dspace/curate/DashStats.java (though it’s pretty specific to the existing Dryad setup). It writes out a text file formatted for the counter-processor, but sorted by dataset. A separate script then re-sorts everything by time: https://github.com/datadryad/dryad-utils/blob/master/dash-migration/sort_dash_stats.sh
Happy to set up time for you to talk with Ryan Scherle (Dryad) if that would be helpful. Otherwise, the DataCite and DataONE folks may also have some tips.
Thanks @dlowenberg! I'll check in with the team here and we'll get back to you if we feel a discussion with Dryad is needed. Thanks again!
P.S. I just pinged you on another issue in the main Dataverse repo: https://github.com/IQSS/dataverse/issues/5957
I picked this up out of the sprint column today to begin stubbing out documentation regarding migrating counts and other things that installations will need to know to use Make Data Count in production, but I don't have the bandwidth this week. I will re-visit early next week.
@djbrooke if you're stubbing out documentation, you might want to create a branch for https://github.com/IQSS/dataverse/issues/6082 which was just opened. The issue title is "Documentation: Some tweaks to Make Data Count doc based on recent experience".
Note that because of the large backlog of logs for HDV, processing them may take a long time.
FYI: Be aware that beyond the setup instructions in the guides, a one-line bug fix is needed in Counter Processor to make MDC reporting to DataCite work. See https://github.com/IQSS/dataverse/issues/10334.
One random thought: one of the prod. servers, dvn-cloud-rserv-1.lib.harvard.edu, is currently underutilized and could be a prime candidate for running that processor on the accumulated logs.
Heads up that Counter Processor was archived yesterday:
As discussed at standup, I forked the repo:
This was accidentally and automatically closed when https://github.com/IQSS/dataverse/pull/10424 was merged. Re-opening.
The MDC feature is well documented, but a few items need to be addressed before operating in a production environment:
- Create cron job(s) to call the API endpoints needed to process the various files and import them into the db, including error detection and notification of failure
- Acquire and configure a production web token that allows publishing stats to DataCite
- Consider, plan for, and monitor growth of log files
- Consider how to troubleshoot or rerun failed jobs
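
To make the cron and error-handling items above a bit more concrete, here is a rough sketch of a wrapper that a cron entry could invoke. It is only an illustration, not something from the guides: `run_steps`, `post`, and `notify_by_print` are hypothetical names, and the endpoint URL and report path in the usage section are assumptions that should be checked against the Make Data Count documentation for your installation. The idea is to run each step in order, retry a failed step a limited number of times, and send a notification and stop as soon as a step cannot be completed, so that the job can be rerun from a known point.

```python
import time
import urllib.request
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step], notify: Callable[[str], None],
              retries: int = 1, delay_seconds: float = 0.0) -> List[str]:
    """Run steps in order and return the names of the steps that completed.

    Each step is retried up to `retries` extra times. If it still fails,
    `notify` is called once and the remaining steps are skipped.
    """
    completed: List[str] = []
    for name, action in steps:
        for attempt in range(1, retries + 2):
            try:
                action()
                completed.append(name)
                break
            except Exception as exc:
                if attempt > retries:
                    notify(f"MDC job failed at step '{name}' "
                           f"after {attempt} attempt(s): {exc}")
                    return completed
                time.sleep(delay_seconds)
    return completed


def post(url: str) -> None:
    """Send an empty POST; urlopen raises on 4xx/5xx responses."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=300) as response:
        response.read()


def notify_by_print(message: str) -> None:
    """Placeholder notifier; a real job might send email instead."""
    print(message)


if __name__ == "__main__":
    # Assumed values: adjust the host, endpoint, and path for your setup.
    report = "/tmp/make-data-count-report.json"
    import_url = ("http://localhost:8080/api/admin/makeDataCount/"
                  f"addUsageMetricsFromSushiReport?reportOnDisk={report}")
    run_steps([("import SUSHI report", lambda: post(import_url))],
              notify_by_print, retries=2, delay_seconds=60)
```

Because `run_steps` returns the list of completed step names, a rerun after a failure can skip what already succeeded, which may help with the troubleshooting item.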