Operationalize MDC: Create Cron Jobs, Acquire, Configure Prod Web Token, Handle Logs

IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Operationalize MDC: Create Cron Jobs, Acquire, Configure Prod Web Token, Handle Logs #3

Open kcondon opened 5 years ago

kcondon commented 5 years ago

The MDC feature is well documented, but a few items need to be addressed before it can operate in a production environment:

- Create cron job(s) to call the various API endpoints needed to process the various files and import them into the database, including error detection and notification of failure.
- Acquire and configure a production web token that allows publishing stats to DataCite.
- Consider, plan for, and monitor the growth of log files.
- Consider how to troubleshoot or rerun failed jobs.
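As a starting point for discussion, here is a minimal Python sketch of how a cron-invoked wrapper might cover error detection, failure notification, and rerunning from a failed step. Everything in it is an assumption for illustration: the `Step`, `RunResult`, and `run_pipeline` names are hypothetical, the step names are placeholders, and the actual API calls and notification channel are left as callables to be filled in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One stage of the nightly job (hypothetical structure)."""
    name: str
    action: Callable[[], None]  # expected to raise an exception on failure


@dataclass
class RunResult:
    completed: List[str] = field(default_factory=list)
    failed: Optional[str] = None
    error: Optional[str] = None


def run_pipeline(steps: List[Step],
                 notify: Callable[[str], None],
                 start_at: Optional[str] = None) -> RunResult:
    """Run steps in order, stop at the first failure, and notify.

    start_at lets an operator rerun from a named step after fixing
    the cause of an earlier failure.
    """
    result = RunResult()
    started = start_at is None
    for step in steps:
        if not started:
            if step.name != start_at:
                continue  # skip steps before the requested restart point
            started = True
        try:
            step.action()
        except Exception as exc:
            result.failed = step.name
            result.error = str(exc)
            notify(f"MDC job failed at step '{step.name}': {exc}")
            break  # later steps depend on earlier ones, so stop here
        result.completed.append(step.name)
    return result
```

A cron entry would call a script that builds the list of steps and passes in a notifier (for example, a function that sends email). After a failure, the same script could be rerun with `start_at` set to the step named in the notification.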

djbrooke commented 5 years ago

I'll add one more: we'll need to figure out how to handle pre-MDC download counts. I'd like to reflect them so that researchers don't need to start at zero. :)

@dlowenberg @mfenner it would be good to get some thoughts from you and the team on how other groups have handled this. Thanks in advance for any guidance or for pointing us to any docs!

dlowenberg commented 5 years ago

Hi there, if you would like to look at or copy the code that we wrote for Dryad in processing the last ten years of downloads, here is some info that may be useful:

The main reporting code is here: https://github.com/datadryad/dryad-repo/blob/dryad-master/dspace/modules/api/src/main/java/org/dspace/curate/DashStats.java, though it's pretty specific to the existing Dryad setup. It writes out a text file that is formatted for the counter-processor, but the file is sorted by dataset. A separate script then re-sorts everything by time: https://github.com/datadryad/dryad-utils/blob/master/dash-migration/sort_dash_stats.sh
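For anyone who wants to see the re-sorting idea without reading the shell script, here is a small Python sketch. It is not the Dryad implementation. It assumes each log line is tab-separated with an ISO 8601 timestamp in the first column, which should be checked against the real file format. The `sort_by_time` name and its parameters are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List


def sort_by_time(lines: Iterable[str],
                 time_column: int = 0,
                 sep: str = "\t") -> List[str]:
    """Return non-blank log lines ordered by their timestamp column.

    Assumes every timestamp parses with datetime.fromisoformat and that
    all timestamps are either timezone-aware or all naive.
    """
    def event_time(line: str) -> datetime:
        return datetime.fromisoformat(line.split(sep)[time_column])

    # sorted() is stable, so lines with equal timestamps keep their order
    return sorted((ln for ln in lines if ln.strip()), key=event_time)
```

For ten years of logs the whole file may not fit comfortably in memory, in which case sorting one month at a time (or using an external sort, as the shell script does) would be the safer route.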

Happy to set up time for you to talk with Ryan Scherle (Dryad) if that would be helpful. Otherwise, the DataCite and DataONE folks may also have some tips.

djbrooke commented 5 years ago

Thanks @dlowenberg! I'll check in with the team here and we'll get back with you if we feel a discussion with Dryad is needed. Thanks again!

P.S. I just pinged you on another issue in the main Dataverse repo: https://github.com/IQSS/dataverse/issues/5957

djbrooke commented 5 years ago

- We should do the things outlined in the original comment, plus any other items not yet identified.
- The current suggestion is to seed the count with the downloads that already exist, but we can discuss during the sprint.
- We should make a note for users about how the numbers are derived (some from before the standard was implemented and others from after).
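To make the seeding suggestion concrete, here is a tiny Python sketch of one way a displayed count could combine the two sources while keeping them distinguishable for the user-facing note. The `DownloadCount` class and its field names are hypothetical and do not reflect how Dataverse stores these numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadCount:
    legacy: int  # downloads recorded before MDC was enabled
    mdc: int     # downloads counted under the MDC standard

    @property
    def total(self) -> int:
        """Seeded total: pre-MDC downloads plus MDC-era downloads."""
        return self.legacy + self.mdc

    def describe(self) -> str:
        """Text for a note explaining how the number is derived."""
        return (f"{self.total} downloads "
                f"({self.legacy} before Make Data Count, {self.mdc} after)")
```

Keeping the two numbers separate until display time means the breakdown can still be shown or revised later if the team decides against seeding.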

djbrooke commented 5 years ago

I picked this up out of the sprint column today to begin stubbing out documentation regarding migrating counts and other things that installations will need to know to use Make Data Count in production, but I don't have the bandwidth this week. I will re-visit early next week.

pdurbin commented 5 years ago

@djbrooke if you're stubbing out documentation, you might want to create a branch for https://github.com/IQSS/dataverse/issues/6082 which was just opened. The issue title is "Documentation: Some tweaks to Make Data Count doc based on recent experience".

jggautier commented 1 year ago

See https://github.com/IQSS/dataverse.harvard.edu/issues/75#issuecomment-1263534870

cmbz commented 1 year ago

I moved this issue into the Global Backlog in the NIH Backlog column, as per conversations with @siacus and current AIM 5 Year 2 plans.

cmbz commented 1 year ago

Note that because of the quantity of backlog logs for HDV, the processing of these logs might be lengthy.

qqmyers commented 6 months ago

FYI: Be aware that, beyond the setup instructions in the guides, there is a one-line bug fix needed in Counter Processor to make MDC reporting to DataCite work. See https://github.com/IQSS/dataverse/issues/10334.

landreev commented 6 months ago

One random thought: one of the prod servers, dvn-cloud-rserv-1.lib.harvard.edu, is currently underutilized and could be a prime candidate for running that processor on the accumulated logs.

pdurbin commented 6 months ago

Heads up that Counter Processor was archived yesterday:

https://github.com/CDLUC3/counter-processor/pull/32

pdurbin commented 5 months ago

As discussed at standup, I forked the repo:

https://github.com/IQSS/dataverse/issues/10406

sbarbosadataverse commented 5 months ago

https://github.com/IQSS/dataverse-pm/issues/218

pdurbin commented 1 month ago

This was accidentally and automatically closed when https://github.com/IQSS/dataverse/pull/10424 was merged. Re-opening.