CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository
8 stars 4 forks source link

Bulk correct n2t entries that resolve to merritt-aws.cdlib.org #1782

Open terrywbrady opened 7 months ago

mreyescdl commented 7 months ago

We need to switch the "target" of ARKs that are defined with a domain name of "merritt-aws.cdlib.org" The new target will be "merritt.cdlib.org"

Jing can provide us with a list of ARKs that have the wrong Target, but it will be up to us to make the API calls. The EZID API docs state that we can modify the Target as follows:

curl -u username:password -X POST -H 'Content-Type: text/plain'
  --data-binary '_target: url' https://ezid.cdlib.org/id/identifier
mreyescdl commented 7 months ago

The timeframe where we used a domain name of merritt-aws.cdlib.org is:

   Start: 2021-07-19 16:21:38 
   End: 2023-09-10 04:16:39

There are 1.27M arks created in this timeframe.

I'll create a one-time update script that will be used to iterate through the list. We can then discuss with EZID folks on the update frequency. We do not want to create any issues for them by taxing their service.

terrywbrady commented 7 months ago

I wonder what had us switch to merritt-aws in 2021.

mreyescdl commented 7 months ago

I think this was when we migrated from the original Amazon Linux to Amazon Linux 2. We must have used the merritt-aws.cdlib.org CNAME to facilitate the migration.

When we go to Amazon Linux 2023 we should remember not to change the target for EZID.

terrywbrady commented 7 months ago

I understand that the target url is not an indexed field, but I am wondering if it would be possible to iterate over all records created between 2021-07-19 16:21:38 and 2023-09-10 04:16:39 perhaps in batches of 10K or 50K records and conditionally replace merritt-aws.cdlib.org with merritt.cdlib.org.

While this would require some batch updating in SQL, would this proposed process be less taxing than submitting 1.3M API calls?

sfisher commented 7 months ago

I'm not sure what query interfaces EZID presents to find a target through the API, though I see the target url in the database in at least one table (ezidapp_identifier).

If nothing else, it seems like we could do a LIKE query to find these if it's not a readily available option from public interfaces.

I started looking at it, but stopped my query after it ran for about 10 seconds out of caution. It doesn't seem like that field is full-text indexed in MySQL which probably means it was doing a full scan and I don't want to monopolize the database.

It seems to me there are better solutions than submitting 1.3M api calls. If this is indeed the correct table, worst case, I could probably dump it to a local mysql, full-text index the field and then run the LIKE query which I assume could run quickly then.

Jing may know better ways since I'm not all the familiar with EZID APIs and internals at this point.

terrywbrady commented 7 months ago

@sfisher , I have sometimes done things like this... iterating the start/end numbers with a script.

It can be tedious but it can prevent stressing out the database.

update table x
set foo='bar'
where foo like '%meow%'
and id >= 1001000 and id < 1002000
sfisher commented 7 months ago

@terrywbrady I emailed you a CSV list that I extracted from a table. I'll double-check with Jing to be sure it's the correct place to look and that I didn't miss anything from that table or by limiting to the Merritt account or that string. I believe it is ok.

datadavev commented 7 months ago

@sfisher - what was the count of entries you have with target like http://merritt-aws.cdlib.org? A recent dump counts 1222050.

Note that with the current connection to n2t, it is necessary to use the API to perform the target update. Direct edits to the database would be risky. Unfortunately the update process would be pretty slow - with that number of entries and a script throttled to about 10/sec, you're looking at about 1.5 days.

jsjiang commented 7 months ago

@terrywbrady Thank you for providing the time frame for affected records. I assume these records are all under the Merritt user account, right? The target url field is not indexed so querying the database to get the IDs is a little bit tricky. As you suggested, we can try small batched based on created time frame and user id. Once we get the list of IDs, updating the target urls still need to go through API to trigger updates to all related tables/fields.

Jing

terrywbrady commented 7 months ago

@sfisher , @jsjiang , @datadavev , thanks for the considering this. We'll plan to move forward with the API calls.

I presume we know the list of records that likely need a change (based on creation date on our end). Jing, these would all be from the Merritt account.

We'll likely schedule the work to batch up API calls after the Dryad data has been removed from Merritt. We'll plan to alert you all when we schedule the work.

@mreyescdl and @elopatin-uc3 , fyi.

sfisher commented 7 months ago

@terrywbrady I've updated and emailed you a list in CSV.

There are 1,085,951 under the merritt account with merritt-aws.cdlib.org in the target.

jsjiang commented 7 months ago

We will not update or save UC ARKs in N2T after the upcoming N2T migration. How about we hold off this batch update to after the N2T migration has completed to avoid unnecessary network and processing activities.

mreyescdl commented 7 months ago

I'm removing Dryad objects from list provided by Scott. The Dryad ARKs will be tombstoned soon so there is no need to update Target.

mreyescdl commented 3 months ago

The scripts to modify the Target in EZID is set up on the production Batch host (uc3-mrtbatch01x2-prd.cdlib.org) If this host is upgraded to AL2023 before executing, then we must ensure that we copy over the /dpr2/tools directory to the new host.

The scripts are located under the dpr2 account at /dpr2/tools/ezid_target/ezid_target/production

mreyescdl commented 1 month ago

I will start batch processing of 1M+ requests after production patching on 8/6. At a submission rate of 1 request every 2 seconds, it will take about 25 days to complete

mreyescdl commented 3 weeks ago

@elopatin-uc3 Existing EZID records that we are to modify are double-encoded in the ark. To confirm, we want to NOT have the arks encoded??

IS _target: http://merritt-aws.cdlib.org/m/ark%253A%252F13030%252Fj22f7609

WANT _target: http://merritt.cdlib.org/m/ark:/13030/j22f7609

elopatin-uc3 commented 3 weeks ago

Talked about this during the team meeting and decided to go with single encoding (removing encoding doesn't work).

mreyescdl commented 3 weeks ago

After testing, Merritt only supports single or double encoded ARKs in access URL I'll provide single to EZID

mreyescdl commented 3 weeks ago

Update: ARKs needs to be doubly encoded. If single, then EZID "swallows" the encoding and the target is not encoded. Double encoded works

Started the processing

mreyescdl commented 4 days ago

Stopped for patching. Resubmitted 63 Gateway error objects

Added a pause feature to script.

730K left to process