terrywbrady opened 7 months ago
The timeframe where we used a domain name of merritt-aws.cdlib.org is:
Start: 2021-07-19 16:21:38
End: 2023-09-10 04:16:39
There are 1.27M ARKs created in this timeframe.
I'll create a one-time update script that will be used to iterate through the list. We can then discuss with EZID folks on the update frequency. We do not want to create any issues for them by taxing their service.
I wonder what had us switch to merritt-aws in 2021.
I think this was when we migrated from the original Amazon Linux to Amazon Linux 2. We must have used the merritt-aws.cdlib.org CNAME to facilitate the migration.
When we go to Amazon Linux 2023 we should remember not to change the target for EZID.
I understand that the target URL is not an indexed field, but I am wondering if it would be possible to iterate over all records created between 2021-07-19 16:21:38 and 2023-09-10 04:16:39, perhaps in batches of 10K or 50K records, and conditionally replace merritt-aws.cdlib.org with merritt.cdlib.org.
While this would require some batch updating in SQL, would this proposed process be less taxing than submitting 1.3M API calls?
I'm not sure what query interfaces EZID presents to find a target through the API, though I see the target URL in the database in at least one table (ezidapp_identifier).
If nothing else, it seems like we could do a LIKE query to find these if it's not a readily available option from public interfaces.
I started looking at it, but stopped my query after it ran for about 10 seconds, out of caution. It doesn't seem like that field is full-text indexed in MySQL, which probably means it was doing a full table scan, and I don't want to monopolize the database.
It seems to me there are better solutions than submitting 1.3M API calls. If this is indeed the correct table, worst case, I could probably dump it to a local MySQL instance, full-text index the field, and then run the LIKE query, which I assume would run quickly then.
Jing may know better ways, since I'm not all that familiar with EZID APIs and internals at this point.
@sfisher , I have sometimes done things like this... iterating the start/end numbers with a script.
It can be tedious but it can prevent stressing out the database.
```sql
UPDATE x
SET foo = 'bar'
WHERE foo LIKE '%meow%'
  AND id >= 1001000 AND id < 1002000;
```
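A driver for that batched approach might look like the following sketch. The table, column, LIKE pattern, and database connection are hypothetical placeholders carried over from the example above, not an actual EZID schema:

```python
import time

def id_batches(start_id, end_id, step=1000):
    """Yield [lo, hi) id ranges covering start_id..end_id."""
    lo = start_id
    while lo < end_id:
        yield lo, min(lo + step, end_id)
        lo += step

def run_batched_update(conn, start_id, end_id, step=1000, pause=0.5):
    """Apply the UPDATE one id range at a time, pausing between batches
    so the database is never hit with one huge statement."""
    sql = ("UPDATE x SET foo = 'bar' "
           "WHERE foo LIKE '%meow%' AND id >= %s AND id < %s")
    with conn.cursor() as cur:
        for lo, hi in id_batches(start_id, end_id, step):
            cur.execute(sql, (lo, hi))
            conn.commit()
            time.sleep(pause)  # throttle between batches
```

The per-batch commit and sleep are what keep this from stressing the database; each statement touches only `step` rows.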
@terrywbrady I emailed you a CSV list that I extracted from a table. I'll double-check with Jing to be sure it's the correct place to look and that I didn't miss anything from that table or by limiting to the Merritt account or that string. I believe it is ok.
@sfisher - what was the count of entries you have with target like http://merritt-aws.cdlib.org? A recent dump counts 1,222,050.
Note that with the current connection to N2T, it is necessary to use the API to perform the target update; direct edits to the database would be risky. Unfortunately the update process would be pretty slow: with that number of entries and a script throttled to about 10/sec, you're looking at about 1.5 days.
@terrywbrady Thank you for providing the time frame for affected records. I assume these records are all under the Merritt user account, right? The target URL field is not indexed, so querying the database to get the IDs is a little bit tricky. As you suggested, we can try small batches based on the created time frame and user ID. Once we get the list of IDs, updating the target URLs still needs to go through the API to trigger updates to all related tables/fields.
Jing
@sfisher , @jsjiang , @datadavev , thanks for considering this. We'll plan to move forward with the API calls.
I presume we know the list of records that likely need a change (based on creation date on our end). Jing, these would all be from the Merritt account.
We'll likely schedule the work to batch up API calls after the Dryad data has been removed from Merritt. We'll plan to alert you all when we schedule the work.
@mreyescdl and @elopatin-uc3 , fyi.
@terrywbrady I've updated and emailed you a list in CSV.
There are 1,085,951 under the merritt account with merritt-aws.cdlib.org in the target.
We will not update or save UC ARKs in N2T after the upcoming N2T migration. How about we hold off on this batch update until after the N2T migration has completed, to avoid unnecessary network and processing activity?
I'm removing Dryad objects from the list provided by Scott. The Dryad ARKs will be tombstoned soon, so there is no need to update the target.
The scripts that modify the target in EZID are set up on the production batch host (uc3-mrtbatch01x2-prd.cdlib.org). If this host is upgraded to AL2023 before executing, then we must ensure that we copy the /dpr2/tools directory over to the new host.
The scripts are located under the dpr2 account at /dpr2/tools/ezid_target/ezid_target/production
I will start batch processing of 1M+ requests after production patching on 8/6. At a submission rate of 1 request every 2 seconds, it will take about 25 days to complete.
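A quick sanity check on that estimate, using the merritt-account count reported earlier in this thread:

```python
# Rough duration estimate for the throttled batch run.
requests = 1_085_951       # ARKs under the merritt account needing a new target
seconds_per_request = 2    # throttle: 1 request every 2 seconds
total_days = requests * seconds_per_request / 86_400
print(f"{total_days:.1f} days")  # roughly 25 days
```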
@elopatin-uc3 Existing EZID records that we are to modify are double-encoded in the ARK. To confirm, we want to NOT have the ARKs encoded?
IS _target: http://merritt-aws.cdlib.org/m/ark%253A%252F13030%252Fj22f7609
WANT _target: http://merritt.cdlib.org/m/ark:/13030/j22f7609
Talked about this during the team meeting and decided to go with single encoding (removing encoding doesn't work).
After testing, Merritt only supports single- or double-encoded ARKs in the access URL, so I'll provide single encoding to EZID.
Update: ARKs need to be doubly encoded. If single-encoded, EZID "swallows" the encoding and the target ends up unencoded. Double encoding works.
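For reference, the double encoding can be reproduced with two passes of percent-encoding; a small sketch, with the path prefix taken from the IS/WANT examples above:

```python
from urllib.parse import quote

def double_encode(ark: str) -> str:
    """Encode twice: ':' -> '%3A' -> '%253A', '/' -> '%2F' -> '%252F'."""
    return quote(quote(ark, safe=""), safe="")

# Produces the doubly encoded form needed in the target URL:
print("http://merritt.cdlib.org/m/" + double_encode("ark:/13030/j22f7609"))
# -> http://merritt.cdlib.org/m/ark%253A%252F13030%252Fj22f7609
```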
Started the processing
Stopped for patching. Resubmitted 63 objects that failed with Gateway errors.
Added a pause feature to script.
730K left to process
We need to switch the "target" of ARKs that are defined with a domain name of "merritt-aws.cdlib.org". The new target will be "merritt.cdlib.org".
Jing can provide us with a list of ARKs that have the wrong Target, but it will be up to us to make the API calls. The EZID API docs state that we can modify the Target as follows:
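A minimal sketch of such a call, based on the EZID API docs (an update is an HTTP POST of an ANVL-format body to /id/{identifier} with HTTP Basic auth); the credentials and the throttling value are placeholders, not confirmed settings:

```python
import base64
import time
import urllib.request

EZID_BASE = "https://ezid.cdlib.org"

def anvl_target(target: str) -> str:
    """Build the one-line ANVL body that sets _target."""
    return f"_target: {target}\n"

def set_target(identifier: str, target: str, user: str, password: str) -> str:
    """POST the ANVL body to /id/{identifier}; returns EZID's response text."""
    req = urllib.request.Request(
        f"{EZID_BASE}/id/{identifier}",
        data=anvl_target(target).encode("utf-8"),
        method="POST",
    )
    req.add_header("Content-Type", "text/plain; charset=UTF-8")
    creds = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {creds}")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Hypothetical throttled driver over the CSV list, one request every 2 seconds:
# for identifier, new_target in rows:
#     set_target(identifier, new_target, user, password)
#     time.sleep(2)
```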