Closed alawvt closed 3 years ago
Command to add to crontabs:
update metadatavalue set text_lang='en' where text_lang IN ('', 'en_US', 'en US', '*') OR text_lang IS NULL;
Test:
Before SQL command
Check count of items in metadatavalue table
SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id
Result: count: 2,343,963
Check counts for values of text_lang, ex. 'en'
SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id
where mv.text_lang='en';
Check for unique values of text_lang
SELECT distinct text_lang
FROM "metadatavalue"
Result: text_lang [count]
NULL [310,527]
'' (empty string) [421,582]
en [371927]
es [56]
* [9,053]
en_US [1,229,990]
en US [812]
de [1]
fr [5]
(There were a few other oddball text_lang values that I fixed manually.)
Run the SQL update (took about 10 minutes in LDE)
update metadatavalue set text_lang='en' where text_lang IN ('', 'en_US', 'en US', '*') OR text_lang IS NULL;
UPDATE 3362880
After SQL command
in metadatavalue table:
SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id
[2,343,963 rows]
SELECT distinct text_lang
FROM "metadatavalue"
text_lang [count]
en [2,343,891 rows]
Check for other values of text_lang
NULL [0 rows]
'' (empty string) [0 rows]
(And the few other oddball text_lang values that I fixed manually.)
@keithgee, this seems to work ok in the LDE. I'd appreciate it if you and @soumikgh could develop a way to add it to the daily crontab. I've submitted this as Request RITM0051332.
Anne, thanks. I have a few questions.
It looks like the command will update the language code to "en" even for metadata that's in other languages if the language code wasn't previously set in the metadata. I don't know if we have any such items, but if we do, is that an acceptable side effect?
I think the command will also set language codes for values of metadata fields where the language isn't relevant, for example, fields that contain dates. Is this intended?
It looks like when you tested it, it took about ten minutes to run. Are you sure this something that you want to run every day?
@keithgee, than you very much for reviewing this. Yes, this will change the metadata language code to en
for everything not explicitly another language code, even dates and URLs where language is not relevant. We have decided on this to simply our metadata and enable the DSpace REST Query Client.
As we discussed, please time the query for the first run. Then change the metadata for some items and time how long the query takes susequently.
@alawvt, you're correct that it runs much faster on subsequent runs. Here is a timing for the first run from a VM on my new-ish desktop with SSD:
UPDATE 3362880
real 2m8.454s user 0m0.021s sys 0m0.005s
Here is the timing on the second run:
UPDATE 2
real 0m1.381s user 0m0.024s sys 0m0.001s
I really thought it would still be slow after the first run!
@keithgee, thank you for testing this and the good news.
I've created a merge request in the vtlibans_dspace project with changes to deploy a script that gets the database configuration from DSpace and executes the SQL, and put instructions for testing in the merge details. An alternative approach is the (pg_cron tool)[https://github.com/citusdata/pg_cron], but I think the script approach I used may work better for our current deployment practices.
@alawvt Can you or @kdweeks test, and can we close this issue and track from vtlibans_dspace from here on?
Continued in Add metadata language code updater.
Continues #476
In the metadata table, create a SQL query or function to change text_lang from NULL OR '' OR 'en_US' to 'en'.