VTUL / vtechworks

DSpace at Virginia Tech
http://vtechworks.lib.vt.edu
Other
6 stars 8 forks source link

Create a SQL query or function to change text_lang to 'en' #737

Closed alawvt closed 3 years ago

alawvt commented 3 years ago

Continues #476

In the metadata table, create a SQL query or function to change text_lang from NULL OR '' OR 'en_US' to 'en'.

alawvt commented 3 years ago

Command to add to crontabs:

update metadatavalue set text_lang='en' where text_lang IN ('', 'en_US', 'en US', '*') OR text_lang IS NULL;

Test:

alawvt commented 3 years ago

Before SQL command

Check count of items in metadatavalue table

SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id

Result: count: 2,343,963

Check counts for values of text_lang, ex. 'en'

SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id
where mv.text_lang='en';

Check for unique values of text_lang

SELECT distinct text_lang
FROM "metadatavalue"

Result: text_lang [count]

NULL    [310,527]
'' (empty string) [421,582]
en    [371927]
es    [56]
*    [9,053]
en_US    [1,229,990]
en US    [812]
de    [1]
fr    [5]

(There were a few other oddball text_lang values that I fixed manually.)

Run the SQL update (took about 10 minutes in LDE)

update metadatavalue set text_lang='en' where text_lang IN ('', 'en_US', 'en US', '*') OR text_lang IS NULL;

UPDATE 3362880

After SQL command

in metadatavalue table:

SELECT count(*)
FROM item
INNER JOIN metadatavalue as mv on item.uuid=mv.dspace_object_id

[2,343,963 rows]
SELECT distinct text_lang
FROM "metadatavalue"

text_lang [count]

en    [2,343,891 rows]

Check for other values of text_lang

NULL    [0 rows]
'' (empty string) [0 rows]
(And the few other oddball text_lang values that I fixed manually.)
alawvt commented 3 years ago

@keithgee, this seems to work ok in the LDE. I'd appreciate it if you and @soumikgh could develop a way to add it to the daily crontab. I've submitted this as Request RITM0051332.

keithgee commented 3 years ago

Anne, thanks. I have a few questions.

alawvt commented 3 years ago

@keithgee, than you very much for reviewing this. Yes, this will change the metadata language code to en for everything not explicitly another language code, even dates and URLs where language is not relevant. We have decided on this to simply our metadata and enable the DSpace REST Query Client.

As we discussed, please time the query for the first run. Then change the metadata for some items and time how long the query takes susequently.

keithgee commented 3 years ago

@alawvt, you're correct that it runs much faster on subsequent runs. Here is a timing for the first run from a VM on my new-ish desktop with SSD:

UPDATE 3362880

real 2m8.454s user 0m0.021s sys 0m0.005s

Here is the timing on the second run:

UPDATE 2

real 0m1.381s user 0m0.024s sys 0m0.001s

I really thought it would still be slow after the first run!

alawvt commented 3 years ago

@keithgee, thank you for testing this and the good news.

keithgee commented 3 years ago

I've created a merge request in the vtlibans_dspace project with changes to deploy a script that gets the database configuration from DSpace and executes the SQL, and put instructions for testing in the merge details. An alternative approach is the (pg_cron tool)[https://github.com/citusdata/pg_cron], but I think the script approach I used may work better for our current deployment practices.

@alawvt Can you or @kdweeks test, and can we close this issue and track from vtlibans_dspace from here on?

alawvt commented 3 years ago

Continued in Add metadata language code updater.