Closed leonardr closed 3 years ago
I've (finally) completed the metadata wrangler data fix. @tdilauro can you try out this migration script and see if it works for you?
select replace(url, 'http://book-covers.nypl.org', 'https://covers.nypl.org') from representations where url like 'http://book-covers.nypl.org%' limit 10
I did get an error from one of the sampled URLs, but otherwise fine:
The image “https://covers.nypl.org/Overdrive/Overdrive%20ID/f33d2829-351a-42b9-8a8a-740de549ef96/%257BF33D2829-351A-42B9-8A8A-740DE549EF96%257DImg100.jpg” cannot be displayed because it contains errors.
https://covers.nypl.org/Overdrive/Overdrive%20ID/f33d2829-351a-42b9-8a8a-740de549ef96/%257BF33D2829-351A-42B9-8A8A-740DE549EF96%257DImg100.jpg gives a 403 error.
I've reverted the migration script until we can go through the files on covers.nypl.org and change their permissions. I'm hoping this is just a minor issue on a few of the older images, but at the moment I don't actually know of any easy way to do this operation.
I've (finally) completed the metadata wrangler data fix. @tdilauro can you try out this migration script and see if it works for you?
I ran into problems with the migration. We have a bunch of the new format URLs and there is apparently some overlap. This from a copy of one of our CMs:
SELECT count(*)
FROM representations old
JOIN representations new ON new.url = replace(old.url, 'http://book-covers.nypl.org', 'https://covers.nypl.org')
WHERE old.url LIKE 'http://book-covers.nypl.org%'
AND RIGHT(old.url, 50) = RIGHT(new.url, 50)
10619
Here is the output from the migration itself:
./core/bin/migrate_database
/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/OpenSSL/crypto.py:12: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
from cryptography import x509
2 new migrations found.
- 20210121-self-hosted-is-required.sql
- 20210201-no-more-http-image-urls.sql
Database Migration Timestamp stamped at 2021-01-21 for 20210121-self-hosted-is-required.sql
ERROR: Migration has been halted.
/Users/Shared/src/lyrasis/circulation/core/migration/20210201-no-more-http-image-urls.sql must be migrated manually.
==================================================
Traceback (most recent call last):
File "/Users/Shared/src/lyrasis/circulation/core/scripts.py", line 2419, in run_migrations
self._run_migration(full_migration_path, timestamp)
File "/Users/Shared/src/lyrasis/circulation/core/scripts.py", line 2459, in _run_migration
self._db.execute(sql)
File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1292, in execute
clause, params or {}
File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
return meth(self, multiparams, params)
File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1130, in _execute_clauseelement
distilled_params,
File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
e, statement, parameters, cursor, context
File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
cursor, statement, parameters, context
File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
cursor.execute(statement, parameters)
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "representations_url_media_type_key"
DETAIL: Key (url, media_type)=(https://covers.nypl.org/Overdrive/Overdrive%20ID/600c62ce-b04b-498c-b060-7bc2e1713efc/%257B600C62CE-B04B-498C-B060-7BC2E1713EFC%257DImg100.jpg, image/jpeg) already exists.
[SQL: BEGIN;
-- If there are any book covers with the old, insecure URLs, we need
-- to regenerate the cached OPDS entry for every book. This will take
-- more time than running a query to locate the books that most likely
-- need to have their entries regenerated, but it will happen in the
-- background, later on, not while the site is down for a software upgrade.
--
-- The only optimization is that if there are _no_ book covers with
-- these insecure URLs, we don't need to regenerate OPDS entries at
-- all. This should be the case for most recently installed circulation
-- managers.
delete from workcoveragerecords where operation='generate-opds' and exists(select id from representations where url like 'http://book-covers.nypl.org%%' limit 1);
-- Update the underlying representations used to build the OPDS entries,
-- changing insecure URLs to secure URLs.
--
-- Again, if there are no book covers with insecure URLs, don't bother.
update representations set url=replace(url, 'http://book-covers.nypl.org', 'https://covers.nypl.org') where exists(select id from representations where url like 'http://book-covers.nypl.org%%' limit 1) and id in (select id from representations where url like 'http://book-covers.nypl.org%%');
COMMIT;]
(Background on this error at: http://sqlalche.me/e/13/gkpj)
None
This branch fixes https://jira.nypl.org/browse/SIMPLY-3349 with a migration script that replaces old book cover URLs obtained from the metadata wrangler with their new equivalents. This became more urgent as web-based catalogs started hitting circulation managers and throwing mixed-content warnings due to insecure book covers.
In a separate track, I'm running a data fixing script on the metadata wrangler itself so that it no longer serves the old URLs. It should finish running in a day or so.
To try this out, pick a circulation manager that you know has insecure book cover URLs and select a sample of those URLs:
select replace(url, 'http://book-covers.nypl.org', 'https://covers.nypl.org') from representations where url like 'http://book-covers.nypl.org%' limit 10
Verify that the replaced versions of the URLs work -- i.e. we successfully copied everything over from the old S3 bucket to the new one back in 2018.
After that I think it's just a matter of trying out the migration script and seeing if there's a SQL error. If you run the script a second time, it shouldn't do anything, because you've replaced all the insecure URLs.