NYPL-Simplified / server_core

Shared data model and utilities for Library Simplified server applications
7 stars 11 forks source link

Add migration script to replace insecure http://book-covers.nypl.org cover image URLs with secure https://covers.nypl.org URLs #1224

Closed leonardr closed 3 years ago

leonardr commented 3 years ago

This branch fixes https://jira.nypl.org/browse/SIMPLY-3349 with a migration script that replaces old book cover URLs obtained from the metadata wrangler with their new equivalents. This became more urgent as web-based catalogs started hitting circulation managers and throwing mixed-content warnings due to insecure book covers.

In a separate track, I'm running a data fixing script on the metadata wrangler itself so that it no longer serves the old URLs. It should finish running in a day or so.

To try this out, pick a circulation manager that you know has insecure book cover URLs and select a sample of those URLs:

select replace(url, 'http://book-covers.nypl.org', 'https://covers.nypl.org') from representations where url like 'http://book-covers.nypl.org%' limit 10

Verify that the replaced versions of the URLs work -- i.e. we successfully copied everything over from the old S3 bucket to the new one back in 2018.

After that I think it's just a matter of trying out the migration script and seeing if there's a SQL error. If you run the script a second time, it shouldn't do anything, because you've replaced all the insecure URLs.

leonardr commented 3 years ago

I've (finally) completed the metadata wrangler data fix. @tdilauro can you try out this migration script and see if it works for you?

tdilauro commented 3 years ago

select replace(url, 'http://book-covers.nypl.org', 'https://covers.nypl.org') from representations where url like 'http://book-covers.nypl.org%' limit 10

I did get an error from one of the sampled URLs, but otherwise fine:

The image “https://covers.nypl.org/Overdrive/Overdrive%20ID/f33d2829-351a-42b9-8a8a-740de549ef96/%257BF33D2829-351A-42B9-8A8A-740DE549EF96%257DImg100.jpg” cannot be displayed because it contains errors.
leonardr commented 3 years ago

https://covers.nypl.org/Overdrive/Overdrive%20ID/f33d2829-351a-42b9-8a8a-740de549ef96/%257BF33D2829-351A-42B9-8A8A-740DE549EF96%257DImg100.jpg gives a 403 error.

I've reverted the migration script until we can go through the files on covers.nypl.org and change their permissions. I'm hoping this is just a minor issue on a few of the older images, but at the moment I don't actually know of any easy way to do this operation.

tdilauro commented 3 years ago

I've (finally) completed the metadata wrangler data fix. @tdilauro can you try out this migration script and see if it works for you?

I ran into problems with the migration. We have a bunch of the new format URLs and there is apparently some overlap. This from a copy of one of our CMs:

SELECT count(*)
FROM representations old
JOIN representations new ON new.url = replace(old.url, 'http://book-covers.nypl.org', 'https://covers.nypl.org')
WHERE old.url LIKE 'http://book-covers.nypl.org%' 
  AND RIGHT(old.url, 50) = RIGHT(new.url, 50)

10619

Here is the output from the migration itself:

./core/bin/migrate_database    
/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/OpenSSL/crypto.py:12: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
  from cryptography import x509
2 new migrations found.
  - 20210121-self-hosted-is-required.sql
  - 20210201-no-more-http-image-urls.sql
Database Migration Timestamp stamped at 2021-01-21 for 20210121-self-hosted-is-required.sql

ERROR: Migration has been halted.
/Users/Shared/src/lyrasis/circulation/core/migration/20210201-no-more-http-image-urls.sql must be migrated manually.
==================================================
Traceback (most recent call last):
  File "/Users/Shared/src/lyrasis/circulation/core/scripts.py", line 2419, in run_migrations
    self._run_migration(full_migration_path, timestamp)
  File "/Users/Shared/src/lyrasis/circulation/core/scripts.py", line 2459, in _run_migration
    self._db.execute(sql)
  File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1292, in execute
    clause, params or {}
  File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
    return meth(self, multiparams, params)
  File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1130, in _execute_clauseelement
    distilled_params,
  File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1317, in _execute_context
    e, statement, parameters, cursor, context
  File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
    cursor, statement, parameters, context
  File "/Users/timmo/.PyEnvs/circulation-nQEl9FH5/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
    cursor.execute(statement, parameters)
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "representations_url_media_type_key"
DETAIL:  Key (url, media_type)=(https://covers.nypl.org/Overdrive/Overdrive%20ID/600c62ce-b04b-498c-b060-7bc2e1713efc/%257B600C62CE-B04B-498C-B060-7BC2E1713EFC%257DImg100.jpg, image/jpeg) already exists.

[SQL: BEGIN;
-- If there are any book covers with the old, insecure URLs, we need
-- to regenerate the cached OPDS entry for every book. This will take
-- more time than running a query to locate the books that most likely
-- need to have their entries regenerated, but it will happen in the
-- background, later on, not while the site is down for a software upgrade.
--
-- The only optimization is that if there are _no_ book covers with
-- these insecure URLs, we don't need to regenerate OPDS entries at
-- all. This should be the case for most recently installed circulation
-- managers.
delete from workcoveragerecords where operation='generate-opds' and exists(select id from representations where url like 'http://book-covers.nypl.org%%' limit 1);

-- Update the underlying representations used to build the OPDS entries,
-- changing insecure URLs to secure URLs.
--
-- Again, if there are no book covers with insecure URLs, don't bother.
update representations set url=replace(url, 'http://book-covers.nypl.org', 'https://covers.nypl.org') where exists(select id from representations where url like 'http://book-covers.nypl.org%%' limit 1) and id in (select id from representations where url like 'http://book-covers.nypl.org%%');

COMMIT;]
(Background on this error at: http://sqlalche.me/e/13/gkpj)
None