NYPL-Simplified / server_core

Shared data model and utilities for Library Simplified server applications

Make the OPDS importer update acquisition links #1238

Closed vbessonov closed 3 years ago

vbessonov commented 3 years ago

Description

This PR fixes a bug where the OPDS importer did not update acquisition links.

I don't think that adding a LicensePool.reset_open_access_download_url method is a good solution; it might be better to create a setter for the LicensePool.best_open_access_link property.
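The trade-off between the two approaches might be sketched roughly as follows. This is a hypothetical illustration using plain Python objects, not the actual server_core models; only the names reset_open_access_download_url and best_open_access_link come from the discussion above.

```python
class LicensePool:
    """Minimal stand-in for the real SQLAlchemy model (illustration only)."""

    def __init__(self):
        self.open_access_download_url = None

    # Approach used in this PR: an explicit reset method.
    def reset_open_access_download_url(self):
        self.open_access_download_url = None

    # Alternative mentioned above: expose the cached URL through a
    # property, so callers can assign or clear it directly.
    @property
    def best_open_access_link(self):
        return self.open_access_download_url

    @best_open_access_link.setter
    def best_open_access_link(self, url):
        self.open_access_download_url = url


pool = LicensePool()
pool.best_open_access_link = "http://example.com/book.epub"
pool.reset_open_access_download_url()
assert pool.best_open_access_link is None
```

With the setter, importer code could simply assign `pool.best_open_access_link = None` (or a new URL) instead of calling a dedicated reset method.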

Motivation and Context

Steps to reproduce:

  1. Import an OPDS feed using the OPDS importer.
  2. Update acquisition links.
  3. Reimport the feed.
  4. Notice that the links were not updated.

How Has This Been Tested?

@tdilauro, I used the following steps for testing:

  1. Restore the database dump:
    pg_restore -d simplified_circulation_dump local-db-simplified_circ_db.dump
  2. Create a mock service using SoapUI, served at http://localhost:8080/test, which returns the same feed except for the Album de grilles en fer creux et fer massif book, where:
    • I updated all acquisition links by appending a 1234567890 suffix to their ends.
    • I set <updated> tag's value to the current date and time in UTC.
  3. Update the URL of the OPDS collection:
    update collections set external_account_id = 'http://localhost:8080/test' where id = 5;
  4. Run the opds_import_monitor script to reimport the updated book. Please note that the whole page containing Album de grilles en fer creux et fer massif is going to be reimported (this is by design).
  5. Recreate the search index. Please note that the actions described below are required only because the local search index doesn't have information about all the books from the dump, which is not the case in production.
    • Drop the current index:
      curl --location --request DELETE 'localhost:9200/_all'
    • Truncate all the work coverage records:
      truncate workcoveragerecords;
    • Refresh the search index by running search_index_refresh.
    • Drop the cached feeds:
      truncate cachedfeeds;
  6. Open the admin UI. Because I don't know the admin's password, I had to reset it by running the following query, which sets the admin's password to servers@lyrasistechnology.org:
    update admins set password_hashed = '$2a$12$sdOOruVaDzgVqVF6K4zp5.dyYtcykzIlGpZEA.m8RL3nJHaMPTswu' where id = 19;
  7. Create a basic authentication provider. It's required for the CM's feed to work correctly; otherwise you won't be able to check out books.
  8. Find the Album de grilles en fer creux et fer massif book and check it out.
  9. Try to download the book and verify, in the logs or using the debugger, that the fulfillment link actually ends with the 1234567890 suffix.

Also, I just found out that the change I made doesn't fully fix the issue related to cover links not being updated. In the case of this particular feed, they remain the same. You can confirm this by looking at the loans feed at http://localhost:6500/190150/loans after executing step 8.

It looks like there are a couple more changes required to fix the cover links issue:

  1. Remove the Resource records corresponding to the Hyperlink records deleted in Metadata.apply. This can be implemented by adding Hyperlink.delete and Resource.delete methods, in the same way as LicensePoolDeliveryMechanism.delete.
  2. Erase Edition.cover_full_url and Edition.cover_thumbnail_url in Metadata.apply if ReplacementPolicy.links is True.
  3. Reinitialize Edition.cover_full_url and Edition.cover_thumbnail_url using the "new" links specified in the Metadata object.

Unfortunately, I failed when I tried to implement all those steps, and I think they can be implemented in a separate PR once this one is merged.
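The three steps above could be sketched roughly like this. This is a hypothetical illustration with plain Python objects standing in for the real SQLAlchemy models; the helper apply_links and the simplified Edition/Hyperlink/Resource classes are assumptions for demonstration, not existing server_core API.

```python
class Resource:
    """Stand-in for the Resource model (illustration only)."""
    def __init__(self, url):
        self.url = url

class Hyperlink:
    """Stand-in for the Hyperlink model (illustration only)."""
    def __init__(self, rel, resource):
        self.rel = rel
        self.resource = resource

class Edition:
    """Stand-in for the Edition model (illustration only)."""
    def __init__(self):
        self.cover_full_url = None
        self.cover_thumbnail_url = None
        self.hyperlinks = []

def apply_links(edition, new_links, replace_links):
    """Mirror the three proposed steps when replace_links is True."""
    if replace_links:
        # Step 1: drop the old Hyperlink records (and, via cascade in the
        # real code, their Resource records).
        edition.hyperlinks = []
        # Step 2: erase the cached cover URLs.
        edition.cover_full_url = None
        edition.cover_thumbnail_url = None
    # Step 3: reinitialize from the "new" links in the Metadata object.
    for rel, url in new_links:
        edition.hyperlinks.append(Hyperlink(rel, Resource(url)))
        if rel == "http://opds-spec.org/image":
            edition.cover_full_url = url
        elif rel == "http://opds-spec.org/image/thumbnail":
            edition.cover_thumbnail_url = url
```

For example, reimporting a feed whose cover link changed would, under this sketch, replace both the Hyperlink records and the cached `cover_full_url` on the edition.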

Checklist:

tdilauro commented 3 years ago

I verified this roughly based on the steps above. Got a redirect to the expected resource, depending on what I had in the acquisition link.

I can also confirm that the cover links are NOT updating. Given your comments above and the fact that you changed the title of this PR, I suspect that you are changing the scope of this PR. If so, it would be appropriate to change the scope in the description above, as well.

vbessonov commented 3 years ago

@tdilauro, I updated the description. If required, I can also create a new branch to avoid having cover-links in the branch name.

tdilauro commented 3 years ago

> @tdilauro, I updated the description. If required, I can also create a new branch to avoid having cover-links in the branch name.

It would probably be good to get "covers" out of the branch name (for anyone perusing the log). However, I think pushing a new branch name effectively deletes the current branch and adds a new one, which would close this PR. @leonardr or another admin for this repo might be able to rename this branch directly in GH, though.

That said, there's not much discussion to transfer to another PR, so I don't think it would be terrible to supersede this PR.

vbessonov commented 3 years ago

@tdilauro, I created a SQL script for fixing this issue by updating the database:

-- Delete all the hyperlinks related to the specific data source.
-- This forces CM to use the newly imported links.
delete
from
  hyperlinks as h
using
  resources as r
where
  h.resource_id = r.id
  and r.data_source_id = 47
;

-- Wipe out fulfillment info because the URL of the acquisition link has changed and it has to be updated.
update loans as l
set
  fulfillment_id = null
from licensepools as lp
where
  l.license_pool_id = lp.id
  and lp.data_source_id = 47
;

-- Delete fulfillment objects because the URL of the acquisition link has
-- changed and they have to be updated.
delete from licensepooldeliveries
where
  data_source_id = 47
;

-- Reset open_access_download_url because the URL of the acquisition link has changed and it has to be updated.
update licensepools
set
  open_access_download_url = null
where
  data_source_id = 47
;

-- Reset cover URLs to force them to be updated during the next import.
update editions
set
  cover_full_url = null,
  cover_thumbnail_url = null
where
  data_source_id = 47
;

-- Make sure that the new cover and acquisition links have been updated.
select
  i.id,
  h.data_source_id,
  h.rel,
  r.url
from identifiers as i
inner join hyperlinks as h on i.id = h.identifier_id
inner join resources as r on h.resource_id = r.id
where
  i.identifier = 'urn:x-internet-archive:ebooks-app:item:albumdegrillesen00gand'
;

After you run the script, you can reimport the feed by running opds_import_monitor with the --force flag.

@leonardr, do you think it's a valid approach?