alphagov / asset-manager

Manages uploaded assets (images, PDFs etc.) for applications on GOV.UK
https://docs.publishing.service.gov.uk/apps/asset-manager.html
MIT License

Investigate discrepancy between assets stored in S3 vs NFS #301

Closed chrisroos closed 6 years ago

chrisroos commented 7 years ago

I noticed some discrepancies in the number of assets while checking whether the overnight sync of assets worked (as part of issue #145).

We should work out why the number of files on S3 doesn't match the number of files on NFS.

Assets on S3

# List bucket content to file
$ aws s3 ls s3://govuk-assets-integration/ > integration-assets.txt

# Count of the number of objects in the bucket
$ wc -l integration-assets.txt 
   64989 integration-assets.txt

Assets in the database on integration

irb> Asset.count
=> 63567
irb> Asset.unscoped.count
=> 65431

Assets on NFS in integration

$ find /data/uploads/asset-manager/assets/ -type f | wc -l
65431

floehopper commented 6 years ago

I've just generated more up-to-date figures as follows:

$ aws s3 ls s3://govuk-assets-integration > integration-assets.txt

$ wc -l integration-assets.txt
   65087 integration-assets.txt

irb> Asset.count
=> 63649
irb> Asset.unscoped.count
=> 65529
irb> Asset.where(:deleted_at.ne => nil).count
=> 1880

$ find /data/uploads/asset-manager/assets/ -type f | wc -l
65527

floehopper commented 6 years ago

I believe I have identified two reasons for the discrepancies. Here are some quick notes:

  1. There are approx 1579 assets in the database which have no S3 object matching their uuid. All of these assets (and a couple of hundred others) are marked as deleted. We believe these were already marked as deleted when we did the initial upload to S3 and therefore were not uploaded. We should upload the files to S3 for these deleted assets.

  2. There are approx 1137 objects on S3 whose key does not match the uuid of any asset in the database. All of these S3 objects have a key of length 24 (e.g. 58cba21ee5274a16e8000030) rather than 36 (e.g. 00172d97-73b1-42dc-8c3e-7b90083f497b). We believe the 24-character keys are a remnant from when we used the database ID, rather than a separate UUID, as the S3 key. We think they can safely be deleted.
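
For anyone wanting to repeat the comparison, here is a rough sketch of how the two sets could be derived. It is not the exact script used for the figures above. The helper names (`keys_from_listing`, `classify_key`, `compare`) are made up for illustration, and it assumes the S3 key is the last column of each `aws s3 ls` line and that the database uuids are available as a plain array (e.g. from something like `Asset.unscoped.pluck(:uuid)`, which is an assumption about the model).

```ruby
require 'set'

# Key shapes seen in the bucket: 36-character UUIDs and 24-character
# hex strings (the length of a database ID).
UUID_KEY      = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/
LEGACY_ID_KEY = /\A\h{24}\z/

# Hypothetical helper: extract object keys from `aws s3 ls` output lines.
# Assumes each object line is "<date> <time> <size> <key>"; lines with
# fewer than four fields (e.g. "PRE foo/") are skipped.
def keys_from_listing(lines)
  lines.map { |line| line.chomp.split(' ', 4)[3] }.compact
end

# Hypothetical helper: label a key by its shape.
def classify_key(key)
  case key
  when UUID_KEY      then :uuid
  when LEGACY_ID_KEY then :legacy_id
  else :unknown
  end
end

# Compare the S3 keys with the uuids stored in the database.
def compare(s3_keys, db_uuids)
  s3 = s3_keys.to_set
  db = db_uuids.to_set
  orphans = (s3 - db).to_a.sort
  {
    missing_from_s3: (db - s3).to_a.sort,  # in the db, no S3 object
    orphaned_on_s3:  orphans,              # on S3, no matching asset
    orphans_by_kind: orphans.group_by { |key| classify_key(key) }
  }
end
```

Usage would be along the lines of `compare(keys_from_listing(File.readlines('integration-assets.txt')), uuids)`, where `uuids` is the array exported from the database. If the explanation above is right, every entry in `orphans_by_kind` should fall under `:legacy_id`.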

floehopper commented 6 years ago

The above seems to add up, because 65087 (no. of S3 objects) + 1579 - 1137 = 65529 (no. of assets in db).
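
As a quick sanity check, the same arithmetic can be run in irb. This is just a sketch using the figures quoted in this thread; `reconciled_count` is a made-up helper name.

```ruby
# Figures taken from the comments above.
S3_OBJECTS      = 65_087  # lines in integration-assets.txt
MISSING_FROM_S3 = 1_579   # assets in the db with no S3 object
ORPHANED_ON_S3  = 1_137   # S3 objects with no matching asset
ASSETS_IN_DB    = 65_529  # Asset.unscoped.count
LIVE_ASSETS     = 63_649  # Asset.count
DELETED_ASSETS  = 1_880   # Asset.where(:deleted_at.ne => nil).count

# Hypothetical helper: the db count we would expect if the two
# explanations account for the whole difference.
def reconciled_count(s3_objects, missing_from_s3, orphaned_on_s3)
  s3_objects + missing_from_s3 - orphaned_on_s3
end

expected = reconciled_count(S3_OBJECTS, MISSING_FROM_S3, ORPHANED_ON_S3)
puts "S3-based estimate: #{expected}, db: #{ASSETS_IN_DB}, match: #{expected == ASSETS_IN_DB}"
puts "live + deleted:    #{LIVE_ASSETS + DELETED_ASSETS}, match: #{LIVE_ASSETS + DELETED_ASSETS == ASSETS_IN_DB}"
```

Both lines should report a match, which is consistent with the two explanations covering the whole gap between S3 and the database.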

So I've created the following issues to fix these problems:

Given the above, I'm now happy to close this issue.