GSA / data.gov

Main repository for the data.gov service
https://data.gov

Migrate WordPress assets from FCS S3 #3541

Closed. adborden closed this issue 2 years ago.

adborden commented 3 years ago

User Story

In order to prevent images and other static assets from breaking once FCS resources are deleted, the data.gov team wants the WordPress S3 assets migrated out of FCS before the buckets are deleted.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

If assets are required to be stored in S3, then the S3 bucket should be provisioned through cloud.gov, which satisfies our compliance requirements.
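
If that route is taken, provisioning through the cloud.gov broker is roughly the following; the service plan and instance names here are assumptions for illustration, not values from this issue:

```bash
# Provision a compliant S3 bucket via the cloud.gov broker
# (service plan and instance name are placeholders).
cf create-service s3 basic datagov-wordpress-assets

# Create and display a service key so tooling can get bucket credentials.
cf create-service-key datagov-wordpress-assets s3-migration
cf service-key datagov-wordpress-assets s3-migration
```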

Sketch

[Notes or a checklist reflecting our understanding of the selected approach]

adborden commented 3 years ago

Currently in the production bucket under `/datagov/wordpress`: Total Objects: 7,407; Total Size: 856.6 MiB.
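
For reference, a summary like that can be pulled with the AWS CLI; a minimal sketch, assuming the FCS production bucket name is in `$BUCKET_NAME`:

```bash
# Summarize object count and total size under the WordPress prefix;
# the last two lines are the "Total Objects" / "Total Size" summary.
aws s3 ls "s3://${BUCKET_NAME}/datagov/wordpress/" \
  --recursive --summarize --human-readable | tail -n 2
```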

jbrown-xentity commented 3 years ago

We should make sure any usage of the current bucket in the static site is migrated to the new s3 bucket...

adborden commented 3 years ago

Chatted with @robert-bryson about this. Ideally we'd keep the assets in the repo, because it's not easy for editors to upload directly to an S3 bucket in cloud.gov. We're going to run some build tests to make sure that adding 856 MB won't slow down the build too much.

OR we can keep only the old assets in S3 and newer assets in the repo.
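
One way to run that build test, sketched here with placeholder paths (not necessarily the exact commands the team ran):

```bash
# Rough check of how the extra assets affect repo size and build time
# (paths are placeholders; assumes the assets sit under datagov/wordpress/).
du -sh datagov/wordpress/        # size of the copied assets on disk
git count-objects -vH            # packed repository size after committing them
time bundle exec jekyll build    # compare wall time with and without the assets
```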

adborden commented 3 years ago

Groomed, thanks @jbrown-xentity

mogul commented 2 years ago

I moved this to In Progress because Aaron started copying out the entire FCS S3 bucket, and will soon have the WordPress assets available in another bucket accessible to the static site. At that point I think we would just need to search and replace the bucket name in the crawled pages to be done here... Is that right? (Leaving "should we get rid of the bucket and just include the assets in the static site" for another time.)

FuhuXia commented 2 years ago

Production S3 copy finished. It took 24 hours:

(venv) ubuntu@wordpressweb1p (production) ~/datagov-s3-migrate$ time python migrate.py --use-ec2
...
real    1444m43.290s
user    128m45.062s
sys     70m9.456s
robert-bryson commented 2 years ago

Using creds from `cf service-key fcs-lifeboat s3-migration`, I copied the files to a local branch in GSA/datagov-website with `aws s3 cp s3://${BUCKET_NAME}/datagov/wordpress/ . --recursive`.
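
Spelled out, that workflow is roughly the following; the JSON field names in the service key are assumptions about the cloud.gov S3 broker's output:

```bash
# Extract S3 credentials from the cloud.gov service key. The first output line
# is a status header; the JSON field names below are assumptions.
CREDS="$(cf service-key fcs-lifeboat s3-migration | tail -n +2)"
export AWS_ACCESS_KEY_ID="$(jq -r .access_key_id <<< "$CREDS")"
export AWS_SECRET_ACCESS_KEY="$(jq -r .secret_access_key <<< "$CREDS")"
export AWS_DEFAULT_REGION="$(jq -r .region <<< "$CREDS")"
BUCKET_NAME="$(jq -r .bucket <<< "$CREDS")"

# Copy the WordPress assets into the local checkout of GSA/datagov-website.
aws s3 cp "s3://${BUCKET_NAME}/datagov/wordpress/" . --recursive
```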

The files all together are only about 1 GB, but 217 MB of that is one large MOV file, which throws an error:

remote: error: File www.data.gov/datagov/wordpress/2016/09/Scott_Smith_Message_Open_Data_Innovation_Summit.mov is 217.06 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.

This will have to be addressed, but for now, as a proof of concept, I just deleted it.
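
If the Git LFS route suggested in GitHub's error message is taken later, the setup would look roughly like this (a sketch, not what was done here; the file path is the one reported in the error above):

```bash
# Track the oversized video with Git LFS instead of committing it directly.
git lfs install
git lfs track "*.mov"
git add .gitattributes \
  www.data.gov/datagov/wordpress/2016/09/Scott_Smith_Message_Open_Data_Innovation_Summit.mov
git commit -m "Store large MOV asset via Git LFS"
```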

I also changed references and links to the old S3 bucket to point to the files now in `datagov/wordpress`, with the hope that the Federalist build will work well with those links. Pushed to trigger a test build.
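
A hedged sketch of that kind of search-and-replace; the old bucket URL here is a placeholder, not the real FCS hostname:

```bash
# Rewrite references to the old FCS S3 bucket so they point at the in-repo
# copies under /datagov/wordpress/ (OLD_BUCKET_URL is a placeholder).
OLD_BUCKET_URL="https://s3.amazonaws.com/old-fcs-bucket"
grep -rl "$OLD_BUCKET_URL" www.data.gov/ \
  | xargs -r sed -i "s|${OLD_BUCKET_URL}/datagov/wordpress|/datagov/wordpress|g"
```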


Fake edit: GitHub sure is thinking a lot about this push, but that might be a tomorrow problem.

Real edit: the push worked now, for some reason.

FuhuXia commented 2 years ago

@robert-bryson I checked the inventory files on fcs-lifeboat. The file count is much lower than the count on the original FCS S3. Doing some debugging now.

robert-bryson commented 2 years ago

> @robert-bryson I checked the inventory files on fcs-lifeboat. The file count is much lower than the count on the original FCS S3. Doing some debugging now.

The count for just `${BUCKET_NAME}/datagov/wordpress/` is off? I only grabbed that subset; I know the entire bucket is much larger. I will look into it in the morning.

FuhuXia commented 2 years ago

The last operation was done on an unknown bucket, not fcs-lifeboat. We have the connection details for the unknown bucket but not its service name. At this point we should find out the service name of the unknown bucket and rename it to fcs-lifeboat, or do an `aws s3 sync` from it to fcs-lifeboat. `aws s3 sync` should be faster than re-running the s3-migrate script, which would take 1444m (~24h).

[UPDATE] Located the unknown bucket: it is fcs-lifeboat in the prod space. We renamed staging:fcs-lifeboat to fcs-lifeboat-staging, then shared prod:fcs-lifeboat to staging.
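
For reference, that rename/share and a bucket-to-bucket reconciliation look roughly like this with the cf and AWS CLIs; the space names and bucket variables are assumptions for illustration:

```bash
# In the staging space: free up the name, then share the prod instance in.
cf target -s staging
cf rename-service fcs-lifeboat fcs-lifeboat-staging

cf target -s prod
cf share-service fcs-lifeboat -s staging

# If needed, reconcile the two buckets directly instead of re-running the
# 24-hour migrate.py copy (bucket names are placeholders).
aws s3 sync "s3://${UNKNOWN_BUCKET}" "s3://${FCS_LIFEBOAT_BUCKET}"
```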

robert-bryson commented 2 years ago

The Federalist build failed with a disk quota error:

2021-12-14 16:20:47 INFO [build-jekyll] /usr/local/rvm/rubies/ruby-2.7.4/lib/ruby/2.7.0/fileutils.rb:1415:in `initialize': Disk quota exceeded @ rb_sysopen - /tmp/work/site_repo/_site/datagov/wordpress/2014/04/bkodhq8caaaoumn.png (Errno::EDQUOT)

Looks like we might not be able to have all the assets local to Federalist. Will look into it further, but likely will go with the other s3 bucket route.

robert-bryson commented 2 years ago

The kind folks at #federalist-support were able to increase our disk quota. We're back on the Federalist 🚂!

[screenshots] and a new build (site), but I didn't quite get the relative links right.

robert-bryson commented 2 years ago

I wrestled `ruby` and `gem` and `jekyll` for an embarrassing amount of time today. Once they finally yielded and I was able to get `bundle exec jekyll serve` running correctly to do local debugging of why certain assets were 404-ing, it became obvious that they were 404-ing because they weren't there.
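
For anyone reproducing that local debugging loop, a minimal sketch using standard Jekyll tooling (exact Ruby/Bundler versions not shown):

```bash
# Serve the site locally to debug the 404s instead of waiting on Federalist.
gem install bundler
bundle install
bundle exec jekyll serve   # site at http://127.0.0.1:4000 by default

# Spot-check whether a 404-ing asset actually made it into the build output.
ls _site/datagov/wordpress/2019/04/ | head
```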

`aws s3 sync` shows 541 missing files (out of 7,407 total):

$ aws s3 sync s3://${BUCKET}/datagov/wordpress/ . --dryrun | wc -l
     541

A quick sync and a new build, but images that should load still don't, due to Federalist prepending `/preview/gsa/datagov-website/feature/3541-migrate-wordpress-s3-assets/` to the URLs.
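
Federalist serves branch previews under that `/preview/...` prefix, so root-relative asset paths presumably resolve outside the preview build. A quick way to audit for hard-coded root-relative references (a sketch; the `www.data.gov` directory name is taken from paths earlier in this thread):

```bash
# Find asset references that start at the site root and therefore ignore
# the /preview/... prefix Federalist adds to branch builds.
grep -rn 'src="/datagov/wordpress' www.data.gov/ | head
grep -rn 'href="/datagov/wordpress' www.data.gov/ | head
```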

robert-bryson commented 2 years ago

A whole bunch of search/replaces:

- for files in `www.data.gov/*/*/*/*/*/`:
  ![image](https://user-images.githubusercontent.com/91547795/146306047-de099b33-98db-474a-91a0-ebb63e691075.png)
  ![image](https://user-images.githubusercontent.com/91547795/146306291-90872a1f-6bc0-4fe1-afa6-34eec79da7ba.png)
  ![image](https://user-images.githubusercontent.com/91547795/146306734-d5254741-1bd8-49b4-a1b7-8232d9435126.png)
  ![image](https://user-images.githubusercontent.com/91547795/146306800-af32cf36-3222-477d-ae55-8fcb467f21ce.png)
- for files in `www.data.gov/*/*/*/*/`:
  ![image](https://user-images.githubusercontent.com/91547795/146307195-ab8939c0-ad8f-4cb8-9de9-db6d3b222fe9.png)
- for files in `www.data.gov/*/*/*/`:
  ![image](https://user-images.githubusercontent.com/91547795/146307873-da3a6b74-28b5-48ca-9f5b-6ece01b0bf0a.png)
- for files in `www.data.gov/*/*/`:
  ![image](https://user-images.githubusercontent.com/91547795/146308067-2a671c0b-0cd4-4f31-80a0-0b8334e5b1ab.png)
- for files in `www.data.gov/*/`:
  ![image](https://user-images.githubusercontent.com/91547795/146308383-b687a384-9067-4099-8338-e6c6161ea580.png)

A new build, and things still aren't working.

Federalist is doing something weird with the prefix. I dunno. It works locally.

robert-bryson commented 2 years ago

I was missing cases like `srcset="datagov/wordpress/2019/04/IENC-example-300x243.jpg 300w, /datagov/wordpress/2019/04/IENC-example-768x621.jpg 768w, /datagov/wordpress/2019/04/IENC-example.jpg 806w"`, which results in the smallest images loading, but not any of the larger responsive images. [screenshots]
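
A hedged sketch of the kind of cleanup that covers those `srcset` cases, assuming the problem is the missing leading slash on the first candidate URL (as in the example above):

```bash
# Normalize srcset entries so every candidate URL starts with /datagov/wordpress/
# (covers the quoted case where only the first candidate lacks the leading slash).
grep -rl 'srcset="datagov/wordpress' www.data.gov/ \
  | xargs -r sed -i 's|srcset="datagov/wordpress|srcset="/datagov/wordpress|g'
```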

Another new build and huzzah! Looks like it is working correctly.

robert-bryson commented 2 years ago

Successful Federalist build on main branch and demo site. 🎉 🎉

robert-bryson commented 2 years ago

A demo example with an image from the front page:

[screenshot] Before this work, the WordPress assets were being hosted in an AWS S3 bucket.

[screenshot] After this work, the assets are included in the Federalist repository and build.