EnterpriseDB / docs

EDB Docs
https://www.enterprisedb.com/docs/
Apache License 2.0

Checking for images that are no longer used #2127

Open jericson-edb opened 2 years ago

jericson-edb commented 2 years ago

We sometimes remove images from documentation and there's a chance we'll forget to remove them from Git. So I'm looking into what it would take to audit images and find the ones we aren't using.

Finding all the images we are using

To get a list of all the MDX files under a directory:

find product_docs/docs/eprs/ -name '*.mdx'

Then we can parse the MDX files using Pandoc. I found an example of how to extract the code from Markdown and adjusted it to extract images. Here's extract_images.lua:

function Image(el)
  print(el.src)
end

And the command to run it:

pandoc --lua-filter extract_images.lua -o /dev/null [list_of_mdx_files]

Putting everything together:

find product_docs/docs/eprs/ -name '*.mdx' \
| xargs pandoc --lua-filter extract_images.lua -o /dev/null

But that results in duplicates, so sort the output and drop the repeated lines:

find product_docs/docs/eprs/ -name '*.mdx' \
| xargs pandoc --lua-filter extract_images.lua -o /dev/null \
| sort -u

The paths are also relative, which is awkward:

$ find product_docs/docs/eprs/ -name '*.mdx'| xargs pandoc --lua-filter extract_images.lua -o /dev/null | sort -u | head
../../images/image100.png
../../images/image101.png
../../images/image102.png
../../images/image103.png
../../images/image104.png
../../images/image105.png
../../images/image106.png
../../images/image107.png
../../images/image108.png
../../images/image109.png

One approach would be to extract just the filename:

find product_docs/docs/eprs/ -name '*.mdx' \
| xargs pandoc --lua-filter extract_images.lua -o /dev/null \
| xargs -L 1 basename \
| sort -u

But that only works if no two images in different directories share a filename. It's probably better to adjust the Lua filter to output absolute paths.
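One way to get absolute paths without touching the Lua filter is to resolve each reference in the shell. This is only a sketch, not something I've run against the repo: `resolve_refs` is a made-up helper name, it assumes image references are relative to the directory of the MDX file that contains them, and it relies on `realpath -m` from GNU coreutils.

```shell
# resolve_refs MDX_FILE: read image references on stdin, one per line, and
# print each one as an absolute, normalized path.
# Assumption: references are relative to the directory of MDX_FILE.
resolve_refs() (
  mdx_dir=$(cd "$(dirname "$1")" && pwd) || exit 1
  while IFS= read -r ref; do
    case "$ref" in
      http://*|https://*|'') continue ;;   # skip remote images and blank lines
      /*) printf '%s\n' "$ref" ;;          # already absolute: left unchanged
      *) realpath -m "$mdx_dir/$ref" ;;    # -m: the target need not exist
    esac
  done
)

# Hypothetical usage, running pandoc once per file so that each reference
# can be resolved against the file it came from:
#
#   find product_docs/docs/eprs/ -name '*.mdx' | while IFS= read -r f; do
#     pandoc --lua-filter extract_images.lua -o /dev/null "$f" | resolve_refs "$f"
#   done | sort -u
```

Running pandoc once per file is slower than passing every file to a single invocation, but a single invocation loses track of which file each reference came from.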

Finding all the image files in a directory

We can use find and file to list images:

find product_docs/docs/eprs/ -type f \
| xargs file | grep -o -P '^.+: +\w+ image' | cut -d: -f1 \
| sort

This gives a list relative to the current working directory, but we can fix that by using the absolute path in the find command:

find "$(pwd)"/product_docs/docs/eprs/ -type f \
| xargs file | grep -o -P '^.+: +\w+ image' | cut -d: -f1 \
| sort

Or use basename to get just the filename, matching the list of images we are using:

find product_docs/docs/eprs/ -type f \
| xargs file | grep -o -P '^.+: +\w+ image' | cut -d: -f1 \
| xargs -L 1 basename \
| sort
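The pipelines above split on whitespace, so a path containing a space would break them. Here is a sketch of a variant that avoids xargs and matches on the MIME type that file reports. `list_image_files` is a made-up name, and this assumes a `file` that supports `--mime-type`; paths containing newlines would still be mangled.

```shell
# list_image_files DIR: print every regular file under DIR whose content
# (not its extension) is identified by file(1) as an image.
list_image_files() {
  # -exec ... {} + hands the paths to file directly, so spaces are safe.
  # file prints "path: image/png"; sed keeps only matching lines and strips
  # the ": image/..." suffix to leave the path.
  find "$1" -type f -exec file --mime-type {} + \
    | sed -n 's|: *image/[^ ]*$||p' \
    | LC_ALL=C sort
}

# Hypothetical usage:
#   list_image_files "$(pwd)"/product_docs/docs/eprs/ > all_images.txt
```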

Finding the differences

If you redirect the output of the used-images command into one file (say, "used_images.txt") and the list of all images into another file ("all_images.txt", perhaps), you can diff those two lists to get the images that aren't used and (possibly) the images that are referenced but aren't in Git. Lines prefixed with < appear only in all_images.txt, so those are the unused images:

diff all_images.txt used_images.txt | head
1,12d0
< aws_instance.png
< aws_instance.png
< edb_logo.png
< edb_logo.png
< efm_slot_old.png
< efm_slot_old.png
< efm_slot.png
< efm_slot.png
< google_security_settings.png

It might make sense to do this one product version at a time. For instance, the output above is the start of the list of images that aren't used in the EPRS 6.2 documentation on the branch I happen to be using.
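As an alternative to reading diff output, comm can print each side of the comparison as a clean list. A minimal sketch, assuming both files hold one name per line; the function names are invented for illustration. Both lists are de-duplicated first, since comm expects sorted input.

```shell
# compare_lists FLAGS ALL USED: sort and de-duplicate both lists, then let
# comm suppress the columns named in FLAGS.
compare_lists() (
  tmp=$(mktemp -d) || exit 1
  LC_ALL=C sort -u "$2" > "$tmp/all"
  LC_ALL=C sort -u "$3" > "$tmp/used"
  LC_ALL=C comm "$1" "$tmp/all" "$tmp/used"
  rm -rf "$tmp"
)

# Images on disk that nothing references (lines only in the first list).
list_unused()  { compare_lists -23 "$1" "$2"; }

# Images that are referenced but not on disk (lines only in the second list).
list_missing() { compare_lists -13 "$1" "$2"; }

# Hypothetical usage:
#   list_unused all_images.txt used_images.txt
#   list_missing all_images.txt used_images.txt
```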

Removing unused images

I removed the images with:

diff all_images.txt used_images.txt | grep '^<' | sed -e 's|^< |product_docs/docs/eprs/6.2/images/|' | xargs git rm

There are also copies of the images in another (unused) directory. So I removed them:

git rm -r product_docs/docs/eprs/6.2/images/media/

josh-heyer commented 1 year ago

Possibly add this as a build step...
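A build step could be as small as a check that fails when the list of unused images is not empty. This is a rough sketch and not wired into any actual build: `fail_if_unused` is a made-up name, and it assumes the unused image names arrive on stdin, one per line.

```shell
# fail_if_unused: read unused image names on stdin, report each one on
# stderr, and return non-zero if there were any.
fail_if_unused() (
  count=0
  while IFS= read -r img; do
    [ -n "$img" ] || continue            # ignore blank lines
    printf 'unused image: %s\n' "$img" >&2
    count=$((count + 1))
  done
  [ "$count" -eq 0 ]                     # exit status: 0 only when nothing is unused
)

# Hypothetical usage, reusing the diff from above:
#   diff all_images.txt used_images.txt | sed -n 's|^< ||p' | fail_if_unused
```

Because the function is the last command in the pipeline, its exit status becomes the status of the whole step.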