We sometimes remove images from documentation and there's a chance we'll forget to remove them from Git. So I'm looking into what it would take to audit images and find the ones we aren't using.
Finding all the images we are using
To get a list of all the MDX files under a directory:
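A minimal sketch of that step using find. The function name list_mdx_files is just an illustrative label, and the directory argument defaults to the current directory.

```shell
# List every MDX file under a directory (defaults to the current directory).
list_mdx_files() {
  find "${1:-.}" -type f -name '*.mdx'
}
```

For example, `list_mdx_files docs` would print one path per line for every .mdx file under a hypothetical docs directory.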
Then we can parse the MDX files using Pandoc. I found an example of how to extract the code from Markdown and adjusted it to extract images. Here's extract_images.lua:
And the command to run it:
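The original filter and command are not reproduced here, so the following is only a sketch of what they might look like. It assumes the MDX files can be read with Pandoc's Markdown reader and that images use the standard `![alt](path)` syntax (images written as raw `<img>` tags would not be picked up). The function names write_image_filter and images_in_file are illustrative.

```shell
# Write a small Pandoc Lua filter that prints the source of every image element.
write_image_filter() {
  cat > "${1:-extract_images.lua}" <<'EOF'
-- Called once for each Image element Pandoc finds in the document.
function Image(el)
  print(el.src)
end
EOF
}

# Print the image paths referenced by one MDX file ($1), read as Markdown.
# The converted document is discarded; only the filter's printed lines are kept.
images_in_file() {
  pandoc --from markdown --to plain \
    --lua-filter "${2:-extract_images.lua}" -o /dev/null "$1"
}
```

Run `write_image_filter` once, then `images_in_file some/page.mdx` for each file you want to inspect.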
Putting everything together:
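One way to combine the two steps, sketched under the assumption that extract_images.lua prints one image path per line and that Pandoc is on the PATH. The name list_used_images is illustrative.

```shell
# Run the image filter over every MDX file under a directory.
# $1 = directory to search (default: current directory)
# $2 = path to the Lua filter (default: extract_images.lua)
list_used_images() {
  find "${1:-.}" -type f -name '*.mdx' \
    -exec pandoc --from markdown --to plain \
      --lua-filter "${2:-extract_images.lua}" -o /dev/null {} \;
}
```

Pandoc is started once per file, so this can be slow on a large tree.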
But that results in duplicates. So sort them:
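A sketch of the de-duplication step: sort -u sorts its input and keeps one copy of each line. The function name dedupe_image_list is illustrative.

```shell
# Read image paths from standard input, sort them, and drop repeated lines.
dedupe_image_list() {
  sort -u
}
```

Pipe the output of the image-listing command into `dedupe_image_list` to get each path once.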
The results are also relative paths, which is awkward:
One approach would be to extract just the filename:
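A sketch of the filename-only approach. It reads paths line by line rather than through xargs so that paths containing spaces survive. The name image_filenames is illustrative.

```shell
# Read image paths from standard input and print each distinct filename
# with its directory part stripped.
image_filenames() {
  while IFS= read -r path; do
    basename "$path"
  done | sort -u
}
```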
But that only works if there are no duplicate names in the files across different directory structures. Probably better to adjust the Lua filter to output absolute paths.
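I haven't changed the Lua filter here. As an alternative that stays in the shell, this sketch resolves a relative image path against the directory of the MDX file that references it. It assumes the image path is relative (not a site-root path starting with /) and that the image's directory exists on disk. The name absolute_image_path is illustrative.

```shell
# Resolve an image path relative to the MDX file that references it.
# $1 = the MDX file, $2 = the image path as written in that file.
# Prints nothing and returns non-zero if the image's directory does not exist.
absolute_image_path() {
  (
    cd "$(dirname "$1")/$(dirname "$2")" 2>/dev/null || exit 1
    printf '%s/%s\n' "$(pwd -P)" "$(basename "$2")"
  )
}
```

Because the result is a full path, two images with the same filename in different directories stay distinct.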
Finding all the image files in a directory
We can use find to list images:
This gives a list relative to the current working directory, but we can fix that by using the absolute path in the find command:
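A sketch of both forms. The set of extensions (png, jpg, jpeg, gif, svg) is my assumption about what counts as an image here, and the name list_image_files is illustrative.

```shell
# List image files under a directory (defaults to the current directory).
# The paths printed start with whatever directory was passed in, so passing
# an absolute directory gives absolute paths.
list_image_files() {
  find "${1:-.}" -type f \
    \( -name '*.png' -o -name '*.jpg' -o -name '*.jpeg' \
       -o -name '*.gif' -o -name '*.svg' \)
}
```

`list_image_files .` prints paths relative to the current working directory, while `list_image_files "$PWD"` prints absolute ones.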
Or use basename to just get the filename to match what we did in looking for images we are using:
Find the differences
If you pipe the output of the files used command into one file (say, "used_images.txt") and the list of all images in another file ("all_images.txt", perhaps), you can diff those two lists to get the images that aren't used and (possibly) the images that aren't in Git:
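A sketch of the comparison using comm, which works on sorted input and can show the lines unique to either list. It assumes both files hold entries in the same form (all filenames, or all absolute paths). The function names are illustrative.

```shell
# Compare two lists after sorting them the same way.
# $1 = comm column selector, $2 = first list, $3 = second list.
compare_image_lists() {
  second_sorted=$(mktemp)
  LC_ALL=C sort -u "$3" > "$second_sorted"
  LC_ALL=C sort -u "$2" | LC_ALL=C comm "$1" - "$second_sorted"
  rm -f "$second_sorted"
}

# Images on disk that nothing references: lines only in the "all" list.
unused_images() { compare_image_lists -23 "$1" "$2"; }

# Referenced images with no file on disk: lines only in the "used" list.
missing_images() { compare_image_lists -13 "$1" "$2"; }
```

With the file names from the text, `unused_images all_images.txt used_images.txt` lists candidates for removal, and `missing_images all_images.txt used_images.txt` lists references that point at nothing.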
Might make sense to do this one product version at a time. For instance, the list above is the first 10 images that aren't used in the EPRS 6.2 documentation on the branch I happen to be using.
Removing unused images
I removed the images with:
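The exact command isn't reproduced here, so this is a cautious sketch rather than what was actually run. It assumes the unused list contains bare filenames, one per line, and it only prints matches unless asked to delete. The name remove_listed_images and the "delete" argument are illustrative.

```shell
# $1 = file listing unused image filenames, one per line
# $2 = directory to search
# $3 = pass "delete" to remove the matches; otherwise they are only printed
# Note: find treats each name as a glob pattern, so names containing
# characters such as * ? [ ] may match more than one file.
remove_listed_images() {
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    if [ "${3:-}" = "delete" ]; then
      find "$2" -type f -name "$name" -exec rm -- {} +
    else
      find "$2" -type f -name "$name" -print
    fi
  done < "$1"
}
```

Swapping rm for git rm would stage the removals in Git as well. The same function can be pointed at a second directory to clear out copies of the images kept elsewhere.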
There are also copies of the images in another (unused) directory. So I removed them: