benpickles / parklife

Render a Rack app (Rails/Sinatra/etc) to a static build so it can be served by Netlify, Now, GitHub Pages, S3, or any other web server.
https://www.benpickles.com/articles/90-introducing-parklife
MIT License

Rails / ActiveStorage Crawling Strategy #102

Open dramalho opened 5 months ago

dramalho commented 5 months ago

Unsure if anyone has raised this before (or even encountered it), but I'm trying to figure out if I can replace Jekyll with a very lean Rails app to run my blog locally and then use Parklife to generate a static build.

All good so far - base URL, assets, etc - but I've now hit an ActiveStorage issue where I'm taking advantage of it to preprocess images (webp / srcset).

i.e. in my view I'm using something like

        <%= picture_tag do %>
            <%= tag(:source, srcset: image.representation(:webp).processed.url) %>
            <%= tag(:source, srcset: image.representation(:thumb).processed.url) %>
            <%= tag(:source, srcset: image.representation(:large).processed.url) %>
            <%= image_tag(image.representation(:medium).processed.url) %>
        <% end %>

which generates

    <picture>
            <source srcset="http://127.0.0.1:3000/file-depot/disk/eyJfcmFpbHMiOnsiZGF0YSI6eyJrZXkiOiJ2dmRjb2ZkeWJ2bzVnM294N21pZHptYzB2ajgzIiwiZGlzcG9zaXRpb24iOiJhdHRhY2htZW50OyBmaWxlbmFtZT1cImRvZXMtYWktZHJlYW0tb2YtdW5yZWFsaXN0aWMtbW90b3JjeWNsaW5nLmpwZ1wiOyBmaWxlbmFtZSo9VVRGLTgnJ2RvZXMtYWktZHJlYW0tb2YtdW5yZWFsaXN0aWMtbW90b3JjeWNsaW5nLmpwZyIsImNvbnRlbnRfdHlwZSI6ImltYWdlL3dlYnAiLCJzZXJ2aWNlX25hbWUiOiJsb2NhbCJ9LCJleHAiOiIyMDI0LTAxLTAxVDE4OjAwOjQ0LjY1MloiLCJwdXIiOiJibG9iX2tleSJ9fQ==--3661428e79eb210f114a5a99d0bb32cdd4b3e8c7/does-ai-dream-of-unrealistic-motorcycling.jpg">
            <source srcset="http://127.0.0.1:3000/file-depot/disk/eyJfcmFpbHMiOnsiZGF0YSI6eyJrZXkiOiJoOWVkeHoyY2YxMXB5bmJ2OW41dWF3NmJ0ZWlxIiwiZGlzcG9zaXRpb24iOiJpbmxpbmU7IGZpbGVuYW1lPVwiZG9lcy1haS1kcmVhbS1vZi11bnJlYWxpc3RpYy1tb3RvcmN5Y2xpbmcuanBnXCI7IGZpbGVuYW1lKj1VVEYtOCcnZG9lcy1haS1kcmVhbS1vZi11bnJlYWxpc3RpYy1tb3RvcmN5Y2xpbmcuanBnIiwiY29udGVudF90eXBlIjoiaW1hZ2UvanBlZyIsInNlcnZpY2VfbmFtZSI6ImxvY2FsIn0sImV4cCI6IjIwMjQtMDEtMDFUMTg6MDA6NDQuNzE5WiIsInB1ciI6ImJsb2Jfa2V5In19--795726aff0a5aec48dbe9dbe77eb9cff7c0295ec/does-ai-dream-of-unrealistic-motorcycling.jpg">
            <source srcset="http://127.0.0.1:3000/file-depot/disk/eyJfcmFpbHMiOnsiZGF0YSI6eyJrZXkiOiJxa252dHJ6bW02enl5bXp4ZXE3MndlbThldTk0IiwiZGlzcG9zaXRpb24iOiJpbmxpbmU7IGZpbGVuYW1lPVwiZG9lcy1haS1kcmVhbS1vZi11bnJlYWxpc3RpYy1tb3RvcmN5Y2xpbmcuanBnXCI7IGZpbGVuYW1lKj1VVEYtOCcnZG9lcy1haS1kcmVhbS1vZi11bnJlYWxpc3RpYy1tb3RvcmN5Y2xpbmcuanBnIiwiY29udGVudF90eXBlIjoiaW1hZ2UvanBlZyIsInNlcnZpY2VfbmFtZSI6ImxvY2FsIn0sImV4cCI6IjIwMjQtMDEtMDFUMTg6MDA6NDQuNzU4WiIsInB1ciI6ImJsb2Jfa2V5In19--eaffa6680cf7c4901ad2990a98073ad16fa244f0/does-ai-dream-of-unrealistic-motorcycling.jpg">
            <img src="http://127.0.0.1:3000/file-depot/disk/eyJfcmFpbHMiOnsiZGF0YSI6eyJrZXkiOiJlbXRjcjhnaGxvZGRlNzNneGd5dmFrZGp0NTkyIiwiZGlzcG9zaXRpb24iOiJpbmxpbmU7IGZpbGVuYW1lPVwiZG9lcy1haS1kcmVhbS1vZi11bnJlYWxpc3RpYy1tb3RvcmN5Y2xpbmcuanBnXCI7IGZpbGVuYW1lKj1VVEYtOCcnZG9lcy1haS1kcmVhbS1vZi11bnJlYWxpc3RpYy1tb3RvcmN5Y2xpbmcuanBnIiwiY29udGVudF90eXBlIjoiaW1hZ2UvanBlZyIsInNlcnZpY2VfbmFtZSI6ImxvY2FsIn0sImV4cCI6IjIwMjQtMDEtMDFUMTg6MDA6NDQuNzk5WiIsInB1ciI6ImJsb2Jfa2V5In19--45cff08b9be04ee48f5dc5ee11b80605b28df079/does-ai-dream-of-unrealistic-motorcycling.jpg">
    </picture>

Running the build script looks OK too - it's generating URLs using the --base I need - BUT it won't actually crawl the images, so they never end up in the build folder.

I'm unsure how the crawling process works, so I can't tell whether the problem is that those URLs aren't being gathered (some are inside a source tag, which might be why) or something else. Regular assets are handled by assets:precompile + cp, but this is a slightly different issue and I'm wondering if you have some words of wisdom here :)

Thanks

benpickles commented 5 months ago

My own site is a small Rails app but it's still using Paperclip for images because I haven't got around to answering this very question 😬 I have a vague memory of looking into using pre-defined/named variants https://www.bigbinary.com/blog/rails-7-adds-ability-to-use-predefined-variants but I evidently never reached a conclusion.
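
For reference, declaring named variants looks roughly like this - a sketch only, with the model and variant names mirroring your srcset example rather than coming from a real app:

    class Post < ApplicationRecord
      # Pre-declare each variant so it can be processed ahead of time
      # instead of lazily on first request.
      has_one_attached :image do |attachable|
        attachable.variant :thumb, resize_to_limit: [200, 200]
        attachable.variant :medium, resize_to_limit: [800, 800]
        attachable.variant :large, resize_to_limit: [1600, 1600]
        attachable.variant :webp, format: :webp
      end
    end

image.representation(:thumb) then resolves the named variant, exactly as in your view code.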

Thanks for creating this issue, hopefully it will motivate me and we can find an answer.

dramalho commented 5 months ago

I am using variants to build the srcset (it's like Paperclip in that sense - before variants you would define transformations in views and such, variants are just pre-defined transformations). I was looking at the crawler class and it really only processes <a> tags, which I should have guessed - otherwise static assets would come along as well :)

I'm of two minds on this:

  1. One would be to ask "can the crawler basically look for any relative / base-domain URL and just crawl / fetch everything?", but I'm sure you have a very good reason for specifically looking at a tags :)
  2. Given I do have pre-defined variants and all the data I want in the database, I could probably build a script to generate the variants and copy the files into the build folder .. maybe :) It would never work for runtime transformations (i.e. not pre-defined variants), but maybe ...

Paperclip would just write the assets to the public folder, wasn't it? (I should know - my memory has selectively blocked that out for now :D)

dramalho commented 5 months ago

ok, a minor update

I tried (2) from my comment above. I can write a script / rake task that generates URLs for every variant representation in the database, yada yada yada, BUT Rails generates signed URLs (with expiration) and I didn't go far enough down the rabbit hole to confirm that a URL generated in the console ends up being the same as the one the Rack app will eventually generate. That's the blocker so far, because otherwise I could just generate each URL and copy the file over to the build folder and be done with it - possibly.
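
Something like this is what I had in mind - very much a sketch, with every name (Post, image, the variant list, the build path, the url_options) made up for illustration, and assuming the local Disk service:

    # Hypothetical rake task for idea (2): process each pre-defined variant
    # and copy the resulting file into the build folder. The blocker above
    # still applies - rep.url is signed, so the path computed here has to
    # match the one baked into the crawled HTML.
    namespace :static do
      task copy_images: :environment do
        # Generating URLs outside a request cycle needs url_options set.
        ActiveStorage::Current.url_options = { host: '127.0.0.1', port: 3000 }
        build = Rails.root.join('build')

        Post.find_each do |post|
          %i[webp thumb medium large].each do |name|
            rep = post.image.representation(name).processed
            dest = build.join(URI.parse(rep.url).path.delete_prefix('/'))

            FileUtils.mkdir_p(dest.dirname)
            # path_for is how the Disk service maps a key to a file on disk.
            FileUtils.cp(ActiveStorage::Blob.service.path_for(rep.key), dest)
          end
        end
      end
    end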

anyhoo, I then gave up and, via the power of bundle exec gem open parklife 🙈, patched the scan_for_links method to be slightly more comprehensive:

    def scan_for_links(html)
      doc = Nokogiri::HTML.parse(html)

      # Only select elements that actually carry the attribute, otherwise
      # URI.parse(nil) raises.
      urls = doc.css('a[href]').map { |v| URI.parse(v[:href]) }
      urls.concat(doc.css('img[src]').map { |v| URI.parse(v[:src]) })
      # Note: srcset can hold multiple comma-separated candidates; each of my
      # source tags happens to contain a single URL, so parsing it directly works.
      urls.concat(doc.css('source[srcset]').map { |v| URI.parse(v[:srcset]) })
      # urls.concat(doc.css('[style*="background-image"]').map { ... }) # messy - see below

      urls.each do |uri|
        # Don't visit a URL that belongs to a different domain - for now this is
        # a guess that it's not an internal link but it also covers mailto/ftp
        # links.
        next if uri.host

        # Don't visit a path-less URL - this will be the case for a #fragment
        # for example.
        next if uri.path.nil? || uri.path.empty?

        yield uri.path
      end
    end

this works fine, except that in my design I ended up needing to set images as style='background-image: ...' and that's slightly messy to contend with - a bit too much I would say, so I'll probably just change the markup.
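
For completeness, the messy version would look something like this - a sketch that regexes the url(...) value out of the inline style rather than reading a plain attribute:

    # Background images hide inside the style attribute's url(...), so they
    # need extracting with a regex rather than a plain attribute read.
    doc.css('[style*="background-image"]').each do |node|
      node[:style].scan(/url\(\s*['"]?([^'")]+)['"]?\s*\)/) do |(url)|
        urls << URI.parse(url)
      end
    end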

anyways, unsure how you feel about this or what scenarios you might have run into before that made scan_for_links stick to <a> tags, but I would love to hear your thoughts :)

benpickles commented 1 month ago

I think Parklife's Rails integration might be able to take care of most of this by subscribing to the service_url.active_storage instrumentation event, noting the blobs referenced while the site is crawled, and then copying the relevant files to the build directory (or exposing them for you to do whatever). BUT it doesn't seem to work yet - I only ever see events for a single attachment 🤷🏻‍♂️ I think it should work though, I need to spend more time on it.
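
The shape of the idea, for reference - only a sketch given it isn't behaving yet, with build_dir as a stand-in for wherever the build goes (the event name and its :key payload are documented in the Rails instrumentation guide):

    referenced_keys = Set.new

    # Note every blob key whose service URL is generated during the crawl.
    ActiveSupport::Notifications.subscribe('service_url.active_storage') do |event|
      referenced_keys << event.payload[:key]
    end

    # ...crawl the site...

    # Afterwards, copy each referenced file out of storage (Disk service assumed).
    build_dir = Pathname.new('build')
    referenced_keys.each do |key|
      src = ActiveStorage::Blob.service.path_for(key)
      dest = build_dir.join('files', key)
      FileUtils.mkdir_p(dest.dirname)
      FileUtils.cp(src, dest)
    end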