Move list of Git-downloaded extensions into a YAML file

CanastaWiki / Canasta

MediaWiki Docker image for Canasta, an all-in-one MediaWiki stack for easy deployment and management of enterprise-ready MediaWiki on production environments.

https://www.canasta.wiki

MIT License

Move list of Git-downloaded extensions into a YAML file #291

Closed: yaronkoren closed this issue 7 months ago

yaronkoren commented 1 year ago

About 500 of the Canasta Dockerfile's 761 lines are devoted to one thing: downloading the roughly 120 extensions that Canasta gets via Git. The code for every download is roughly the same, so this represents a lot of duplicated code. This could be done much more cleanly by putting all the needed information about these extensions into a single file (probably YAML makes the most sense), then having the Dockerfile read this file and cycle through the extensions, downloading each one. Having this separate file could also make it easier for people to use Canasta as a base image for their own MediaWiki distributions, since presumably the big difference between one distribution and the next is the set of extensions used.

(Very similar code could be created for downloading skins - this issue covers only extensions, since the need for streamlining skin downloads seems much smaller. But it would be great to do both, I think.)

yaronkoren commented 1 year ago

Thinking more about this, it may make sense to also include the names of patch files in the YAML file. There are currently four extensions that are patched, and if the patch file information was moved into the YAML file, it would remove another 20 or so lines of code from the Dockerfile. (Ignoring the additional lines of code that would need to be added to handle all this stuff.) More importantly, it would remove another reason for images that inherit from Canasta to ever have to modify the Dockerfile.

By the way, here is what a section of the YAML file could look like for the SemanticScribunto extension, which is patched:

- SemanticScribunto: # v. 2.2.0
    url: https://github.com/SemanticMediaWiki/SemanticScribunto
    revision: 1c616a4c4da443b3433000d6870bb92c184236fa
    patches: semantic-scribunto-autoload.patch

(The code could assume that every patch file is located in the /_sources/patches directory.)
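
As a rough illustration of how a build script might consume one such entry, the Python sketch below turns the parsed form of the SemanticScribunto entry into the shell commands to run. The function name, the `dest_root` parameter and the use of `git apply` are assumptions made for illustration; only the entry's fields and the /_sources/patches location come from the comments above.

```python
import shlex

PATCH_DIR = "/_sources/patches"  # location suggested in the comment above


def commands_for_entry(entry, dest_root="extensions"):
    """Return the shell commands that fetch (and patch) one extension.

    `entry` is the parsed form of one YAML list item: a one-key dict
    mapping the extension name to its settings.
    """
    name, settings = next(iter(entry.items()))
    dest = shlex.quote(f"{dest_root}/{name}")
    cmds = [
        f"git clone {shlex.quote(settings['url'])} {dest}",
        f"git -C {dest} checkout -q {shlex.quote(settings['revision'])}",
    ]
    # `patches` may be a single file name or a list of file names.
    patches = settings.get("patches", [])
    if isinstance(patches, str):
        patches = [patches]
    for patch in patches:
        cmds.append(f"git -C {dest} apply {shlex.quote(PATCH_DIR + '/' + patch)}")
    return cmds


# What a YAML parser would produce for the entry shown above.
entry = {
    "SemanticScribunto": {
        "url": "https://github.com/SemanticMediaWiki/SemanticScribunto",
        "revision": "1c616a4c4da443b3433000d6870bb92c184236fa",
        "patches": "semantic-scribunto-autoload.patch",
    }
}
for cmd in commands_for_entry(entry):
    print(cmd)
```

The sketch only builds the command strings; a real build step would still have to execute them and stop on the first failure.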

vedmaka commented 1 year ago

I think it's a good idea, and I like the YAML approach, but there is one main drawback to consider: this approach makes it impossible to benefit from layer caching.

Cached layers are not used much in Canasta currently (most of the extensions are downloaded in a single RUN command), but in Taqasta we do make use of them, by splitting the extension downloads into batches: https://github.com/WikiTeq/Taqasta/blob/master/Dockerfile#L178

This lets the builder avoid re-downloading all the extensions when one extension's version changes; instead, it reuses the cached layers for most of the unchanged batches. (For Taqasta, it's crucial to reduce build times as much as possible.)

> More importantly, it would remove another reason for images that inherit from Canasta to ever have to modify the Dockerfile.

I don't think this changes much for the images that inherit from Canasta. Currently, any image that inherits from Canasta also inherits all the extensions that were downloaded during the base image build. If there is a need to change an extension's version, the image removes the extension directory and downloads the version it wants. That's quick to do, because it's just a single command.

If Canasta migrated to YAML files, an image that inherits from it would have two options: either do it the same way as before, by deleting the directory and downloading the desired extension version, or modify the YAML file and run the Canasta script to re-download all the extensions according to the modified YAML file, which is a pretty time-consuming operation. (The Canasta script would also need to support this use case, e.g., by wiping the extensions directory before re-downloading.)

Also, to allow the inheriting image to re-apply patches, it would apparently be necessary to keep all the .git directories in place in the resulting Canasta image, which would increase the image size noticeably.

yaronkoren commented 1 year ago

Yes, I was partly wrong about such a change making it easier for Canasta-based images to define their own set of extensions and skins. After all this time, I am still not very knowledgeable about Docker, so I keep thinking about a Docker base image as something like a "base class" that can be overridden, even though that is definitely not the case (as you point out).

However, I say "partly" because there may still be ways in the future to allow for building the Canasta image with a custom set of extensions and skins (without having to download, delete and then re-download). One option is to have environment variables that define the names of the YAML files to use, instead of just the default canasta-extensions.yaml and canasta-skins.yaml (or whatever they would be called). Another is to turn Canasta into two Docker images: a base image containing only MediaWiki, and a second image containing the extensions and skins. (Although with that approach, the YAML file setup would not really change anything either way.)

I didn't understand the thing about layer caching. What is the ideal approach: a single RUN command to get all the extensions, a separate RUN command for each extension, or something in between, like a RUN command for each letter of the alphabet, or a RUN command for every five extensions? Whatever approach is best, my guess is that the Dockerfile code that handled the YAML files could do that.

vedmaka commented 1 year ago

> What is the ideal approach: a single RUN command to get all the extensions, a separate RUN command for each extension

The ideal approach is to have a separate RUN command for each extension, because each RUN command is converted into a layer that can be cached and then reused as long as the RUN statement is unmodified.

> Whatever approach is best, my guess is that the Dockerfile code that handled the YAML files could do that.

I don't think so. Any code in the Dockerfile will itself be a single RUN. That RUN will be re-evaluated each time the command or the source YAML file changes, and I don't see any way to achieve the same caching as with many RUNs. But, as mentioned, this won't be much different from how Canasta is composed right now (with most of the extensions installed in a single RUN command).
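
The caching trade-off can be illustrated with a toy model in Python. It assumes, as a simplification of Docker's actual behaviour, that a layer is reused only when its own instruction and every layer before it are unchanged. The function names, extension names and revisions are made up for illustration.

```python
def batches(extensions, batch_size):
    """Split (name, revision) pairs into groups, one RUN layer per group."""
    return [
        tuple(extensions[i:i + batch_size])
        for i in range(0, len(extensions), batch_size)
    ]


def extensions_refetched(old, new, batch_size):
    """Count extensions downloaded again when rebuilding `new` after `old`."""
    old_layers = batches(old, batch_size)
    new_layers = batches(new, batch_size)
    refetched = 0
    cache_valid = True
    for i, layer in enumerate(new_layers):
        if cache_valid and i < len(old_layers) and old_layers[i] == layer:
            continue  # cache hit: this layer is reused as-is
        cache_valid = False  # the first miss invalidates all later layers
        refetched += len(layer)
    return refetched


old = [("A", "r1"), ("B", "r1"), ("C", "r1"), ("D", "r1")]
new = [("A", "r1"), ("B", "r1"), ("C", "r1"), ("D", "r2")]  # only D changed

print(extensions_refetched(old, new, batch_size=4))  # single RUN: 4
print(extensions_refetched(old, new, batch_size=2))  # batches of two: 2
print(extensions_refetched(old, new, batch_size=1))  # one RUN each: 1
```

In this model a change near the top of the list still invalidates every later layer, so the saving from splitting depends on where in the list the change falls.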

> so I keep thinking about a Docker base image as something like a "base class" that can be overridden, even though that is definitely not the case (as you point out)

It would be nice, but yes, unfortunately, image inheritance is mostly just filesystem inheritance.

> Another is to turn Canasta into two Docker images - a base image containing only MediaWiki, and a 2nd image containing the extensions and skins

At some point, we considered this approach at Taqasta: extracting the extensions bundle into a separate image. It is a viable option. However, it complicates things a lot, creating extra dependency and version management needs. Note that we were considering this solely for increasing build speed, not for extensibility or inheritance.

> One option is to have environment variables that define the name of the YAML files to use, instead of just the default canasta-extensions.yaml and canasta-skins.yaml (or whatever they would be called)

That's an option. However, this probably has nothing to do with inheritance: with this, you can take the original Canasta and simply build it while supplying different ARGs or different YAML file contents. But you still need to build it all from scratch, even if there is only a single change in the list of extensions.

So far, I don't see any good options that would let images inheriting from Canasta modify the list of installed extensions. The common approach still looks like the best one: inherit your image from Canasta and make the necessary updates, like below:

FROM canasta
RUN rm -rf canasta-extensions/Something
RUN git clone ... canasta-extensions/SomethingNew
RUN sed .... composer.local.json
RUN composer update --no-dev
CMD ["/run-apache.sh"]

yaronkoren commented 1 year ago

@vedmaka - thanks for this response. The whole issue with wanting to just override one extension in Canasta is pretty easily solvable, I think - if you allow the environment variables to set an array of YAML files for extensions and/or skins, instead of just a single YAML file, then an inheriting image could have a setting like:

EXTENSIONS_YAML=("canasta-extensions.yaml" "/src/nicewiki/nicewiki-extensions.yaml")

nicewiki-extensions.yaml could then hold just a few lines, like:

- AdminLinks:
    revision: abcde01........

For any duplicates, Canasta would use the information from the last listing encountered - letting anyone build a version of Canasta with just a set of overrides to the group of extensions and skins.

What about removing an extension or skin that was already listed? I suppose the YAML file could also contain lines like this:

- AdminLinks: # we don't want this extension

If there were no settings for a particular extension or skin, the code would simply skip over it. (That's what it should do even if there were no overriding capability, I would think.)
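
The "last listing wins" rule and the empty-entry convention for removals can be sketched in Python as follows. The function name is hypothetical, the lists stand in for the parsed contents of each YAML file (in the order given by EXTENSIONS_YAML), and the sketch assumes that a later entry replaces an earlier one wholesale rather than being merged field by field. The revision strings are placeholders.

```python
def merge_extension_lists(*parsed_files):
    """Merge parsed YAML lists; later files override earlier ones.

    Each file is a list of one-key dicts of the form {name: settings}.
    An entry with no settings (None or {}) removes that extension.
    """
    merged = {}
    for entries in parsed_files:
        for entry in entries:
            for name, settings in entry.items():
                if settings:
                    merged[name] = dict(settings)  # last listing wins
                else:
                    merged.pop(name, None)  # empty entry: drop the extension
    return merged


# Stand-ins for canasta-extensions.yaml and nicewiki-extensions.yaml.
base = [
    {"AdminLinks": {"revision": "rev-base"}},
    {"Cargo": {"revision": "rev-base"}},
]
overrides = [
    {"AdminLinks": {"revision": "rev-override"}},
    {"Cargo": None},  # "we don't want this extension"
]
print(merge_extension_lists(base, overrides))
```

With these inputs, AdminLinks ends up with the overriding revision and Cargo is dropped from the result.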

What do you think?

As for the caching stuff - I have a question: why is it necessary for you to reduce build times as much as possible? What is the harm if (to take an extreme example) setting up a wiki takes 20 minutes instead of 2 minutes? I assume you are not trying to generate thousands of wikis a day.

vedmaka commented 1 year ago

I still do not think this improves anything. The reason is that, regardless of how the YAML files are composed, the inheriting image will have to run something to process the YAML files and install extensions according to them.

Since the image is inherited from Canasta, the filesystem will already contain an extensions directory with all the extensions, their Composer dependencies, patches, etc. So if the inheriting image runs something to process the YAML files again, that something has to be intelligent enough to take this situation into account (or do rm -rf vendor && rm -rf extensions and start it all from scratch).
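
One way such a step could take the existing contents into account is to compare what is installed with what is wanted and act only on the differences. The Python sketch below is an assumption about how that might look, not something Canasta provides; the function name and the dict shapes are hypothetical.

```python
def plan_changes(installed, desired):
    """Compare {name: revision} dicts and return what has to change.

    Returns (to_remove, to_fetch): extension directories to delete and
    extensions to clone. An extension whose revision changed appears in
    both lists, since its old directory is deleted before re-cloning.
    """
    to_remove = sorted(n for n in installed if desired.get(n) != installed[n])
    to_fetch = sorted(n for n in desired if installed.get(n) != desired[n])
    return to_remove, to_fetch


installed = {"A": "r1", "B": "r1", "C": "r1"}  # what the base image shipped
desired = {"A": "r1", "B": "r2", "D": "r1"}    # what the child image wants

to_remove, to_fetch = plan_changes(installed, desired)
print(to_remove)  # ['B', 'C']
print(to_fetch)   # ['B', 'D']
```

This only covers the extension directories. Composer dependencies and patches would still need separate handling, which is part of the complexity described above.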

It does not look like a set of YAML files makes this any different from having a single YAML file. You still need to rebuild everything. I'd guess it even makes things more complicated, because you need to have a canasta-extensions.yaml in place (assuming that what we're talking about here is Docker inheritance, not GitHub forking).

> As for the caching stuff - I have a question: why is it necessary for you to reduce build times as much as possible? What is the harm if (to take an extreme example) setting up a wiki takes 20 minutes instead of 2 minutes? I assume you are not trying to generate thousands of wikis a day.

The main reason is to speed up CI builds as much as possible. Waiting 20-30 minutes on each iteration is quite disruptive for our development and delivery processes.

yaronkoren commented 1 year ago

@vedmaka - thank you for your answers, and your patience so far. There is a lot that I still need to learn about Docker, but this discussion is helping me quite a bit.

So - you are right that trying to have Canasta "child" images modify the original extension YAML probably would not work. But what about the option you suggested before, of having a child image go into the canasta-extensions/ and canasta-skins/ directories and make deletions and additions as necessary? That way (I think) the build process will remain speedy even if Canasta switches to a YAML approach - since the entire base image of Canasta will get cached as a single layer, even if the build process for Canasta itself is slow.

Is that correct? And if that seems like a reasonable approach, do you still have any objections to having Canasta itself load extensions and skins via YAML files?

vedmaka commented 1 year ago

I am happy to help!

Yes, having the child image do whatever it needs in the extensions and skins directories (and optionally run Composer) seems to be the best way to handle this. Regardless of how the original image put the files there in the first place, this is the fastest way for the child image to make modifications. And yes, the parent Canasta image will stay cached.

I have no objections to Canasta itself migrating to YAML. My only note is that this will slow down builds, but I assume that is not super important for the Canasta workflow.

yaronkoren commented 8 months ago

Here is some pseudo-code on how I think a single line of the YAML should be handled, by the way:

if ( url is set )
   if ( branch is not set )
       branch := "master"
   end if
else
   url := "https://github.com/wikimedia/mediawiki-extensions-" + extensionName
   if ( branch is not set )
       branch := $MW_VERSION
   end if
end if
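
The defaulting rules in the pseudo-code above can be sketched in Python as follows. The function name and the `mw_version` parameter are hypothetical, and "REL1_39" is only an example value standing in for $MW_VERSION.

```python
WIKIMEDIA_PREFIX = "https://github.com/wikimedia/mediawiki-extensions-"


def resolve_source(name, settings, mw_version):
    """Return the (url, branch) pair to clone for one extension entry."""
    url = settings.get("url")
    branch = settings.get("branch")
    if url:
        # Custom repository: fall back to its master branch.
        branch = branch or "master"
    else:
        # No URL given: assume the Wikimedia mirror and the MediaWiki branch.
        url = WIKIMEDIA_PREFIX + name
        branch = branch or mw_version
    return url, branch


print(resolve_source("AdminLinks", {}, "REL1_39"))
print(resolve_source(
    "SemanticScribunto",
    {"url": "https://github.com/SemanticMediaWiki/SemanticScribunto"},
    "REL1_39",
))
```

The resolved URL and branch then feed into the clone command that follows.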

git clone --single-branch -b branch url $MW_HOME/extensions/extensionName \
    && cd $MW_HOME/extensions/extensionName \
    && git checkout -q revision \