benwbrum / fromthepage

FromThePage is a wiki-like application for crowdsourcing transcription of handwritten documents.
http://fromthepage.com
GNU Affero General Public License v3.0
171 stars 51 forks source link

Application always requires all gems #4316

Open bencomp opened 2 months ago

bencomp commented 2 months ago

I tried to bundle install without the development and test groups of gems, but then the application would not start. I reported this in https://github.com/benwbrum/fromthepage/issues/4291#issuecomment-2326368286

I just found out that not installing gems in certain groups causes the application to fail during startup, because https://github.com/benwbrum/fromthepage/blob/1432ebdefc2d0a90a4dbba4d8e4d91a0173af032/config/application.rb#L7 explicitly loads all gems. Is it possible to be more cautious in loading gems? I'm thinking it could make FromThePage run with a smaller memory footprint too. I don't want to use OpenAI or Bento, so I don't want to install, let alone load those gems.

The OpenAI and Bento gems are bad examples, because the application itself relies on them. Capistrano and related gems are better examples, since they are used for deployment only (and I would not need them when I deploy the application in a Docker container).

The Bundler documentation on groups says:

By default, a Rails generated app calls Bundler.require(:default, Rails.env) in your application.rb, which links the groups in your Gemfile to the Rails environment. If you use other groups (not linked to a Rails environment), you can add them to the call to Bundler.require if you want them to be automatically required.

1e8fa225c602d914c676df2534e1ac90df8d5090 was the latest change to this line in application.rb; it's not clear from the commit message or in-line comments why this was necessary. I think it should be possible to change the call so that fewer gems are required at runtime.

benwbrum commented 2 months ago

Interesting -- a lot of those dependencies are regulated by global constants set in the initializer, as we need to control access to the features that depend on them in the UI. I wonder if it would be possible to move some of those to bundler groups, then load the groups based on the constants?

bencomp commented 2 months ago

@benwbrum I just edited my first comment to clarify that it's more likely that capistrano could be left out than ruby-openai. Making OpenAI (or Bento or Transkribus or ...) optional would be interesting, but that kind of customisation can definitely wait. I do consider this issue to be about optimisation; I haven't tested if this indeed improves performance by reducing memory footprint and/or loading times.

benwbrum commented 2 months ago

Let's talk a little more about open source distributions. Right now we have a development process that goes like this:

Feature development happens in feature branches based off of development feature a + feature b + feature c -> development

Deployment happens in two different branches

Since there is code specific to the SaaS business that open source installations don't want, we have some commits which remove this from main. So we could add code to remove some of these dependencies from main, but the problem is that we don't know which dependencies are actually being used "in the wild". It's possible that the University of Melbourne is using capistrano, even if Utrecht isn't.

Thoughts?

WillNigel23 commented 2 months ago

There is also the consideration of how huge of an impact does it make on the docker size and speed. We will also have to coordinate to open-source that we are doing this and they have to re-add the necessary gems back if they need it (which again as @benwbrum mentioned we still don't know fully which is which)

But I agree that we do need to identify which is necessary for each cases (development, test, opensource, etc)

bencomp commented 2 months ago

Thanks for clarifying, @benwbrum! I would still like to differentiate between making FromThePage features optional (like OpenAI and Transkribus integrations, and SaaS-only aspects like redirecting the homepage to /landing) and understanding which gems are used only in development or testing, which are used for deploying and which are only used in production on the server. I understand that there isn't always a clear distinction to be made and my knowledge of Ruby on Rails is still limited. I also don't want to interfere in your business :)

As @WillNigel23 says, coordinating with admins of other FromThePage instances is necessary, because it is likely that installation instructions may change.

This issue should be considered together with #4291 as I think that no gems need to be removed from the Gemfile, just grouped (except uglifier, as it is not used). It is very likely that other parts of the code need to be changed, to be more selective in the gems that certain parts of the code require or when they require gems, like in config/initializers/openai.rb:

require 'openai'

if ENABLE_OPENAI
  OpenAI.configure do |config|
    config.access_token = OPENAI_ACCESS_TOKEN
  end
end

Perhaps require 'openai' could be moved to after if? Note that I'm again using OpenAI integration as an example, which is an application feature, but it was the easiest example to find.

After #4291, I envision that installation instructions may include something like:

  • If you do not use Capistrano to deploy, you can bundle config set --local without capdeploy before bundle install.
  • If you precompile assets on your local machine and copy them from the public/ directory to the server, you can bundle config set --local without assets on the server before bundle install.

These examples are based on an idea to put the capistrano-related gems in a :capdeploy group and all gems needed for asset compilation in the :assets group in the Gemfile.

Perhaps some of the SaaS features/specifics could be extracted from the shared codebase into separate gems, eventually, so that building and deploying the SaaS instead of the basic version is a matter of configuration to include the extras instead of applying specific commits to disable and/or remove SaaS specifics. Or the views as they are should be updated to render SaaS specifics only if ENV["SAAS"] is set, for example. I'm probably oversimplifying here.

benwbrum commented 2 months ago

Good point on differentiating between product features and deployment-specific code.

In an ideal world, development would contain only shared code, fromthepage.com would contain only SaaS code, and main would contain F/OSS installation-specific code. In that case, trial account creation and landing pages would live in fromthepage.com, the container stuff would live in main, and neither would be in development. We tried something like this for a while, but discovered that Github's automated PR merge system kept merging fromthepage.com into development before merging development into fromthepage.com, so whenever we'd cut a new release on main we'd discover all this SaaS-specific wording that was inappropriate. As a result, we merge on the command line whenever we create a new release on main, and main contains the F/OSS installation specific commits that remove the landing pages, etc. This is a pain, but since most of the development activity is between development and fromthepage.com, it's what we've optimized for.

So I think that it makes sense to conditionally load the OpenAI gem (and others) in development, but some of the SaaS-specific extraction and F/OSS deployment code should live in main.

Notifying other installations is also tricky -- we are friends with the folks running installations at University of Texas-Austin, University of Nebraska-Lincoln, and of course University of Leiden, and have corresponded briefly with folks at Utrecht and Melbourne, but we have no contact at all with the people running installations at Bolsano, Taipei, or Buenos Aires (and of course there are likely installations which we don't even know about.)

Technically: I'd be very happy to merge the conditionalization of require if you find that it works.

bencomp commented 4 weeks ago

Did you consider creating a script and/or patch file to perform the steps that make the code on the development branch ready for the main branch (or a release)? During the build of the container image I also run some steps to e.g. remove test code and fixtures and remove the newrelic_rpm dependency from Gemfile and Gemfile.lock. Since I build from the development branch by default, I feel that a script (that lives in the same branch) could be helpful to transform the code into "production" code. If this works, you could automatically keep main in sync with development if you wanted.