abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License
74 stars 14 forks source link

PERF: Eliminate dependency on mime-types gem #17

Closed BigBigDoudou closed 3 years ago

BigBigDoudou commented 3 years ago

Inspired by: https://github.com/rest-client/rest-client/pull/644

Eliminate the dependency on mime-types which is a memory hog.

The gem mini_mime (https://github.com/discourse/mini_mime) was created in replacement. It uses the exact same database as the full fledged mime types and is capable of only loading stuff on demand with a practical, safe and bound in-memory cache.

Allocated memory by gem (derailed benchmarks):

gem memory
mime-types 6256476
mini_mime 7606
abrom commented 3 years ago

Thanks @BigBigDoudou

I like the change in concept but there are a few things that may be an issue.

First, this is definitely a breaking change. Anyone using the mime-type response would need to change from extensions to extension. Although it'd likely be a simple change in most instances, it does highlight one of the core differences between the two projects (as highlighted in https://github.com/discourse/mini_mime/issues/25). The highest priority non-obsolete match may not be enough in some circumstances. For example, say you wanted to check that a file had the correct extension to match the content type. With this change it'd be hit and miss.

Second, the referenced PR against rest-client was created in 2017 with really no movement on it. That does give me some cause for concern.. That may just be because rest-client has reached EOL? The last time any changes appear to have been made was in 2019.

Third, the last time the mini_mime DBs were updated was 2019. AFAICT there have been 6 released updates of the mime-types-data gem since then. Possibly a backwards step with regressions for the recently added/modified mime-types.

Again, I think that in concept the change is sound, I'm just not sure the benefits necessarily outweigh the detractors at this point.

FYI i've fixed the issues which appear to have caused the CI failures. A rebase should see them fixed on your branch.

BigBigDoudou commented 3 years ago

Hi @abrom , thank you for this feedback.

I've removed the modification on specs to ensure the API remains the same (kind of a workaround for extensions but it works fine).

About mini_mime DBs, I don't know if this is a message from the universe but they updated it few hours after your comment (cross my heart I didn't do anything ^^): https://github.com/discourse/mini_mime/commit/7ea1a8d41def35a7ac90fe33a0b7465bf79b823c.

Your argument about the fact that "the highest priority non-obsolete match may not be enough in some circumstances" remains valid so and indeed mini-mime won't be able to respond to that. But do these "some circumstances" exist?

I let you come back to me if you have any suggestions :)

abrom commented 3 years ago

That's some fortuitous timing for the mini_mime gem update! Nice

My use-case for needing (well.. sort of) the complete list of extensions returned is because I do some content to file type matching as a validation of an uploaded file. Ensures people aren't accidentally uploading janky files by mistake (well.. it helps). Having said that, that problem could easily solved by switching the check around. Ie get the content type of the file extension and compare it to the content type from the metadata.

One thought to try minimise potential impact would be to have a soft switch over. The gemspec would change per your PR, but the .read and #mimetype methods might use something like:

def self.mimetype(content_type)
  if @some_option_deciding_which_to_use && defined?(MIME::Types)
    warn "[DEPRECATION] `MIME::Types` is deprecated.  Please use `MiniMime` instead. etc etc how to switch over..."
    MIME::Types[content_type].first
  else
    # could possibly consider dropping the `extensions` method if there is some option for keeping existing functionality above?
    MiniMime.lookup_by_content_type(content_type).tap do |object|
      object.define_singleton_method(:extensions) { [extension] }
    end
  end
end
private_class_method :mimetype

It means that so long as the mime-types gem is loaded then there's no functional change (except for the deprecation warning, which would explain how to switch). I suspect the "option deciding which to use" would best default to the old class (if it's defined) with the fallback being the new class/gem.

BigBigDoudou commented 3 years ago

I'm trying things but I'm not sure how to handle this @some_option_deciding_which_to_use option. I've tried with a classical initializer, would that be ok for you @abrom ?

The scenarios for any gem client:

Caveat: if the user has mime/types but with a version not supported by Henkei, this could lead to troubles... But we're far from v4 for now.

abrom commented 3 years ago

Change is looking really good @BigBigDoudou 👍

I've left a suggestion RE the configuration setup, but I think otherwise GTG

BigBigDoudou commented 3 years ago

@abrom Thank you for the suggestion, it's really much cleaner now! Hope it's all fine for you, I let you go with it when you're ready :)

abrom commented 3 years ago

FYI this has been released in v1.23.3

Note I've also just released v1.24.1 (update to Tika 1.24.1) and v1.25.1 (update to Tika 1.25)

Thanks again