OregonDigital / OD2

Next generation of Oregon Digital ( https://oregondigital.org ) digital collections platform, built on Samvera Hyrax ( https://github.com/samvera/hyrax/ )

Use a named temporary directory for OCR jobs #2270

Open decimalator opened 2 years ago

decimalator commented 2 years ago

Descriptive summary

Derivative jobs that run Tesseract create their temporary files without specifying a directory, so the files end up in the app's current working directory (/data).

Defined in: oregon_digital/hocr_derivative_service.rb

      def temporary_output
        @temporary_output ||= Tempfile.new
      end

Tempfile.new()

new(basename="", tmpdir=nil, mode: 0, **options)

Output files from Tesseract:

20220408-40-1010b8s                  20220408-40-hfpucb                   build                                od220220408-40-1vy37qn.png
20220408-40-1010b8s.hocr             20220408-40-hfpucb.hocr              config                               od220220408-40-1w19tt0.png
20220408-40-14tld02                  20220408-40-jdiubi                   config.ru                            od220220408-40-1yz8x4l.png
20220408-40-14tld02.hocr             20220408-40-radpif                   db                                   od220220408-40-8g05f2.png
20220408-40-179lvfx                  20220408-40-radpif.hocr              docker-compose.override.yml-example  od220220408-40-jmtclg.png
20220408-40-179lvfx.hocr             20220408-40-taic4n                   docker-compose.yml                   od220220408-40-m1cd6k.png
20220408-40-1fo71vt                  20220408-40-taic4n.hocr              fits.log                             od220220408-40-wdxiax.png
20220408-40-1fo71vt.hocr             20220408-40-w4keen                   lib                                  package.json
20220408-40-1g6iucs                  Gemfile                              log                                  public
20220408-40-1g6iucs.hocr             Gemfile.lock                         node_modules                         spec
20220408-40-1rbc9m5                  README.md                            od220220407-40-4c6ezy.jp2            tmp
20220408-40-1rbc9m5.hocr             Rakefile                             od220220408-40-19ff4yv.png           vendor
20220408-40-1y9shxr                  app                                  od220220408-40-1gozlc0.png           yarn.lock
20220408-40-1y9shxr.hocr             bin                                  od220220408-40-1oxzb0t.png

Expected behavior

Temporary files should be created in a standard location. For better performance we should consider writing them to a RAM disk. The directory used for each derivative type should be configurable through environment variables at runtime.
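A minimal sketch of what that could look like for the hOCR service. The variable name `OCR_TMPDIR`, the helper `ocr_tmpdir`, and the `od2-hocr` basename are assumptions for illustration, not existing OD2 settings. When the variable is unset the sketch falls back to `Dir.tmpdir`.

```ruby
require 'tempfile'
require 'tmpdir'

# Hypothetical helper: OCR_TMPDIR is an assumed variable name, not an
# existing OD2 setting. Falls back to Ruby's default temp directory.
def ocr_tmpdir
  ENV.fetch('OCR_TMPDIR', Dir.tmpdir)
end

# Create the Tesseract scratch file in an explicit, named directory
# by passing both a basename and a tmpdir to Tempfile.new.
def temporary_output
  @temporary_output ||= Tempfile.new('od2-hocr', ocr_tmpdir)
end
```

Passing the directory explicitly means the location no longer depends on whatever `Dir.tmpdir` resolves to inside the container.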

decimalator commented 2 years ago

I would suggest we move to a per-content-type temporary file base directory to allow us more flexibility.

With derivative generation we want temporary file I/O to be as fast as possible, so these files should be written to a RAM disk where practical. We should investigate using the size of the source file to decide where to write temporary files: if the file is below a given size (N bytes), temporary files go to the RAM disk; if it is larger, they go to local disk, to avoid filling the RAM disk and exhausting the node's RAM.

We can add a Memory-type emptyDir volume (/ramdisk) for each worker. When the Pod starts, the RAM disk is created and mounted at the directory we give it. If the container restarts or crashes, the data persists, but when we deploy a new version of the container or the Pod moves between nodes, the data is lost. This is fine for temporary files; it is even helpful, since it keeps orphaned temporary files from accumulating. We'll need to define a threshold file size below which the RAM disk is used.

Larger items will have to be written to disk. We'll investigate enhancements to local container storage to make that as fast as possible.
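The size-based choice could be sketched roughly as follows. Everything named here is hypothetical: the helper `tmpdir_for`, the three environment variables, and the 64 MiB default are placeholders, not measured values or existing settings.

```ruby
require 'tmpdir'

# Pick a temp directory based on the size of the source file.
# All variable names and the 64 MiB default are assumptions for illustration.
def tmpdir_for(source_path,
               ramdisk: ENV.fetch('DERIVATIVES_RAMDISK_DIR', '/ramdisk'),
               local: ENV.fetch('DERIVATIVES_LOCAL_TMP_DIR', Dir.tmpdir),
               max_bytes: Integer(ENV.fetch('DERIVATIVES_RAMDISK_MAX_BYTES',
                                            (64 * 1024 * 1024).to_s)))
  small_enough = File.size(source_path) <= max_bytes
  # Use the RAM disk only for small files, and only if it is actually mounted.
  small_enough && File.directory?(ramdisk) ? ramdisk : local
end
```

Note that a derivative's temporary files can be larger than the source file, so the threshold would need to be chosen from real measurements rather than from the source size alone.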

We already have Hydra::Derivatives.temp_file_base. We should add further temp_file_base config variables and use the appropriate one for each derivative content type.

Each of these new variables should be set from corresponding environment variables.
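As a rough sketch of the lookup, assuming hypothetical content types and variable names (`IMAGE_TEMP_FILE_BASE`, `OCR_TEMP_FILE_BASE`, `PDF_TEMP_FILE_BASE`), none of which exist in Hydra::Derivatives or OD2 today. In the app the default would presumably be Hydra::Derivatives.temp_file_base; the sketch uses `Dir.tmpdir` so it runs on its own.

```ruby
require 'tmpdir'

# Return the temp file base for a derivative content type.
# The type-to-variable mapping is hypothetical and only for illustration.
def temp_file_base_for(type, default: Dir.tmpdir)
  env_vars = {
    image: 'IMAGE_TEMP_FILE_BASE',
    ocr: 'OCR_TEMP_FILE_BASE',
    pdf: 'PDF_TEMP_FILE_BASE'
  }
  var = env_vars[type.to_sym]
  # Unknown types and unset variables both fall back to the default.
  var ? ENV.fetch(var, default) : default
end
```

A call site such as the image `tmp_file` helper could then pass `temp_file_base_for(:image)` as the second argument to `Tempfile.new`.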

Are there other content types that need their own temp_file_base?

module OregonDigital::Derivatives::Image
  # Simple derivative utility functions
  class Utils
    class << self
      # Generates a temporary file and passes its path to the given block.  The
      # file is deleted at the end of execution
      def tmp_file(ext)
        f = Tempfile.new(['od2', ".#{ext}"], Hydra::Derivatives.temp_file_base)
        begin
          yield f.path
        ensure
          f.close
          f.unlink
        end
      end