Open decimalator opened 2 years ago
I would suggest we move to a per-content-type temporary file base directory
to allow us more flexibility.
With derivative generation we want temporary files to be as fast as possible, so they should be written to a RAM disk where we can. We should investigate checking the size of the source file to decide where to write temporary files. If the source file is below a given size, temporary files should be written to a RAM disk. If it is bigger than N
bytes, we'll have to write the temporary files to a local disk to avoid filling the RAM disk and exhausting the node's RAM.
/ramdisk/video
/local/video
/ramdisk/audio
/local/audio
/ramdisk/ocr
/local/ocr
/ramdisk/general
/local/general
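To make the size-based choice concrete, here is a rough sketch of how a directory from the layout above could be picked. The module name, method name, and the 64 MiB default are all hypothetical; the real threshold would have to come from benchmarking, and a file exactly at the threshold is treated as small here.

```ruby
# Hypothetical sketch: choose a temp directory from content type and source size.
module TempDirSelector
  # Assumed placeholder threshold (64 MiB), not a measured value.
  DEFAULT_MAX_RAMDISK_BYTES = 64 * 1024 * 1024

  # Content types taken from the directory layout proposed in this issue.
  CONTENT_TYPES = %w[video audio ocr general].freeze

  # Returns e.g. "/ramdisk/video" for small sources and "/local/video" for
  # large ones. Unknown content types fall back to "general".
  def self.base_dir_for(content_type, source_path,
                        max_ramdisk_bytes: DEFAULT_MAX_RAMDISK_BYTES,
                        ramdisk_root: '/ramdisk', local_root: '/local')
    type = CONTENT_TYPES.include?(content_type.to_s) ? content_type.to_s : 'general'
    root = File.size(source_path) <= max_ramdisk_bytes ? ramdisk_root : local_root
    File.join(root, type)
  end
end
```

The method only builds a path from the source file's size; it does not check how much space is left on the RAM disk, which we may also want to consider.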
We can add a Memory type emptyDir volume (/ramdisk) for each worker. When the Pod is started, the RAM disk gets created and mounted at the directory we give it. If the container restarts or crashes, the data will persist, but when we boot a new version of the container or move it between nodes, the data is lost. This is fine for temporary files; it's even a helpful feature that keeps orphaned temporary files from accumulating. We'll need to define a threshold file size below which the RAM disk is used.
For larger items, we'll have to write to local disk. We'll investigate enhancements to local container storage to make those writes as fast as possible.
We have Hydra::Derivatives.temp_file_base. We should add some additional temp_file_base config variables and use the appropriate temp_file_base location depending on the derivative content type.
Each of these new variables should be set from a corresponding environment variable.
Are there other content types that need their own temp_file_base?
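As a starting point for discussion, the per-content-type settings might look something like the sketch below. The module name and the environment variable names are made up for illustration. In the app the fallback would presumably be Hydra::Derivatives.temp_file_base; the sketch falls back to Dir.tmpdir so it runs on its own.

```ruby
require 'tmpdir'

# Hypothetical per-content-type temp_file_base settings read from the environment.
module DerivativeTempConfig
  # Environment variable names are assumptions, not existing settings.
  ENV_VARS = {
    video: 'TEMP_FILE_BASE_VIDEO',
    audio: 'TEMP_FILE_BASE_AUDIO',
    ocr: 'TEMP_FILE_BASE_OCR',
    general: 'TEMP_FILE_BASE_GENERAL'
  }.freeze

  # Returns the configured directory for the content type. Unknown content
  # types use the "general" variable; an unset or empty variable yields the default.
  def self.temp_file_base(content_type, env: ENV, default: Dir.tmpdir)
    key = ENV_VARS.fetch(content_type.to_sym, ENV_VARS[:general])
    value = env[key]
    value.nil? || value.empty? ? default : value
  end
end
```

A call such as DerivativeTempConfig.temp_file_base(:video) could then stand in wherever Hydra::Derivatives.temp_file_base is passed to Tempfile.new today.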
require 'tempfile'

module OregonDigital::Derivatives::Image
  # Simple derivative utility functions
  class Utils
    class << self
      # Generates a temporary file and passes its path to the given block. The
      # file is deleted at the end of execution
      def tmp_file(ext)
        f = Tempfile.new(['od2', ".#{ext}"], Hydra::Derivatives.temp_file_base)
        begin
          yield f.path
        ensure
          f.close
          f.unlink
        end
      end
    end
  end
end
Descriptive summary
Derivative jobs that run Tesseract create their temporary files without specifying a path, causing them to be created in the CWD of the app (/data).
Defined in:
oregon_digital/hocr_derivative_service.rb
Tempfile.new()
Output files from Tesseract:
Expected behavior
Temporary files should be created in a standard location. For better performance we should consider writing these to a RAM disk. The directory to be used for each derivative type should be configurable through environment variables at runtime.
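One possible shape for this, sketched under assumptions: give each OCR run its own scratch directory under a configurable base, so both the Tempfile and any Tesseract output files land there and are removed together. The helper name and the OCR_TEMP_FILE_BASE variable are hypothetical, and this is not the current code in hocr_derivative_service.rb.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: yields a scratch directory for one OCR run and removes
# it (with everything inside) when the block finishes, even on error.
# OCR_TEMP_FILE_BASE is an assumed variable name; Dir.tmpdir is the fallback.
def with_ocr_workdir(base: ENV.fetch('OCR_TEMP_FILE_BASE', Dir.tmpdir))
  FileUtils.mkdir_p(base)
  Dir.mktmpdir('od2-ocr', base) do |dir|
    yield dir
  end
end
```

Inside the block, the job could hand Tesseract an output base such as File.join(dir, 'page') instead of relying on the working directory.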