Implementing CPU Support for Comprimato J2K Compress and Decompress

This pull request implements CPU compression and decompression for the Comprimato J2K Codec.

In the interest of leaving the current implementation unchanged for users already using Comprimato with CUDA, CUDA is still the default platform selected for compression and decompression when CUDA libraries are detected during UltraGrid compilation. This means that -c cmpto_j2k will default to using CUDA for compression. A system decompressing a cmpto_j2k stream will default to using CUDA for decompression. This has been guaranteed by the use of #ifdef HAVE_CUDA directives.

If UltraGrid is compiled without CUDA, -c cmpto_j2k will default to using CPU for compression. A system compiled without CUDA will also default to CPU for decompressing a cmpto_j2k stream.

To allow for the selection of either CUDA or CPU, a new :platform= option has been included when CUDA libraries are detected and UltraGrid is compiled. If no CUDA libraries are detected at compile time (ex. MacOS), the :platform= option will be removed and CPU will be the assumed default with no option for the user to change.

Platform default (CUDA or CPU) is determined at compile time using #ifdef HAVE_CUDA directives in the state_video_compress_j2k and state_video_decompress_j2k structs.

Additions to Compression Options and Decompression Parameters

To support CPU usage for compression and decompression, additional options and parameters have been added.

J2K Compression Options Syntax | CUDA

-c cmpto_j2k:platform=cuda[:mem_limit=<m>][:tile_limit=<t>][:rate=<r>][:lossless][:quality=<q>][:pool_size=<p>][:mct] [--cuda-device <c_index>]

J2K Compression Options Syntax | CPU

-c cmpto_j2k:platform=cpu[:thread_count=<t>][:img_limit=<i>][:rate=<r>][:lossless][:quality=<q>][:pool_size=<p>][:mct]

New J2K Compression Options:

:platform=<cuda,cpu>

Option will only be available if CUDA libraries are compiled.
Current options include CUDA or CPU.
CUDA is default selected platform if compiled with CUDA. Calling -c cmpto_j2k will imply -c cmpto_j2k:platform=cuda. Calling -c cmpto_j2k:platform=cpu will explicitly select CPU compression over CUDA.

:thread_count=

Threads to use on the CPU. Default is CMPTO_J2K_ENC_CPU_DEFAULT (0), which is threads equal to all cores.
Comprimato defines a range 0-64, but system hardware may limit 64 to lower max. CTX Create will throw error if there is one and be caught by CHECK_OK() during bool initialize_j2k_enc_ctx()

:img_limit=

Number of images compressed by CPU at once. Default is 0, which matches CMTPO default.
Comprimato defines a range 0-64, but system hardware may limit 64 to lower max. CTX Create will throw error if there is one and be caught by CHECK_OK() during bool initialize_j2k_enc_ctx()

:lossless

Enable lossless encoding

New J2K Decompression params:

j2k-dec-cpu-thread-count=

Threads to use on the CPU. Default is CMPTO_J2K_DEC_CPU_DEFAULT (0), which is threads equal to all cores.
Comprimato defines a max range 0-64, but system hardware may limit 64 to lower max. CTX will throw error if there is one and be caught by CHECK_OK() during bool initialize_j2k_dec_ctx().

j2k-dec-img-limit=

Default number of images to be decompressed by the CPU. Default is 0, which is the same as CMPTO DEFAULT
img-limit cannot exceed thread-count, unless thread-count=0. Checks for this are done by bool initialize_j2k_dec_ctx() and will set img-limit = cpu-thread-count if exceeded.
Comprimato defines a max range 0-64, but system hardware may limit 64 to lower max. CTX will throw error if there is one and be caught by CHECK_OK() during bool initialize_j2k_dec_ctx().

j2k-dec-use-cpu

Param will only be available if CUDA libraries are compiled.
Explicitly use the CPU to decompress images.

j2k-dec-use-cuda

Param will only be available if CUDA libraries are compiled.
Explicitly use CUDA to decompress images.

`-c cmpto_j2k:help` print out when CUDA is Present

[Cmpto J2K enc.] Using Codec version: <INFO>
J2K compress platform support:
    CPU .... yes
    CUDA ... yes
J2K compress usage:
    -c cmpto_j2k:platform=cuda[:mem_limit=<m>][:tile_limit=<t>][:rate=<r>][:quality=<q>][:pool_size=<p>][:mct][:lossless] [--cuda-device <c_index>]
    -c cmpto_j2k:platform=cpu[:thread_count=<t>][:img_limit=<i>][:rate=<r>][:quality=<q>][:pool_size=<p>][:mct][:lossless]
where:
    <p> - Platform device for the encoder to use, default: cuda
    <m> - CUDA device memory limit (in bytes), default: 1000000000ULL
    <t> - Number of tiles encoded at one moment by GPU (less to reduce latency, more to increase performance, 0 means infinity). default: 1
    <c_index> - CUDA device(s) to use (comma separated)
    <t> - Number of threads to use on the CPU. 0 is all available. default: 0
    <i> - Number of images which can be encoded at one moment by CPU. Maximum allowed limit is thread_count. 0 is default limit. default: 0
    <r> - Target bitrate
    <q> - Quality in range [0-1]. default: 0.7
    <p> - Total number of frames encoder can hold at one moment. Should be greater than tile_limit or img_limit. default: 4
    mct - Use MCT
    lossless - Enable lossless compression. default: disabled

`-c cmpto_j2k:help` print out when only CPU is Present

[Cmpto J2K enc.] Using Codec version: <INFO>
J2K compress platform support:
    CPU .... yes
    CUDA ... no
J2K compress usage:
    -c cmpto_j2k[:thread_count=<t>][:img_limit=<i>][:rate=<r>][:quality=<q>][:pool_size=<p>][:mct][:lossless]
where:
    <t> - Number of threads to use on the CPU. 0 is all available. default: 0
    <i> - Number of images which can be encoded at one moment by CPU. Maximum allowed limit is thread_count. 0 is default limit. default: 0
    <r> - Target bitrate
    <q> - Quality in range [0-1]. default: 0.7
    <p> - Total number of frames encoder can hold at one moment. Should be greater than img_limit. default: 8
    mct - Use MCT
    lossless - Enable lossless compression. default: disabled
Exit

`--param help` when CUDA is Present

* j2k-dec-use-cuda
  use CUDA to decode images

* j2k-dec-mem-limit=<limit>
  J2K max memory usage in bytes.

* j2k-dec-tile-limit=<limit>
  number of tiles decoded at moment (less to reduce latency, more to increase performance, 0 unlimited)

* j2k-dec-use-cpu
  use the CPU to decode images

* j2k-dec-cpu-thread-count=<threads>
  number of threads to use on the CPU (0 means number of threads equal to all cores)

* j2k-dec-img-limit=<limit>
  number of images which can be decoded at one moment (0 means default, thread-count is maximum limit)

* j2k-dec-queue-len=<len>
  max queue len

* j2k-dec-encoder-queue=<len>
  max number of frames held by encoder

`--param help` when only CPU is Present

* j2k-dec-cpu-thread-count=<threads>
  number of threads to use on the CPU (0 means number of threads equal to all cores)

* j2k-dec-img-limit=<limit>
  number of images which can be decoded at one moment (0 means default, thread-count is maximum limit)

* j2k-dec-queue-len=<len>
  max queue len

* j2k-dec-encoder-queue=<len>
  max number of frames held by encoder

Changes to `state_video_compress_j2k` and `state_video_decompress_j2k` construction and struct members

In addition to adding CPU support for compression and decompression, I've made changes to the way state_video)compress_j2k and state_video_decompress_j2k classes are constructed, throwing exceptions if they are unable to be initialized. This has removed the need for goto statements and explicit delete of the state_j2ks by the j2k_compress_init() and j2k_decompress_init() functions.

Parsing of compression options have been offloaded to private member, void parse_fmt(const char*) in struct state_video_compress_j2k.

Parsing of decompression parameters has been offloaded to private member void parse_params() in struct state_video_decompress_j2k.

If help is called during -c cmpto_j2k:help, the HelpRequested() exception will throw and be caught by j2k_compress_init(), resulting in static_cast<module*>(INIT_NOERR)

Once parsing has completed, private members initialize_j2k_enc_ctx() and initialize_j2k_dec_ctx() are called and attempt to complete the final initialization of the J2K context.

Any errors during this process will result in a thrown exception that will be caught by j2k_compress_init() and j2k_decompress_init(), resulting in a NULL return.

Compress Exceptions Include:

HelpRequested()
InvalidArgument()
UnableToCreateJ2KEncoderCTX()

Decompress Exceptions Include:

UnableToCreateJ2KDecoderCTX()

To account for differences in the video_frame_pool initialization when using CPU and CUDA platforms during cmpto_j2k compression, the video_frame_pool pool has been changed to std::unique_ptr<video_frame_pool> pool to allow for reconfiguration during state_video_compress_j2k construction.

Changes to `j2k_compress_init` and `j2k_decompress_init`

Since the j2k_compress and j2k_decompress constructors will throw if there is an error with any of the options or parameters, _init functions are responsible for catching those errors and returning what is needed by the module.

New j2k_compress_init implementation

static struct module * j2k_compress_init(struct module *parent, const char *opts) {
        try {
                auto *s = new state_video_compress_j2k(parent, opts);
                return &s->module_data;
        } catch (HelpRequested const& e) {
                return static_cast<module*>(INIT_NOERR);
        } catch (InvalidArgument const& e) {
                return NULL;
        } catch (UnableToCreateJ2KEncoderCTX const& e) {
                return NULL;
        } catch (...) {
                return NULL;
        }
}

New j2k_decompress_init implementation

static void * j2k_decompress_init(void) {
        try {
                auto *s = new state_video_decompress_j2k();
                return s;
        } catch (...) {
                return NULL;
        }
}

Additions of enums and structs to allow easy selection of platform to use and future expansion

enum j2k_compress_platform and enum j2k_decompress_platform have been created to allow easy selection of the platform to use. These options currently include:

NONE
CPU
CUDA

enum j2k_compress_platform {
        NONE = 0,
        CPU = 1,
#ifdef HAVE_CUDA
        CUDA = 2,
#endif // HAVE_CUDA
};

enum j2k_decompress_platform {
        NONE = 0,
        CPU = 1,
#ifdef HAVE_CUDA
        CUDA = 2,
#endif // HAVE_CUDA
};

struct j2k_compress_platform_info_t has been created to hold a friendly name (ex. "cpu") and its corresponding j2k_compress_platform type.

struct j2k_compress_platform_info_t {
    const char* name;
    j2k_compress_platform platform;
};

auto cpu = j2k_compress_platform_info_t{"cpu", j2k_compress_platform::cpu};

The function j2k_compress_platform get_platform_from_name(std::string) has also been added with the purpose of searching constexpr auto compress_platforms = array{}; for friendly names and returning an associated platform.

// Supported Platforms for Compressing J2K
constexpr auto compress_platforms = std::array {
    j2k_compress_platform_info_t{"none", j2k_compress_platform::NONE},
    j2k_compress_platform_info_t{"cpu", j2k_compress_platform::CPU},
#ifdef HAVE_CUDA
    j2k_compress_platform_info_t{"cuda", j2k_compress_platform::CUDA}
#endif
};

static j2k_compress_platform get_platform_from_name(std::string name) {
    std::transform(name.cbegin(), name.cend(), name.begin(), [](unsigned char c) { return std::tolower(c); });

    auto matches = [&name](const auto& p) { return name.compare(p.name) == 0; };

    if (const auto& it = std::find_if(compress_platforms.begin(), compress_platforms.end(), matches) ; it != compress_platforms.end()) {
        return it->platform;
    }

    return j2k_compress_platform::NONE;
}

These enums and struct are built with the consideration that OpenCL, since it's also supported by Comprimato, can be implemented in the future.

General Changes

Previously anonymous structs containing information about the ugcodec and associated CMPTO codec have been named struct Codec and codec information for both compress and decompress have been changed to constexpr arrays.


// example from video_compress/cmpto_j2k.cpp
struct Codec {
    codec_t ug_codec;
    enum cmpto_sample_format_type cmpto_sf;
    void (*convert)(unsigned char *dst_buffer, unsigned char *src_buffer, unsigned int width, unsigned int height);
};

constexpr auto codecs = std::array{ Codec{UYVY, CMPTO_422_U8_P1020, nullptr}, Codec{v210, CMPTO_422_U10_V210, nullptr}, Codec{RGB, CMPTO_444_U8_P012, nullptr}, Codec{BGR, CMPTO_444_U8_P210, nullptr}, Codec{RGBA, CMPTO_444_U8_P012Z, nullptr}, Codec{R10k, CMPTO_444_U10U10U10_MSB32BE_P210, nullptr}, Codec{R12L, CMPTO_444_U12_MSB16LE_P012, rg48_to_r12l}, };


* The `<algorithm>` header has been included and `find_if` is now used to search for matching codec rather than range-based for loop with if statements in `j2k_decompress_reconfigure()` and `configure_with()`.
```C++
auto matches = [&](const Codec& codec) { return codec.ug_codec == desc.color_spec; };

if (const auto& codec = std::find_if(codecs.begin(), codecs.end(), matches) ; codec != codecs.end()) {
    /// rest of code
}

Standardized on using log_msg() for message reporting, rather than MSG()

log_msg(LOG_LEVEL_ERROR, "%s Failed to find suitable pixel format\n", MOD_NAME);

Removed some C-style casts and replaced with static_cast<T>
Made some #define statements constexpr
Removed some using directives

Refactored void print_dropped to be platform-aware in reporting hints

void print_dropped(unsigned long long int dropped, const j2k_decompress_platform& platform);

Moved #include, #define, constexpr, structs, functions, predeclarations around.
struct Opts created instead of using anonymous struct and split into General, CUDA, and CPU options. Availability of these options are determined with `#ifdef HAVE_CUDA# directives.

Testing

Testing of this implementation has been done on:

Ubuntu 22.04LTS using Comprimato SDK 2.8.0 and 2.8.1
MacBook Pro (M1), running MacOS 14.4 using Comprimato SDK v2.8.0.

Hi, I would have some remarks to the pull request.

Most importantly, the pull request is a bit huge compared to functional changes - it almost rewrite both files (2/3 of code is touched), even the unrelated parts of code. Also the commits can be easily squashed into one, because there is first big commit and then just smaller changes of this one.

As there are together functional changes, refactoring and code movement, it is really hard to check what has been exactly changed, so it almost cannot be audited for regressions etc. (I personally usually separate even refactoring and functional commits.) Would it be possible to reduce the extent? You can look (or use) eg. this commit - it modifies your changes in a way that it at least leaves the code in place and reverts few unrelated changes. The advantage should be obvious if you look at git diff, that your changes are more readable and can be reviewed.

For me will be acceptable if you'd squash all to a single commit (including the suggested changes). If you'd then like to do some further changes, I'd perhaps prefer separate pull requests. I'll sum up my opinions about the non-essential changes:

struct Codec - actually I don't see a reason for creating a named type, except than for also std::array. Well, slightly ugly thing about array is that in order to template arguments to be deduced, the aggregate type name must be usually given explicitly in the initializer list
find_if - why not, on the other hand, there is no big advantage of the change
log_msg - well, as for functional change, I'll perhaps see as more important to consistently use the module prefix string; I don't see much bad about MSG() - it actually enforces the use of that prefix
c-style casts - well, maybe, but it is not worth doing the change
moving the stuff around - well, from my perspective definitely not if done "by the way" mixed with other changes and not done in a separate commit, because it hides the actual changes done
remove using directives - well maybe, I'd agree with using whole std namespace but actually I see using std::something rather as a matter of opinion than a problem that should be fixed

Last, I see problematic the conditional compilation with HAVE_CUDA - in the original code the only CUDA depending code is the CUDA host allocator. Actually the proposed pull request requires CUDA toolkit for the CUDA codec variant to be used. That was actually not required before - without CUDA, the codec worked (just the buffers was not in the DMA transfer capable region). I don't know exactly our uses' use cases, but I can imagine that someone compiles UG without having the CUDA toolkit - it may not introduce much performance penalty. But don't worry about this particular case, I can fix this after the merge by myself.

Hi @MartinPulec!

Thanks for taking the time to review this pull request. This is my first one, so my apologies if it's a bit unconventional. I'll be sure to keep your notes in mind, regarding separating functional and refactoring changes, as well as squashing minor changes into one commit, for the next PR.

I read through your commit and agree, it's significantly more readable and rolls back some of the non-essential changes. I'm happy to use that as my starting point to re-review, test, and submit if you'd prefer I do that.

I had one question that I wanted your input on since I was going back-and-forth with myself about it. Do you agree with :platform=<cpu,cuda> or should it be named something else? I'm not tied to the platform keyword, but it's what made sense to me when I was testing out how best to name and implement.

Regarding the HAVE_CUDA conditional, thanks for catching that. Looking back, all of my testing on Linux used systems with the CUDA toolkit, so I missed that part of my testing. If you'd prefer to fix that after the merge, I can leave all those conditionals as-is.

In terms of the non-essential changes, here's some of the reasoning with why I made those changes. I'm happy to submit separate pull requests if you think any of them really make much of a difference -- don't want to waste your time submitting non-important PRs.

find_if - No specific reason other than I thought easier to read than the range-based for loop when combined with a lambda. Agreed, not particularly important!
log_msg - I used libavcodec.cpp in compress and decompress as my reference since it was the most recently updated at the time of me working on this PR. I noticed that log_msg was used more than MSG(), so I standardized on that. I'm not particularly tied to this change, just wanted to try and use something that seemed a bit more recent and standardized. Using %s for the module prefix is nice because it reduced chances I'd mistype and put [CMPTO J2K enc] instead of [CMPTO J2K dec], but I can definitely see benefits to enforcing the prefix usage.
using std::* - Other than the namespace std, there was no reason I removed these other than it being a matter of opinion. Not particular important either!

It is totally fine, I believe that your code is great, but the only problem is really that I want to leave the things traceable, for which it is good to separate functional and non-functional changes, I believe. For the other changes it depends - some are ones that I'd agree, the others I would as well, but I would change it just when changing that code anyways.

I had one question that I wanted your input on since I was going back-and-forth with myself about it. Do you agree with :platform=<cpu,cuda> or should it be named something else? I'm not tied to the platform keyword, but it's what made sense to me when I was testing out how best to name and implement.

I believe that the keyword platform is totally fine. I've just looked into the headers and at Comprimato, they denote this as "technology" (CMPTOTECHNOLOGY{CPU,CUDA}. But both words sound more or less equivalent in this content.

Regarding the HAVE_CUDA conditional, thanks for catching that. Looking back, all of my testing on Linux used systems with the CUDA toolkit, so I missed that part of my testing. If you'd prefer to fix that after the merge, I can leave all those conditionals as-is.

I have no problem doing it by myself but I don't insist on that. Anyways, if you decide to do it by yourself, you can look at the above code comment.

find_if

Sure - I've no particular objections against that. If you wish, I have no problem but it will be nicer if it was in a separate commit. Anyways, just in a case, it doesn't need the std::array necessarily, std::begin/std::end to with the C-array as the argument would work as well.

log_msg - I used libavcodec.cpp in compress and decompress as my reference since it was the most recently updated at the time of me working on this PR. I noticed that log_msg was used more than MSG(), so I standardized on that. I'm not particularly tied to this change, just wanted to try and use something that seemed a bit more recent and standardized. Using %s for the module prefix is nice because it reduced chances I'd mistype and put [CMPTO J2K enc] instead of [CMPTO J2K dec], but I can definitely see benefits to enforcing the prefix usage.

Ok, well, no serious problem with this, indeed. But just a small explanation - I've actually started using MSG() later on - it is actually a macro and if you look at its definition, it actually enforces the module name implicitly. ~~On the other hand, it also forces the MOD_NAME to be the C string literal.~~ fixed

I understand that using the macro(s) is slightly controversial in C++ (even these MOD_NAME, which clang-tidy wants to be replaced with a constexpr var), so I don't insist on that, either.

using std::* - Other than the namespace std, there was no reason I removed these other than it being a matter of opinion. Not particular important either!

Sure, I've consulted it with a colleague on Friday and he also advocated the use of the identifiers including the std namespace. We may consult it later to unify the style but it just haven't been so far (UltraGrid is written in C historically and the C++ coding standards are not much defined).

To sum up the points 1-3, as there is no rule defined, feel free to choose. I also understand that you may want to make this consistent across the file. Provided that it will be in a separate commit, we'd accept it.

Hi @MartinPulec!

I've made the requested changes, if you'd like to review them and provide feedback. Here's a summary of what has been done.

Squashed all previous commits to Initial cmpto implementation
Pulled your commit into my branch and used that as a starting point for all changes moving forward.
Implemented cmpto_version::technology checks via a new function bool supports_cmpto_technology(int) and replaced #ifdef HAVE_CUDA directives in all cases except for cmpto_j2k_enc_cuda_host_buffer_data_allocator
Standardized on using MSG() instead of log_msg() for message reporting in most cases.
Renamed using allocator = in video_compress file to using cuda_allocator = to match using cpu_allocator = naming and to be explicit about what type of allocator is being used when std::unique_ptr<video_frame_pool> is initialized.

One thing I noticed during the MSG() refactor is that fmt is using "%s", which means that MSG() calls will require a space at the head of the fmt string to print a space between MOD_NAME and the rest of the arguments. Should I add space to the head of all the MSG() calls, or is there a chance "%s" will be changed to "%s " in the MSG() call?

MSG(INFO, "no space at head") = [Cmpto J2K enc.]no space at head
MSG(INFO, " space at head") = [Cmpto J2K enc.] space at head

Thanks!

CESNET / UltraGrid