alecazam commented 2 years ago

Trying to move from astc-encoder 2.5 to 3.4. Everything went pretty smoothly on the code update side. But ParallelManager class is declared inline in the astc_context, but it's contains a mutex that needs to be initialized properly. Also I'm only using a thread count of 1, so the ParallelManager likely gets little parallel usage.

None of these are initialized correctly.


#if !defined(ASTCENC_DECOMPRESS_ONLY)
    /** @brief The parallel manager for averages computation. */
    ParallelManager manage_avg;

    /** @brief The parallel manager for compression. */
    ParallelManager manage_compress;
#endif

         /** @brief The parallel manager for averages computation. */
    ParallelManager manage_decompress;

I only see the code to an aligned_malloc on astcContext, and no placement new.

Each ParallelManager has the following that aren't correctly initialized by a memclear or whatever is used now.

/** @brief Lock used for critical section and condition synchronization. */
    std::mutex m_lock;

/** @brief Contition variable for tracking stage processing completion. */
    std::condition_variable m_complete;

    /** @brief Number of tasks started, but not necessarily finished. */
    std::atomic<unsigned int> m_start_count;

This is just going to leave the Mutex invalid. Maybe other classes too don't run any initalizers.

astcenc_context* ctx = aligned_malloc<astcenc_context>(sizeof(astcenc_context), ASTCENC_VECALIGN);

On macOS, the code dies hard trying to reference this with an invalid data on the lock. This data structure needs to be a pointer, and allocated with new/delete or placement new to construct the std::mutex.

         // Wait for compute_averages to complete before compressing
    ctx->manage_avg.wait();  <-  dies here

alecazam commented 2 years ago

Also aligned_malloc really isn't needed anymore. Only Windows x86 still default aligns to 8-bytes, and I don't recall Linux. Android also defaults to 16-byte alignment. Apple adopted 16-byte alignment for all, even AVX and AVX2 which can align higher, but they can handle non-simd aligned loads (meaning 16 byte instead of 32 or higher for AVX-512). With SVE coming out too, would be good to adjust for that.

This avoids the need for all the vec align stuff.

alecazam commented 2 years ago

Also seem to be bugs like calling delete instead of aligned_free() on the ctx. But I'm just going to switch it to new, and use ASTCENC_VECALIGN of 16 (not 32). I'll end up keeping this delete, since I'm switching it to using new for the context. There's also a working_buffers object using aligned_malloc/free, but I don't think that has any classes in it.

if (status != ASTCENC_SUCCESS)
    {
        delete ctx; <- line 726, astcenc_entry.cpp
        return status;
    }

alecazam commented 2 years ago

Here's the changelist in kram for this if it helps. This fixed the crashes on the mutex using the context. I didn't test the decoder, but I'm sure it has similar crashes from the ParallelManager used there.

https://github.com/alecazam/kram/commit/5046bcac79aa51b74bbc0f4af80f9e80cee451ae

alecazam commented 2 years ago

Also seeing random block artifacts all over my color/albedo textures. My normal maps are etc2_rg, and don't use astc, so they're okay. I've got the ASTCENC_VECALIGN forced to 16 bytes, but I don't think that's the issue since this is arm64 on an M1. I may have to try with 3.4 instead of the bleeding edge encoder. v2.5 generated good output except at low values.

I'm running this on M1 macOS, so it's taking the arm64 path. I haven't looked at Intel side.

solidpixel commented 2 years ago

Thanks - will take a look at this one later. Surprised the mutex part here hasn't caused issues before - I've don't think I've touched that since 2.something.

For the image artefacts, do you have a complete repro you can share in terms of input texture and API config parameters?

solidpixel commented 2 years ago

Also aligned_malloc really isn't needed anymore. Only Windows x86 still default aligns to 8-bytes ...

Well, we still need to run on Windows.

Even if the instruction set supports it, aligned access is still normally faster than unaligned (often take a 1 cycle penalty if a load happens to span two cache lines).

alecazam commented 2 years ago

Windows also has clang, and x64 is already default 16-byte aligned. It's just x86 which is mostly phased out now that has the default 8-byte align, but most override global new/delete to 16-bytes there if doing SIMD. In my case, I'm using AVX, and all 128-bit. So I'm just using AVX for more registers, and don't want 32-byte alignment. Also Intel has disabled AVX-512 in Alder lake chips, so seems like a backpedal from larger and larger simd words there to reduce power consumption, and possibly devote more die to iGPU.

See simd/vector_types.h

In earlier versions of the simd library, the alignment of vectors could
 *  be larger than 16B, up to the "architectural vector size" of 16, 32, or
 *  64B, depending on what options were passed on the command line when
 *  compiling. This super-alignment does not interact well with malloc, and
 *  makes it difficult for libraries to provide a stable API, while conferring
 *  relatively little performance benefit, so it has been relaxed.

alecazam commented 2 years ago

This is a command line for kramc in my test project. This basically runs a 512x512 image down to 16x16 without mips. So it's 4x4 blocks.

I was running v2.5 prior on this same machine, and the images were correct. So maybe there's a setting I need to add, but it looks like something isn't reset properly, or the current code is taking a constant-block optimization that it shouldn't. Some blocks with artifacts though aren't just constant color, and have pixels partitioned or at least different colors but the final result is off.

encode -verbose -f astc4x4 -srgb -premul -quality 49 -mipmax 16 -mipmin 16 -type 2d -i /Users/Foo/kram/tests/src/brick01-d.png -o /Users/Foo/kram/tests/out/ios/brick01-d.ktx

This is the rgb, with the obvious defect. This is a red-brick image with grout.

Seems like constant blocks are sometimes used where they shouldn't be. This is all solid red when I view the red channel only in kramv. That seems to cause the discoloration on this test file.

solidpixel commented 2 years ago

Patch pushed to swap the context back to using new without any alignment requirements - now just use aligned_malloc() internally on a dynamically allocated block_size_descriptor instead. This solves the mutex initializer problem too.

alecazam commented 2 years ago

Synced to latest. Thanks for incorporating that. I'll try to run my images in your cli app tomorrow. It's getting late here, and I've been diving for a while into all this.

solidpixel commented 2 years ago

I've split the image failures off to a new issue (#316). Reproduced failures on NEON locally, when the x86_64 builds work fine, so this looks NEON specific.

solidpixel commented 2 years ago

OK - identified and fixed one issue in the NEON vector library (see the other issue for details). Can you update and give it a go?

alecazam commented 2 years ago

That fixed it, I'll respond on other thread. Also might want to have a non-mutex and non-condition_var path for the single-threaded path. My understanding is that these constructs can get threads swapped out, and sometimes spin-lock to avoid losing the thread. Feel free to close.

I also rarely use unique_lock mutexes, since you can't call through to functions that also lock or get a livelock easily. I redefine mutex and lock to the recursive variant, even though it has slightly more cost.

You also probably have a bigger corpus of images and timings but you might try the ASTCENC_VECALIGN set to 16 always vs. 32, and see if it really makes a perf difference. I haven't tested my Intel build with that yet. My work texture set is much larger, so I can time that this week.

//#if ASTCENC_AVX // #define ASTCENC_VECALIGN 32 //#else

define ASTCENC_VECALIGN 16

//#endif

solidpixel commented 2 years ago

The perimeter for this one is pretty well defined, with locks only on those outer trackers and only held to check-out or check-in a batch of work, so I don't have too many concerns about recursive access patterns causing problems.

Will take a look at skipping those if context thread-count is 1, shouldn't be too hard to add. Last time I tried the performance overhead of them was neglible though (mostly it's going to be an uncontended futex - the compression work in a task far exceeds the time in the critical section).

solidpixel commented 2 years ago

Will also give the vector alignment change a go.

Based on the AVX2 intrinsics guide, the aligned loads (e.g. _mm256_load_ps) still need 32 byte alignment, so try this will take some code mods to swap those for unaligned loads.

solidpixel commented 2 years ago

Closing this one, as the bug is fixed.

ARM-software / astc-encoder

Issues with mutex in ParallelManager #315

define ASTCENC_VECALIGN 16