Depending on how well we can handle the weird 12-bit format and the shuffling of data blocks, we may need to convert to a sensible format (HDF5, some kind of proto-NeXus?) before processing.
I got something slow but seemingly correct working. Still need to figure out the detailed guarantees of the format, and implement appropriate optimizations. Then, integrate into LiberTEM.
@sk1p Here are some CPU instructions that might help to make it fast:
General: https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets
Testing if the CPU has it: https://stackoverflow.com/questions/44827477/does-avx-support-imply-bmi1-support
AVX-512: https://software.intel.com/en-us/node/523829
BMI2: https://www.felixcloutier.com/x86/PEXT.html
BMI1: https://www.felixcloutier.com/x86/BEXTR.html
What would be better, using the intrinsics directly or writing it in pure C/C++ and checking if the compiler actually generates efficient machine code?
A second point: efficiently unpacking n-bit (u)ints packed into a bitfield to 8-bit, 16-bit, 32-bit, or 64-bit (u)ints could be a useful function for numpy and such. Apparently it doesn't exist yet, at least I couldn't find it. A quick Google search only found https://stackoverflow.com/questions/37797779/python-reading-12-bit-packed-binary-image which looks horrendously slow.
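For reference, a minimal numpy sketch of the usual 3-bytes-to-2-values approach (assuming little-endian packing with the low 12 bits first; whether that matches the K2 layout still needs to be confirmed):

import numpy as np

def unpack_uint12_le(buf):
    # every 3 bytes hold two 12-bit values, low bits first;
    # the input length must be a multiple of 3 bytes
    b = np.frombuffer(buf, dtype=np.uint8).astype(np.uint16).reshape(-1, 3)
    out = np.empty((b.shape[0], 2), dtype=np.uint16)
    out[:, 0] = b[:, 0] | ((b[:, 1] & 0x0F) << 8)
    out[:, 1] = (b[:, 1] >> 4) | (b[:, 2] << 4)
    return out.reshape(-1)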
How fast is this one? https://pythonhosted.org/bitstring/constbitarray.html#bitstring.Bits.unpack
Reading packed integers with Python doesn't seem to be a frequent task; I also couldn't find much about it. I'll be looking at the performance of different techniques; my current list:
Just for discussion, an entirely untested piece of code that compiles nicely to SIMD instructions when optimized: https://godbolt.org/g/yMpb8X
Conclusion: ingesting 8-byte chunks and processing them apparently allows the compiler to vectorize the inner loop. Overlapping parts and offsets are carried along. Since the number of iterations until input and output are aligned again can be precalculated, there is no "if" statement in the hot path, just an inner loop with a precalculated iteration count and a small piece of code that handles the remaining data. This apparently helps the optimizer to generate compact machine code and use vector instructions.
TODO actually test this and compare it to other solutions. I don't have a build chain for C at the moment.
#include <inttypes.h>
#include <string.h>

#define CHUNK_T uint64_t

// Decode packed 12-bit unsigned values into uint16_t.
// Assumes little-endian packing: the first value occupies the low 12 bits
// of the first three bytes read as a little-endian integer.
// NOTE: still untested against real K2 data; the bit order may differ.
void decode_uint12(int size, const char* restrict inp, uint16_t* restrict out) {
    const int bytes = sizeof(CHUNK_T);
    const int bits = 12;
    const CHUNK_T mask = ((CHUNK_T)1 << bits) - 1;  // 0xFFF
    // After 3 chunks (24 bytes = 192 bits = 16 values) the bit offset is
    // zero again, so the outer loop works on aligned 24-byte groups.
    // In general: stride = bits / gcd(bytes * 8, bits).
    const int stride = 3;
    int o = 0;
    for (int i = 0; i + bytes * stride <= size; i += bytes * stride) {
        CHUNK_T rem = 0;   // leftover bits carried over from the previous chunk
        int rem_bits = 0;  // number of valid bits in rem
        for (int s = 0; s < stride; s++) {
            CHUNK_T tmp;
            // memcpy avoids unaligned access and strict-aliasing problems;
            // compilers reduce this to a single load
            memcpy(&tmp, inp + i + s * bytes, bytes);
            if (rem_bits > 0) {
                // complete the value that straddles the chunk boundary
                out[o] = (uint16_t)((rem | (tmp << rem_bits)) & mask);
                o += 1;
                tmp >>= bits - rem_bits;
            }
            // bits of this chunk that are still unconsumed
            const int avail = bytes * 8 - (rem_bits > 0 ? bits - rem_bits : 0);
            // values that fit entirely into the remaining bits
            for (int j = 0; j < avail / bits; j += 1) {
                out[o] = (uint16_t)(tmp & mask);
                o += 1;
                tmp >>= bits;
            }
            rem = tmp;
            rem_bits = avail % bits;
        }
        // rem_bits == 0 here: a group of `stride` chunks decodes into a
        // whole number of values, so nothing is carried across groups.
    }
    // TODO handle unaligned rest of input here
}
There are some prototype notebooks available at https://github.com/LiberTEM/LiberTEM/tree/master/prototypes/k2/notebook - note that the resulting HDF5 file of the conversion notebook is a 3D dataset, which LiberTEM can't currently read as-is. Also, the large frame size revealed some rather comically grave bugs, which I'm now in the process of fixing.
Is my understanding correct that data blocks are not sorted at all?
@woozey as I understand it, one ordering guarantee is that the frame id is monotonic (modulo the "garbage" blocks at the beginning), that is, data blocks that belong to one frame are not "interleaved" with data blocks from different frames.
Other than that, maybe have a look at the pixel_x_start and pixel_y_start values; maybe there is a fixed pattern, but in the spec there is no explicit guarantee, as far as I can see.
With reference to https://github.com/LiberTEM/LiberTEM/blob/master/prototypes/k2/notebook/K2%20IS%20v2.ipynb:
fig, axes = plt.subplots()
axes.imshow(frame_res, cmap="gist_earth", vmin=10., vmax=20.)
this shows that the segment [:, 1920:] always has a fixed value of 16. This part of the frame can be cropped away.
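For example, a minimal crop (a sketch, using frame_res from the notebook above):

frame_cropped = frame_res[:, :1920]  # drop the columns that only contain the fixed value 16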
True, there are also supposed to be 4 rows at the beginning that contain non-image data, but I didn't find evidence of that yet.
> there are also supposed to be 4 rows at the beginning that contain non-image data, but I didn't find evidence of that yet.
me neither, but I'll check the other data sets and will let you know if I find something
And there we are:
So everything within [1796:, :1919] seems to be just readout noise. GMS outputs 1792x1920 images. In any case, I wouldn't hardcode the dimensions of the output images, as I can imagine that this noise area may be useful for calibration of the readout channels...
@sk1p in case you want to have a closer look at this dataset, it is located in /storage/holo/Vadim/2017-05-08_IS_tests/06_holo/holo3
I haven't found anything at the beginning, though...
Nice! I wonder if the spec is wrong, or maybe we are looking at the wrong spec: the spec says the version should be 2, but in the dataset I looked at it was 1.
In the .gtg file, there are some related values:
'Inter-Frame Row Padding': 66,
'Pre-Frame Row Padding': 4,
So even in 400 fps mode, there is this 66 px padding for each super-frame at the bottom, as specified for the "faster" acquisition.
There is also this:
'Padded Frame Count': 10,
Together with this...
In [24]: reader.tags_dict['SI Dimensions']
Out[24]: {'Size X': 35, 'Size Y': 34}
... the total frame count at the end should be 35 * 34 + 10 = 1200.
As a short aside: this is how I opened the gtg file with hyperspy:
In [8]: path = "/home/clausen/Data/K2-IS/Capture52/Capture52_.gtg"
In [9]: from hyperspy.io_plugins.digital_micrograph import DigitalMicrographReader
In [10]: f = open(path, "rb")
In [11]: reader = DigitalMicrographReader(f)
In [12]: reader.parse_file()
In [13]: reader.tags_dict
{'root': {},
'Acquisition': {'Device': {'Active Size (pixels)': (9.486e-321, 8.854e-321),
'Camera Number': 1,
[...]
Btw. the conversion notebook is not correct yet due to different frame starting offsets per sector not being taken into account:
At least that's what I think is happening here.
There is now experimental support for loading K2IS data. Experimental means, among other things, that scan dimensions are not yet read from the .gtg files, so you need to specify them manually. To use it, just select any of the .bin files in the file browser that belong to your dataset.
@woozey if you want to test on moellenstedt, make sure to start LiberTEM with the environment variables OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 to disable native threading of the BLAS libraries; and pip install torch (pytorch in conda) for additional performance (torch is currently an optional dependency, as it pulls in quite large binaries, for example the whole of mkl...).
Also, no cropping is done yet, this is still TODO.
About the poor performance, one thing to keep in mind: a mask of 1792x1920, let's say float32, is already ~13MB in size and does not fit into the L3 cache, to say nothing of the data. As soon as the reader supports tiles that don't match data blocks, we can test with tileshapes that stack in a way that fits the L3 cache better, for example (1, 7, 256, 256):
In [17]: num_masks, stackheight, num_cores = 1, 7, 4
In [18]: ((num_masks + stackheight) * (256 * 256 * 4 * num_cores)) / 1024
Out[18]: 8192.0
It would be helpful to separate two influences: the slowness of the K2IS reader, and the slowness of LiberTEM with large frames. This can be done by converting a K2IS dataset to a raw file, which currently performs best in LiberTEM.
Performance comparison (note: these numbers were obtained on a 4-core Xeon E3-1505M):

- K2IS reader: 15s for 8GB
- converted to raw, dtype=uint16: ~3.7s (delegating the tiling to BLAS)
- converted to raw, dtype=uint16: ~11s (with tileshape=(1, 1, 930, 16))
- converted to raw, dtype=uint16: ~1.7s (with tileshape=(1, 16, 8, 2048))
- converted to raw, dtype=uint16: ~4.5s (with tileshape=(1, 1, 1860, 2048))
- converted to raw, dtype=float32: ~1.5s (delegating the tiling to BLAS)
- converted to raw, dtype=float32: ~1.7s (with tileshape=(1, 16, 8, 2048))

The dataset converted to uint16 takes up 8.5GB; converted to float32, ~17GB.
Reading and processing with the same tileshape that the original data blocks had is already quite slow.
Ideas for performance improvements:
As an aside: when using integer datasets, we currently convert on the fly to something the BLAS libraries can use, in this case float32. Choosing a good tileshape then becomes important: if we choose something that doesn't fit the L3 cache (as is the case if we delegate the tiling to BLAS), the whole operation takes twice as long.
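For quick experiments, the back-of-the-envelope calculation from above could be wrapped into a small helper (a sketch; it assumes the first two tileshape entries are the stacked navigation part, and what counts as "fits" depends on the actual L3 cache size):

def working_set_kib(tileshape, num_masks=1, itemsize=4, num_cores=4):
    # (masks + stacked frames) * signal pixels * bytes per item * cores, in KiB
    stackheight = tileshape[0] * tileshape[1]
    sig_px = tileshape[2] * tileshape[3]
    return (num_masks + stackheight) * sig_px * itemsize * num_cores / 1024

print(working_set_kib((1, 7, 256, 256)))    # 8192.0 KiB, fits a typical 8MB L3
print(working_set_kib((1, 1, 1860, 2048)))  # 119040.0 KiB, much too large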
I've lost track a little... Sorry, had no time even to read the thread. I've seen the stuff you put into libertem.io.dataset.k2is. Does it already take the frame starting offsets into account? I'm trying to use it now with one of the example notebooks. I'm missing a way of visualizing the data, i.e. checking diffraction patterns for given positions; any hints?
Also, I wonder how to get raw data out of it? I tried the get_raw_data() method, but it returns something else...
Yes, it takes into account the starting offsets. If you enable debug logging, you should get something like this:
DEBUG:libertem.io.dataset.k2is:first_block_offsets #1: [357760, 357760, 424840, 424840, 0, 0, 0, 0]
DEBUG:libertem.io.dataset.k2is:have_overlap, finding next frame
DEBUG:libertem.io.dataset.k2is:first_block_offsets #2: [536640, 536640, 603720, 603720, 178880, 178880, 178880, 178880]
DEBUG:libertem.io.dataset.k2is:unique frame ids found: 400
DEBUG:libertem.io.dataset.k2is:finding frames with frame_id < 284799
DEBUG:libertem.io.dataset.k2is:have non-sequential frames, setting new offsets
DEBUG:libertem.io.dataset.k2is:first_block_offsets #3: [137916480, 137916480, 137983560, 137983560, 137558720, 137558720, 137558720, 137558720]
Using the new simple API, the following should give you raw data (single frames or a slice of many frames):
import matplotlib.pyplot as plt
from libertem.job.raw import PickFrameJob
from libertem.common.slice import Slice
# pick two full frames, starting at scan position (0, 0)
job2 = PickFrameJob(dataset=ds, slice_=Slice(origin=(0, 0, 0, 0), shape=(1, 2, 1860, 2048)))
res = ctx.run(job2) # res is a numpy array with shape as specified above
plt.figure()
plt.imshow(res[0, 0])
The get_raw_data stuff returns tiled data; it is actually a leftover from when I implemented the frame picking. I'll remove it to prevent any more confusion.
Getting at the raw data is this complicated because it must also work in a cluster configuration, where different parts of a frame may even be stored on different nodes.
I'm open to adding a simpler API in libertem.api.Context for getting at the raw data; I was thinking of something along the lines of:
ctx.get_raw_data(dataset=ds)[0, 0, 0:128, 0:128]
where get_raw_data returns a sliceable object that, when sliced, submits a PickFrameJob as in the above snippet. Note that accessing data this way is quite slow, so if you find yourself calling it in a loop, let's meet up and discuss.
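A rough sketch of what such a sliceable wrapper could look like (the class and the key handling are hypothetical, not part of the current API):

from libertem.job.raw import PickFrameJob
from libertem.common.slice import Slice

class RawDataWrapper:
    def __init__(self, ctx, dataset, shape):
        self.ctx = ctx
        self.dataset = dataset
        self.shape = shape  # full 4D shape, e.g. (34, 35, 1860, 2048)

    def __getitem__(self, key):
        # normalize ints to length-1 slices, then derive origin and shape
        slices = [k if isinstance(k, slice) else slice(k, k + 1) for k in key]
        origin = tuple(s.indices(n)[0] for s, n in zip(slices, self.shape))
        shape = tuple(s.indices(n)[1] - s.indices(n)[0]
                      for s, n in zip(slices, self.shape))
        job = PickFrameJob(dataset=self.dataset,
                           slice_=Slice(origin=origin, shape=shape))
        return self.ctx.run(job)

# usage, analogous to the proposal above:
# RawDataWrapper(ctx, ds, (34, 35, 1860, 2048))[0, 0, 0:128, 0:128]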
Great! Thanks! Will try it!
Regarding uint12 decoding, see also the following from the field of compression:
The main difference between @uellue's approach and simdcomp seems to be horizontal vs. vertical layout, and relying on the compiler's auto-vectorization vs. hand-written intrinsics. What remains to be seen is whether they support the byte ordering used by the K2 data. See the core of the implementation with sse, avx, and avx512.
Unrelated to the question of decoding K2 data, these compression techniques are really interesting, they should actually work quite well with our data sets, while still allowing for fast decompression.
@sk1p did you see anywhere a specification of what these algorithms do, i.e. which bit goes where? Fig. 4 in the PDF looks like just what we need; I'd just like to read somewhere "this is EXACTLY what this function does". Either "bit unpacking" is so trivial that it is self-explanatory to them and it does what it should for us, or the spec is "we only guarantee that you get the same result if you compress and decompress with our code in the exact same version."
Looking at their messy source code, I am growing fonder and fonder of modern compilers that work towards making monsters like that obsolete and (maybe, hopefully, some day?) achieving the same performance from clean and compact code.
Here's hoping that https://github.com/LiberTEM/LiberTEM/blob/master/prototypes/k2/uint12-decode/unpack-12-4.cpp would be fast, too! :-)
@uellue my guess is that the code was completely generated by something like this, and the interface is simple enough to test things.
As for specification, I think trivial/self-explaining is the case. I will have a closer look tomorrow.
@sk1p yes, that is what I meant with "monster" and "messy": Having to write Python scripts that generate voluminous source code which is then compiled. Essentially that's people doing by hand what should be the job of compiler and optimizer. As far as I understand the current developments in C++ (and JIT-ing with numba etc), the goal is to make the language expressive enough and the compiler advanced enough to just write the whole thing in one language, let the compiler evaluate what corresponds to the "Python part" of the current system, and let it generate the "intermediate code" and assembly by itself. That's how I understand "constexpr": "Dear compiler, I wrote this bit of code specifically for you to evaluate at compile time. I also made sure that you know most of the arguments at compile time. Please inline here whatever comes out after reducing this to the remaining variable portion."
Actually, C/C++ has always been two (resp. three) separate languages folded into one source file: preprocessor macros, template metaprogramming, and then the actual C/C++ code. Interesting that people resort to generating code with Python rather than using the built-in methods!
With the changes in 291b890 we now process the 8GB dataset in ~3.4 seconds, which is about 2.4GB/s. We now read the data in a way that is perfect for our mask application algorithm, for the last factor of two or so I'll look into the decoding thing in more depth.
@uellue I guess they did it that way because it actually works consistently, across compilers, at time of the release. MSVC, for example, didn't have constexpr support in 2014.
@sk1p Congratulations! That's on four cores, and 2.4 GB/s of packed uints, right?
Yes, good point. Even today, the output of clang and g++ is vastly different for the different examples. I guess the world is still waiting for that silver bullet... Maybe bit packing and unpacking is a nice playground for language and compiler design, and benchmarking!
Yeah, the raw packed dataset is about 8GB, and the 3.4s is as measured in the GUI. I wonder how much it improves by splitting each file into subpartitions and thus running on more workers. I also really want to start testing your C++ code!
NICE, number for number that's almost as fast as processing 32-bit ints, isn't it? Do you think that with a fast unpacker it could actually be FASTER than unpacked values, similar to blosc and the likes?
Compared to float32 we only need to read 37.5% of the data from the disk/NAS/FS cache... we also save quite some FS cache memory by reading the data compressed. So we may win back some of the time that is spent shoveling data from the FS cache to GEMM when working on floats directly - I'm really curious, too!
Sectors are now split into subpartitions: dd5c52fa731350fa144da9761e29a50777782403 - this should allow for more parallelism when reading and processing k2 files.
With the increased parallelism we can now process the ~8GB dataset in 1.8s on a beefy 18 core machine.
I managed to successfully load the 02_0V dataset (202GB) by @woozey in LiberTEM by copying it to the local hard disk of moellenstedt. It works quite well actually - applying a single mask takes about 34s once the data is in RAM.
There is still weirdness going on - the scan ordering is somehow messed up:
This is what @kruzaeva also saw in her analysis as this weird jump in the data.
The other thing was about overlaps, which can be seen here:
This is at x=56 y=59 so directly next to the "weird jump", so I'll have to debug some more, check other datasets etc...
Indeed, the left part of the plot before the jump should actually be on the right side... Otherwise it looks fine. And there are now empty frames at the top, as we have seen with the small test dataset.
@kruzaeva, for the moment you may use this region for your testing purposes:
Just in case: the black part in the bottom left corner is a non-transparent counter electrode - it isn't an artefact.
Indeed, there are some frames at the beginning we need to skip. This is my first try (cropped a little, to better fit available RAM):
Now, I have more or less manually skipped the first 56 frames. So this is still a synchronization problem. And 56 is not perfect either: the first column appears to be wrong.
But if we skip 57 frames instead:
Now the last column appears wrong.
Looking at a frame in that region (x=290, y=40 while skipping 57 pixels at the beginning) we get these famous overlapping patterns:
Is it possible that these are actually in the file and result from scanning/synchronization errors?
Here is a screenshot with skip_frames=57 on the 02_0V dataset:
Now the question remains: how to automatically find out how many frames to skip? I had a look into the gtg file, and did not find anything that could be the number of frames to skip, either directly, by frame id or by block index.
How I currently synchronize: first, I synchronize sectors such that each incomplete frame is skipped over, and I find offsets such that all sectors start with the same frame id. Then I look at the next N frames (where N is currently set to 400) and check whether the frame id makes a jump backwards (example: in one dataset, the frames start at ~284799, but about 190 frames later they reset to 2 and count up again). Until now, I assumed that the "real" data starts directly after this jump, but it appears that this is not the case. Permalink to the sync code.
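In simplified form, the second step looks roughly like this (a self-contained paraphrase of the logic described above, not the actual linked implementation; the function name is made up):

def find_reset_index(frame_ids, n_check=400):
    # look for a backwards jump in the frame ids of the first n_check frames;
    # until now we assumed that the "real" data starts right after this jump
    window = frame_ids[:n_check]
    for i in range(1, len(window)):
        if window[i] < window[i - 1]:
            return i
    return 0

# example: frame ids start at ~284799 and reset to 2 after ~190 frames
ids = list(range(284799, 284799 + 190)) + list(range(2, 212))
print(find_reset_index(ids))  # -> 190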
@woozey do you think you could ask your contact at Gatan how this is supposed to work, and point them at the last few comments of this issue?
@sk1p It might very well be that we are looking at an issue with the K2 in combination with the scan unit (@woozey that's a Digiscan, right?). As far as I know, using the K2 like this is not its main application and a bit experimental. For that reason it would be good if we can use the LiberTEM GUI for debugging and tuning. In general, that's a useful application when setting up a combination of camera and scan unit.
What would be the best way to give users an interface to skip the right number of frames? With the GUI we can look at the scanned image and the individual frames. Would it be possible to first load and display as much data as is in the file and let the user decide where to crop for further processing? I mean, the EMPAD has the same effect of having some "extra" data in the file, right?
Maybe it would make sense to keep something like the "load dataset" dialogue open above the analyses? A user could make changes to the settings, and LiberTEM would update the current analyses resp. gray them out until the user presses "Apply" again?
@woozey @sk1p The last column is probably the "flyback" when the beam moves from the end of one line to the beginning of the next, right? I guess that could generally be skipped?
Something to note about a possible C/C++ implementation of uint12 decoding: reinterpreting char as uint32 is not safe in all cases. See also:
http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
I split off the uint12 decoding and gtg reading into separate issues, so we can concentrate on the remaining "correctness" and synchronization issues here.
@uellue you were at the Gatan booth at IMC, right? Did you ask about the frame skipping stuff?
@sk1p Yes! :-) I have on my agenda today to send an e-mail about that.
@woozey can you confirm 1) that the data was acquired in IS mode (the gtg file says it was; is there any other way to find out?), and 2) which firmware version is/was installed on the K2?