Depending on how well we can handle the weird 12-bit format and the shuffling of data blocks, we may need to convert to a sensible format (HDF5, some kind of proto-NeXus?) before processing.
I got something slow but seemingly correct working. Still need to figure out the detailed guarantees of the format, and implement appropriate optimizations. Then, integrate into LiberTEM.
@sk1p Here are some CPU instructions that might help to make it fast:
General: https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets
Testing if the CPU has it: https://stackoverflow.com/questions/44827477/does-avx-support-imply-bmi1-support
AVX-512: https://software.intel.com/en-us/node/523829
BMI2: https://www.felixcloutier.com/x86/PEXT.html
BMI1: https://www.felixcloutier.com/x86/BEXTR.html
What would be better, using the intrinsics directly or writing it in pure C/C++ and checking if the compiler actually generates efficient machine code?
A second point: efficiently unpacking n-bit (u)ints packed into a bitfield to 8-bit, 16-bit, 32-bit, or 64-bit (u)ints could be a useful function for numpy and such. Apparently it doesn't exist yet, at least I couldn't find it. A quick Google search only found https://stackoverflow.com/questions/37797779/python-reading-12-bit-packed-binary-image which looks horrendously slow.
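For reference, a minimal numpy sketch of the usual 3-bytes-to-2-values approach (assuming little-endian packing with the low 12 bits first; whether that matches the K2 layout still needs to be confirmed):

import numpy as np

def unpack_uint12_le(buf):
    # every 3 bytes hold two 12-bit values, low bits first;
    # the input length must be a multiple of 3 bytes
    b = np.frombuffer(buf, dtype=np.uint8).astype(np.uint16).reshape(-1, 3)
    out = np.empty((b.shape[0], 2), dtype=np.uint16)
    out[:, 0] = b[:, 0] | ((b[:, 1] & 0x0F) << 8)
    out[:, 1] = (b[:, 1] >> 4) | (b[:, 2] << 4)
    return out.reshape(-1)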
How fast is this one? https://pythonhosted.org/bitstring/constbitarray.html#bitstring.Bits.unpack
Reading packed integers with Python doesn't seem to be a frequent task; I also couldn't find much about it. I'll be looking at the performance of different techniques; my current list:
Just for discussion, an entirely untested piece of code that compiles nicely to SIMD instructions when optimized: https://godbolt.org/g/yMpb8X
Conclusion: ingesting 8-byte chunks and processing them apparently allows the compiler to vectorize the inner loop. Overlapping parts and offsets are carried along. Since the number of iterations until input and output are aligned again can be precalculated, there is no "if" statement in the hot path, just an inner loop with a precalculated iteration count and a small piece of code that handles the remaining data. This apparently helps the optimizer to generate compact machine code and use vector instructions.
TODO actually test this and compare it to other solutions. I don't have a build chain for C at the moment.
#include <inttypes.h>
#include <string.h>

#define CHUNK_T uint64_t

// Decode packed 12-bit unsigned values into uint16_t.
// Assumes little-endian packing: the first value occupies the low 12 bits
// of the first three bytes read as a little-endian integer.
// NOTE: still untested against real K2 data; the bit order may differ.
void decode_uint12(int size, const char* restrict inp, uint16_t* restrict out) {
    const int bytes = sizeof(CHUNK_T);
    const int bits = 12;
    const CHUNK_T mask = ((CHUNK_T)1 << bits) - 1;  // 0xFFF
    // After 3 chunks (24 bytes = 192 bits = 16 values) the bit offset is
    // zero again, so the outer loop works on aligned 24-byte groups.
    // In general: stride = bits / gcd(bytes * 8, bits).
    const int stride = 3;
    int o = 0;
    for (int i = 0; i + bytes * stride <= size; i += bytes * stride) {
        CHUNK_T rem = 0;   // leftover bits carried over from the previous chunk
        int rem_bits = 0;  // number of valid bits in rem
        for (int s = 0; s < stride; s++) {
            CHUNK_T tmp;
            // memcpy avoids unaligned access and strict-aliasing problems;
            // compilers reduce this to a single load
            memcpy(&tmp, inp + i + s * bytes, bytes);
            if (rem_bits > 0) {
                // complete the value that straddles the chunk boundary
                out[o] = (uint16_t)((rem | (tmp << rem_bits)) & mask);
                o += 1;
                tmp >>= bits - rem_bits;
            }
            // bits of this chunk that are still unconsumed
            const int avail = bytes * 8 - (rem_bits > 0 ? bits - rem_bits : 0);
            // values that fit entirely into the remaining bits
            for (int j = 0; j < avail / bits; j += 1) {
                out[o] = (uint16_t)(tmp & mask);
                o += 1;
                tmp >>= bits;
            }
            rem = tmp;
            rem_bits = avail % bits;
        }
        // rem_bits == 0 here: a group of `stride` chunks decodes into a
        // whole number of values, so nothing is carried across groups.
    }
    // TODO handle unaligned rest of input here
}
There are some prototype notebooks available at https://github.com/LiberTEM/LiberTEM/tree/master/prototypes/k2/notebook - note that the resulting HDF5 file of the conversion notebook is a 3D dataset, which LiberTEM can't currently read as-is. Also, the large frame size revealed some rather comically grave bugs, which I'm now in the process of fixing.
Is my understanding correct that data blocks are not sorted at all?
@woozey as I understand it, one ordering guarantee is that the frame id is monotonic (modulo the "garbage" blocks at the beginning), that is, data blocks that belong to one frame are not "interleaved" with data blocks from different frames.
Other than that, maybe have a look at the pixel_x_start and pixel_y_start values; maybe there is a fixed pattern, but in the spec there is no explicit guarantee, as far as I can see.
With reference to https://github.com/LiberTEM/LiberTEM/blob/master/prototypes/k2/notebook/K2%20IS%20v2.ipynb:
fig, axes = plt.subplots()
axes.imshow(frame_res, cmap="gist_earth", vmin=10., vmax=20.)
this shows that the segment [:, 1920:] always has a fixed value of 16. This part of the frame can be cropped away.
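For example, a minimal crop (a sketch, using frame_res from the notebook above):

frame_cropped = frame_res[:, :1920]  # drop the columns that only contain the fixed value 16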
True, there are also supposed to be 4 rows at the beginning that contain non-image data, but I didn't find evidence of that yet.
> there are also supposed to be 4 rows at the beginning that contain non-image data, but I didn't find evidence of that yet.
me neither, but I'll check the other data sets and will let you know if I find something
And there we are:
So everything within [1796:, :1919] seems to be just readout noise. GMS outputs 1792x1920 images. In any case, I wouldn't hardcode the dimensions of the output images, as I can imagine that this noise area may be useful for calibration of the readout channels...
@sk1p in case you want to have a closer look at this dataset, it is located in /storage/holo/Vadim/2017-05-08_IS_tests/06_holo/holo3
I haven't found anything at the beginning, though...
Nice! I wonder if the spec is wrong, or maybe we are looking at the wrong spec: the spec says the version should be 2, but in the dataset I looked at it was 1.
In the .gtg file, there are some related values:
'Inter-Frame Row Padding': 66,
'Pre-Frame Row Padding': 4,
So even in 400 fps mode, there is this 66 px padding for each super-frame at the bottom, as specified for the "faster" acquisition.
There is also this:
'Padded Frame Count': 10,
Together with this...
In [24]: reader.tags_dict['SI Dimensions']
Out[24]: {'Size X': 35, 'Size Y': 34}
... the total frame count at the end should be 35 * 34 + 10 = 1200.
As a short aside: this is how I opened the gtg file with hyperspy:
In [8]: path = "/home/clausen/Data/K2-IS/Capture52/Capture52_.gtg"
In [9]: from hyperspy.io_plugins.digital_micrograph import DigitalMicrographReader
In [10]: f = open(path, "rb")
In [11]: reader = DigitalMicrographReader(f)
In [12]: reader.parse_file()
In [13]: reader.tags_dict
{'root': {},
'Acquisition': {'Device': {'Active Size (pixels)': (9.486e-321, 8.854e-321),
'Camera Number': 1,
[...]
Btw. the conversion notebook is not correct yet due to different frame starting offsets per sector not being taken into account:
At least that's what I think is happening here.
There is now experimental support for loading K2IS data. Experimental means, among other things, that scan dimensions are not yet read from the .gtg files, so you need to specify them manually. To use it, just select any of the .bin files in the file browser that belong to your dataset.
@woozey if you want to test on moellenstedt, make sure to start LiberTEM with the environment variables OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 to disable native threading of the BLAS libraries; and pip install torch (pytorch in conda) for additional performance (torch is currently an optional dependency, as it pulls in quite large binaries, for example the whole of mkl...).
Also, no cropping is done yet, this is still TODO.
About the poor performance, one thing to keep in mind: a mask of 1792x1920, let's say float32, is already ~13MB in size and does not fit into the L3 cache, to say nothing of the data. As soon as the reader supports tiles that don't match data blocks, we can test with tileshapes that stack in a way that fits the L3 cache better, for example (1, 7, 256, 256):
In [17]: num_masks, stackheight, num_cores = 1, 7, 4
In [18]: ((num_masks + stackheight) * (256 * 256 * 4 * num_cores)) / 1024
Out[18]: 8192.0
It would be helpful to separate two influences: the slowness of the K2IS reader, and the slowness of LiberTEM with large frames. This can be done by converting a K2IS dataset to a raw file, which currently performs best in LiberTEM.
Performance comparison (note: these numbers were obtained on a 4-core Xeon E3-1505M):

- K2IS reader: 15s for 8GB
- converted to raw, dtype=uint16: ~3.7s (delegating the tiling to BLAS)
- converted to raw, dtype=uint16: ~11s (with tileshape=(1, 1, 930, 16))
- converted to raw, dtype=uint16: ~1.7s (with tileshape=(1, 16, 8, 2048))
- converted to raw, dtype=uint16: ~4.5s (with tileshape=(1, 1, 1860, 2048))
- converted to raw, dtype=float32: ~1.5s (delegating the tiling to BLAS)
- converted to raw, dtype=float32: ~1.7s (with tileshape=(1, 16, 8, 2048))

The dataset converted to uint16 takes up 8.5GB; converted to float32, ~17GB.
Reading and processing with the same tileshape that the original data blocks had is already quite slow.
Ideas for performance improvements:
As an aside: when using integer datasets, we currently convert on the fly to something the BLAS libraries can use, in this case float32. Choosing a good tileshape then becomes important: if we choose something that doesn't fit the L3 cache (as is the case if we delegate the tiling to BLAS), the whole operation takes twice as long.
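For quick experiments, the back-of-the-envelope calculation from above could be wrapped into a small helper (a sketch; it assumes the first two tileshape entries are the stacked navigation part, and what counts as "fits" depends on the actual L3 cache size):

def working_set_kib(tileshape, num_masks=1, itemsize=4, num_cores=4):
    # (masks + stacked frames) * signal pixels * bytes per item * cores, in KiB
    stackheight = tileshape[0] * tileshape[1]
    sig_px = tileshape[2] * tileshape[3]
    return (num_masks + stackheight) * sig_px * itemsize * num_cores / 1024

print(working_set_kib((1, 7, 256, 256)))    # 8192.0 KiB, fits a typical 8MB L3
print(working_set_kib((1, 1, 1860, 2048)))  # 119040.0 KiB, much too large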
I've lost track a little... Sorry, had no time even to read the thread. I've seen the stuff you put into libertem.io.dataset.k2is. Does it already take the frame starting offsets into account? I'm trying to use it now with one of the example notebooks. I'm missing a way of visualizing the data, i.e. checking diffraction patterns for given positions; any hints?
Also, I wonder how to get raw data out of it? I tried the get_raw_data() method, but it returns something else...
Yes, it takes into account the starting offsets. If you enable debug logging, you should get something like this:
DEBUG:libertem.io.dataset.k2is:first_block_offsets #1: [357760, 357760, 424840, 424840, 0, 0, 0, 0]
DEBUG:libertem.io.dataset.k2is:have_overlap, finding next frame
DEBUG:libertem.io.dataset.k2is:first_block_offsets #2: [536640, 536640, 603720, 603720, 178880, 178880, 178880, 178880]
DEBUG:libertem.io.dataset.k2is:unique frame ids found: 400
DEBUG:libertem.io.dataset.k2is:finding frames with frame_id < 284799
DEBUG:libertem.io.dataset.k2is:have non-sequential frames, setting new offsets
DEBUG:libertem.io.dataset.k2is:first_block_offsets #3: [137916480, 137916480, 137983560, 137983560, 137558720, 137558720, 137558720, 137558720]
Using the new simple API, the following should give you raw data (single frames or a slice of many frames):
import matplotlib.pyplot as plt
from libertem.job.raw import PickFrameJob
from libertem.common.slice import Slice
# pick two full frames, starting at scan position (0, 0)
job2 = PickFrameJob(dataset=ds, slice_=Slice(origin=(0, 0, 0, 0), shape=(1, 2, 1860, 2048)))
res = ctx.run(job2) # res is a numpy array with shape as specified above
plt.figure()
plt.imshow(res[0, 0])
The get_raw_data stuff returns tiled data; it is actually a leftover from when I implemented the frame picking. I'll remove it to prevent any more confusion.
Getting at the raw data is this complicated because it must also work in a cluster configuration, where different parts of a frame may even be stored on different nodes.
I'm open to adding a simpler API in libertem.api.Context for getting at the raw data; I was thinking of something along the lines of:
ctx.get_raw_data(dataset=ds)[0, 0, 0:128, 0:128]
where get_raw_data returns a sliceable object that, when sliced, submits a PickFrameJob as in the above snippet. Note that accessing data this way is quite slow, so if you find yourself calling it in a loop, let's meet up and discuss.
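A rough sketch of what such a sliceable wrapper could look like (the class and the key handling are hypothetical, not part of the current API):

from libertem.job.raw import PickFrameJob
from libertem.common.slice import Slice

class RawDataWrapper:
    def __init__(self, ctx, dataset, shape):
        self.ctx = ctx
        self.dataset = dataset
        self.shape = shape  # full 4D shape, e.g. (34, 35, 1860, 2048)

    def __getitem__(self, key):
        # normalize ints to length-1 slices, then derive origin and shape
        slices = [k if isinstance(k, slice) else slice(k, k + 1) for k in key]
        origin = tuple(s.indices(n)[0] for s, n in zip(slices, self.shape))
        shape = tuple(s.indices(n)[1] - s.indices(n)[0]
                      for s, n in zip(slices, self.shape))
        job = PickFrameJob(dataset=self.dataset,
                           slice_=Slice(origin=origin, shape=shape))
        return self.ctx.run(job)

# usage, analogous to the proposal above:
# RawDataWrapper(ctx, ds, (34, 35, 1860, 2048))[0, 0, 0:128, 0:128]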
Great! Thanks! Will try it!
Regarding uint12 decoding, see also the following from the field of compression:
The main difference between @uellue's approach and simdcomp seems to be horizontal vs. vertical layout, and relying on the compiler's auto-vectorization vs. hand-written intrinsics. What remains to be seen is whether they support the byte ordering used by the K2 data. See the core of the implementation with sse, avx, and avx512.
Unrelated to the question of decoding K2 data, these compression techniques are really interesting, they should actually work quite well with our data sets, while still allowing for fast decompression.
@sk1p did you see anywhere a specification of what these algorithms do, i.e. which bit goes where? Fig. 4 in the PDF looks like just what we need; I'd just like to read somewhere "this is EXACTLY what this function does". Either "bit unpacking" is so trivial that it is self-explanatory to them and it does what it should for us, or the spec is "we only guarantee that you get the same result if you compress and decompress with our code in the exact same version."
Looking at their messy source code, I am growing fonder and fonder of modern compilers that work towards making monsters like that obsolete and (maybe, hopefully, some day?) achieving the same performance from clean and compact code.
Here's hoping that https://github.com/LiberTEM/LiberTEM/blob/master/prototypes/k2/uint12-decode/unpack-12-4.cpp would be fast, too! :-)
@uellue my guess is that the code was completely generated by something like this, and the interface is simple enough to test things.
As for specification, I think trivial/self-explaining is the case. I will have a closer look tomorrow.
@sk1p yes, that is what I meant with "monster" and "messy": Having to write Python scripts that generate voluminous source code which is then compiled. Essentially that's people doing by hand what should be the job of compiler and optimizer. As far as I understand the current developments in C++ (and JIT-ing with numba etc), the goal is to make the language expressive enough and the compiler advanced enough to just write the whole thing in one language, let the compiler evaluate what corresponds to the "Python part" of the current system, and let it generate the "intermediate code" and assembly by itself. That's how I understand "constexpr": "Dear compiler, I wrote this bit of code specifically for you to evaluate at compile time. I also made sure that you know most of the arguments at compile time. Please inline here whatever comes out after reducing this to the remaining variable portion."
Actually, C/C++ has always been two (resp. three) separate languages folded into one source file: preprocessor macros, template metaprogramming, and then the actual C/C++ code. Interesting that people resort to generating code with Python rather than using the built-in methods!
With the changes in 291b890 we now process the 8GB dataset in ~3.4 seconds, which is about 2.4GB/s. We now read the data in a way that is perfect for our mask application algorithm, for the last factor of two or so I'll look into the decoding thing in more depth.
@uellue I guess they did it that way because it actually works consistently, across compilers, at time of the release. MSVC, for example, didn't have constexpr support in 2014.
@sk1p Congratulations! That's on four cores, and 2.4 GB/s of packed uints, right?
Yes, good point. Even today, the output of clang and g++ is vastly different for the different examples. I guess the world is still waiting for that silver bullet... Maybe bit packing and unpacking is a nice playground for language and compiler design, and benchmarking!
Yeah, the raw packed dataset is about 8GB, and the 3.4s is as measured in the GUI. I wonder how much it improves by splitting each file into subpartitions and thus running on more workers. I also really want to start testing your C++ code!
NICE, number for number that's almost as fast as processing 32-bit ints, isn't it? Do you think that with a fast unpacker it could actually be FASTER than unpacked values, similar to blosc and the likes?
Compared to float32 we only need to read 37.5% of the data from the disk/NAS/FS cache... we also save quite some FS cache memory by reading the data compressed. So we may win back some of the time that is spent shoveling data from the FS cache to GEMM when working on floats directly - I'm really curious, too!
Sectors are now split into subpartitions: dd5c52fa731350fa144da9761e29a50777782403 - this should allow for more parallelism when reading and processing k2 files.
With the increased parallelism we can now process the ~8GB dataset in 1.8s on a beefy 18 core machine.
I managed to successfully load the 02_0V dataset (202GB) by @woozey in LiberTEM by copying it to the local hard disk of moellenstedt. It works quite well actually - applying a single mask takes about 34s once the data is in RAM.
There is still weirdness going on - the scan ordering is somehow messed up:
This is what @kruzaeva also saw in her analysis as this weird jump in the data.
The other thing was about overlaps, which can be seen here:
This is at x=56 y=59 so directly next to the "weird jump", so I'll have to debug some more, check other datasets etc...
Indeed, the left part of the plot before the jump should actually be on the right side... Otherwise it looks fine. And there are now empty frames at the top, as we have seen with the small test dataset.
@kruzaeva, for the moment you may use this region for your testing purposes:
Just in case: the black part in the bottom left corner is a non-transparent counter electrode - it isn't an artefact.
Indeed, there are some frames at the beginning we need to skip. This is my first try (cropped a little, to better fit available RAM):
Now, I have more or less manually skipped the first 56 frames. So this is still a synchronization problem. And 56 is not perfect either: the first column appears to be wrong.
But if we skip 57 frames instead:
Now the last column appears wrong.
Looking at a frame in that region (x=290, y=40 while skipping 57 pixels at the beginning) we get these famous overlapping patterns:
Is it possible that these are actually in the file and result from scanning/synchronization errors?
Here is a screenshot with skip_frames=57 on the 02_0V dataset:
Now the question remains: how to automatically find out how many frames to skip? I had a look into the gtg file, and did not find anything that could be the number of frames to skip, either directly, by frame id or by block index.
How I currently synchronize: first, I synchronize sectors such that each incomplete frame is skipped over, and I find offsets such that all sectors start with the same frame id. Then I look at the next N frames (where N is currently set to 400) and check whether the frame id makes a jump backwards (example: in one dataset, the frames start at ~284799, but about 190 frames later they reset to 2 and count up again). Until now, I assumed that the "real" data starts directly after this jump, but it appears that this is not the case. Permalink to the sync code.
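In simplified form, the second step looks roughly like this (a self-contained paraphrase of the logic described above, not the actual linked implementation; the function name is made up):

def find_reset_index(frame_ids, n_check=400):
    # look for a backwards jump in the frame ids of the first n_check frames;
    # until now we assumed that the "real" data starts right after this jump
    window = frame_ids[:n_check]
    for i in range(1, len(window)):
        if window[i] < window[i - 1]:
            return i
    return 0

# example: frame ids start at ~284799 and reset to 2 after ~190 frames
ids = list(range(284799, 284799 + 190)) + list(range(2, 212))
print(find_reset_index(ids))  # -> 190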
@woozey do you think you could ask your contact at Gatan how this is supposed to work, and point them at the last few comments of this issue?
@sk1p It might very well be that we are looking at an issue with the K2 in combination with the scan unit (@woozey that's a Digiscan, right?). As far as I know, using the K2 like this is not its main application and a bit experimental. For that reason it would be good if we can use the LiberTEM GUI for debugging and tuning. In general, that's a useful application when setting up a combination of camera and scan unit.
What would be the best way to give users an interface to skip the right number of frames? With the GUI we can look at the scanned image and the individual frames. Would it be possible to first load and display as much data as is in the file and let the user decide where to crop for further processing? I mean, the EMPAD has the same effect of having some "extra" data in the file, right?
Maybe it would make sense to keep something like the "load dataset" dialogue open above the analyses? A user could make changes to the settings, and LiberTEM would update the current analyses resp. gray them out until the user presses "Apply" again?
@woozey @sk1p The last column is probably the "flyback" when the beam moves from the end of one line to the beginning of the next, right? I guess that could generally be skipped?
Something to note about a possible C/C++ implementation of uint12 decoding: reinterpreting char as uint32 is not safe in all cases. See also:
http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
I split off the uint12 decoding and gtg reading into separate issues, so we can concentrate on the remaining "correctness" and synchronization issues here.
@uellue you were at the Gatan booth at IMC, right? Did you ask about the frame skipping stuff?
@sk1p Yes! :-) I have on my agenda today to send an e-mail about that.
@woozey can you confirm 1) that the data was acquired in IS mode (the gtg file says it was; is there any other way to find out?), and 2) which firmware version is/was installed on the K2?