biochem-fan opened this issue 3 years ago

I read your paper in Nature Communications, "A data reduction and compression description for high throughput time-resolved electron microscopy" (https://doi.org/10.1038/s41467-020-20694-z). I enjoyed your detailed description and analysis of how to reduce "puddles" to event information, since such algorithms are usually implemented in hardware and rarely described.

I have several questions about your use of single-particle cryoEM datasets from EMPIAR as examples.

First, Table 2 entries No. 1 to 3 say Falcon II, but EMPIAR-10299 is from a K2; the image size of 7676 x 7420 pixels indicates that they are from a K2 detector in super-resolution mode.

Next, most detectors used in cryoEM (K2, K3, Falcon III EC and Falcon 4) work in the so-called counting mode. In the K2 and K3 especially, an electron puddle spanning multiple pixels is reduced to one dot. This is done in hardware (FPGA). In other words, your L4 reduction step is already performed by the detector before movies are written, and I am not sure it is justified to repeat it: by reducing connected non-zero pixels into one, we lose genuine electron events. In the Falcon 3 and Falcon 4 detectors, electron events are first localized (reduced) to dots on a super-resolution grid and then rendered as 3x3 blobs; in this case, reducing the blobs back to dots with your L4 algorithm might be useful for compression. Please see https://www3.mrc-lmb.cam.ac.uk/relion/index.php/Image_compression for examples of detectors used in cryoEM and the lossless compression used with them.

Finally, did you confirm that ReCoDe does not affect the final resolution by re-processing EMPIAR movies? You compared MTF curves of the DE-16, but what really matters is the DQE. Moreover, the results might not apply to K2/K3 movies, because those are already "reduced" by the detector, as explained above.
Hi! Thanks for the questions! Let me start with the easier one: why use ReCoDe at all for already-counted data on EMPIAR? Indeed, for EMPIAR datasets whose frames are not sparse (i.e., most pixels have non-zero values), ReCoDe acts only as a lossless compressor: its reduction step does very little, and only its downstream lossless compression algorithms (by default) reduce the file sizes. So it behaves no differently from the compression algorithms commonly used in cryoEM.
However, the situation changes when the datasets (e.g., movies) are sparse: then ReCoDe can both reduce the electron puddles and compress the result. That is why we compared ReCoDe to MRCZ at very low dose rates.
The EMPIAR datasets we looked at were already counted (by detector hardware), and hence were reduced losslessly by ReCoDe (i.e., the decompressed output is identical to the input). We are therefore certain that reprocessing EMPIAR datasets will not impact the resolution of cryoEM reconstructions. The current scope of the paper doesn't address DQE, and certainly not with the EMPIAR datasets (which weren't deposited as truly raw detector data).
So for existing cryoEM pipelines that have to use the vendor-supplied reduction built into the detector, ReCoDe mainly serves to reduce file sizes. We added this point to the manuscript largely at the reviewers' request.
But the more important message here is the following. By studying and performing electron counting, reduction, and compression ourselves, we demystify the process and show that it really isn't that expensive to do. If certain choices were made, we could have continuous-exposure, low-dose-rate movies for both cryoEM and in-situ materials imaging (essentially any application?). We can only speculate why certain detectors do not support electron-counted movies, when we show it is possible to record them essentially non-stop for days, since the post-ReCoDe data is small enough to be streamed over 10 GbE to a remote storage computer. Perhaps by being completely open in our approach, our paper can drive vendor support for affordable electron-counted movies for ALL applications.
The fact that DE was willing to open up their platform for us is indeed a boon. We are now processing such super-long movie data for ourselves (see preview at https://news.nus.edu.sg/cbis-duane-loh-electron-microscopy-computational-lens/).
Thank you very much for your response.
> Indeed, for EMPIAR datasets whose frames are not sparse (i.e., most pixels have non-zero values), ReCoDe acts only as a lossless compressor: its reduction step does very little, and only its downstream lossless compression algorithms (by default) reduce the file sizes. So it behaves no differently from the compression algorithms commonly used in cryoEM.
Does this happen automatically, or did you disable the reduction steps with a command-line option?
For example, suppose you have 3 electron events in a counted K2 movie frame:
```
0 0 0 0 0
0 0 2 0 0
0 0 1 0 0
0 0 0 0 0
```
Here, "frame" means not the raw detector frame, but the frame written by DigitalMicrograph. Internally K2 runs at hundreds of frames per second, locates electrons in each frame, replacing puddles into dots, and then sums multiple raw frames into an output frame (in Falcon this is called a "fraction"). So these 2
and 1
represent genuine three events.
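To illustrate the summing (a toy sketch only, not Gatan's actual pipeline): many very sparse, essentially binary counted raw frames accumulate into one written frame, which is how values like 2 arise.

```python
import numpy as np

# Toy illustration only: simulate 400 sparse counted raw frames
# (~0.1% occupancy each) and sum them into one written frame ("fraction").
rng = np.random.default_rng(0)
raw_frames = (rng.random((400, 5, 5)) < 0.001).astype(np.uint8)
written_frame = raw_frames.sum(axis=0, dtype=np.uint16)
# Pixels struck in more than one raw frame end up with values > 1.
```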
My concern is that your L4 algorithm might reduce them to:
```
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0
```
Doesn't this happen?
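For concreteness, here is a minimal sketch (using scipy's connected-component labelling, not ReCoDe's actual code) of why this merging would happen:

```python
import numpy as np
from scipy import ndimage

# The counted frame from above: the vertically adjacent 2 and 1 are three
# genuine electron events, but they form a single connected component.
frame = np.array([[0, 0, 0, 0, 0],
                  [0, 0, 2, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 0, 0, 0]])

labels, n_puddles = ndimage.label(frame > 0)
print(n_puddles)  # 1 -- a puddle-reduction step would treat this as ONE event
```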
I am interested in performing the compression analysis myself and comparing it with existing lossless compression methods.
Could you post the list of file names and command-line options you used for Figure 7a and Supplementary Figure 7a? I guess you didn't compress all movies from these EMPIAR entries, but only a random subset.
> Note that the floating point data was converted to unsigned integer by normalising the pixel values to the 0-4096 range and rounding to the nearest integer
Why did you use the 0-4096 range (12 bits), not the full 16 bit integer range?
Is the reported compression ratio relative to the file size in 16 bit integers, not relative to the original MRC files (some in 8 bit int, some in 32 bit float)?
When you wrote "identical to the input", do you mean the input after normalizing to 16 bit (actually 12 bit) integers, not the original 32 bit floating points?
> But the more important message here is the following.
Yes, I totally agree with this paragraph. I wish other detector manufacturers provided access to raw detector frames.
Hi! I want to quickly address an important point you mentioned: don't do L4 reduction on post-counted data!
> Does this happen automatically, or did you disable the reduction steps with a command-line option?

You can specify the reduction level (L1, L2, L3, or L4) in the config file. I believe there is no default reduction level.
The comparisons in our paper, and what I implied above, use L1 reduction, which only reduces away the zero-valued pixels before sending the data to the lossless compression algorithm of your choice.
So your example above, with a single puddle (the 2 above the 1), is stored as something like (puddle_x, puddle_y, (2, 1)) in the ReCoDe format.
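As a rough sketch of the idea (illustrative only, not the actual ReCoDe implementation), L1 reduction keeps a packed binary map of the non-zero pixels plus their values:

```python
import numpy as np

# Minimal sketch of L1-style reduction (not ReCoDe's code).
frame = np.array([[0, 0, 0, 0, 0],
                  [0, 0, 2, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 0, 0, 0]], dtype=np.uint16)

binary_map = frame > 0                 # which pixels are non-zero
values = frame[binary_map]             # their intensities: array([2, 1])
packed_map = np.packbits(binary_map)   # 1 bit per pixel; map and values are
                                       # then handed to a lossless compressor
```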
I'll let Abhik tackle the technical questions.
Hi @biochem-fan, thanks for your interest in ReCoDe.
Let me answer some of the questions that @duaneloh hasn't already answered.
> First, Table 2 entries No. 1 to 3 say Falcon II, but EMPIAR-10299 is from a K2; the image size of 7676 x 7420 pixels indicates that they are from a K2 detector in super-resolution mode.
This is an oversight on my part. Given the file sizes, these datasets do indeed seem to be from the K2. The original paper by Casanal et al. mentioned using both the Falcon II and the K2 for different experiments; I got confused and thought that only the datasets tagged MRCS_Diamond were from the K2 and that the rest were from the Falcon II.
> Why did you use the 0-4096 range (12 bits), not the full 16 bit integer range?
This was done only for Supplementary Figure S7, to make the EMPIAR datasets comparable with the simulated data, which follows the 12-bit data range of the DE-16. The objective of this experiment was to show that the compression rates follow the same pattern in both simulated and EMPIAR datasets. In Figure 7 in the main text, the EMPIAR datasets were not normalized.
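For reference, here is a sketch of that normalization, assuming simple min-max scaling onto the 12-bit range (the exact script is not reproduced here):

```python
import numpy as np

def normalize_to_12bit(frames: np.ndarray) -> np.ndarray:
    # Map arbitrary float pixel values onto 0..4095 and round to nearest integer.
    lo, hi = frames.min(), frames.max()
    scaled = (frames - lo) / (hi - lo) * 4095.0
    return np.rint(scaled).astype(np.uint16)
```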
> Is the reported compression ratio relative to the file size in 16 bit integers, not relative to the original MRC files (some in 8 bit int, some in 32 bit float)?
The reported compression ratios are relative to the file size in 16-bit integers, as data is typically not stored in 12 bits.
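To illustrate this baseline with purely hypothetical numbers:

```python
# Supplementary Figure S7 convention (illustrative numbers only):
# the baseline is the data stored as 16-bit integers.
n_pixels = 4096 * 4096 * 100                  # a hypothetical 100-frame movie
baseline_bytes = n_pixels * 2                 # stored as uint16
compressed_bytes = 419_430_400                # example compressed size
ratio = baseline_bytes / compressed_bytes     # 8.0 in this example
```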
> When you wrote "identical to the input", do you mean the input after normalizing to 16 bit (actually 12 bit) integers, not the original 32 bit floating points?
By "identical to the input" we mean identical to the normalized 12 bit integers.
> Could you post the list of file names and command-line options you used for Figure 7a and Supplementary Figure 7a? I guess you didn't compress all movies from these EMPIAR entries, but only a random subset.
Yes, I only used a small random subset. I will list them here in a follow-up comment. As for the options, you can specify them using a config file (see the examples in pyReCoDe/configs). For the experiments in Figure 7, I set the config parameters programmatically, as each dataset requires a slightly different set of parameters. As an example, for the experiments in Supplementary Figure S7, a config can be created as follows:
```python
# Assumed import path; adjust to match your pyReCoDe installation.
from pyrecode.params import InputParams

# Start from one of the example configs shipped with pyReCoDe.
input_params = InputParams()
input_params.load('pyReCoDe/configs/recode_params_minimal_read_write_test.txt')

# Describe the shape of the movie to be compressed.
input_params._param_map['num_cols'] = _data.shape[2]
input_params._param_map['num_rows'] = _data.shape[1]
input_params._param_map['num_frames'] = _data.shape[0]

# Bit depths and data types of the source and target.
input_params._param_map['source_bit_depth'] = _data.itemsize * 8  # 16 for 16-bit
input_params._param_map['target_bit_depth'] = _data.itemsize * 8  # 16 for 16-bit
input_params._param_map['source_data_type'] = 0  # unsigned int
input_params._param_map['target_data_type'] = 0  # unsigned int
```
where "_data" is a numpy array containing the data to be compressed.
Thank you very much for the clarification. This is very useful.
> don't do L4 reduction on post-counted data!
This makes perfect sense.
> The reported compression ratios are relative to the file size in 16-bit integers, as data is typically not stored in 12 bits. By "identical to the input" we mean identical to the normalized 12-bit integers.
I see. Now I understand how Supplementary Figure 7a was calculated.
> In Figure 7 in the main text, the EMPIAR datasets were not normalized.
Then what about the main Figure 7a (not Supplementary Figure 7)?
Are the compression ratios relative to the original file size (some float32, some int8)? Are the results lossless against the original float32?
Our cluster storage is always 90% full, so I am very interested in trying efficient compression algorithms.
> Then what about the main Figure 7a (not Supplementary Figure 7)?
> Are the compression ratios relative to the original file size (some float32, some int8)?
The compression ratios are relative to the size of the original data, i.e., the number of bytes needed to store the data in the original format. For instance, one 100x100 frame of float32 requires 40000 bytes; this is considered the original size. We adopted this approach to make the results comparable across the different source file formats, since some formats, such as TIFF, internally support LZW compression, whereas others, such as MRC, do not.
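In code, this convention amounts to nothing more than:

```python
import numpy as np

# "Original size": bytes needed to store the data in its source dtype,
# ignoring any container-level compression (e.g. LZW inside TIFF).
def original_size_bytes(shape, dtype) -> int:
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

original_size_bytes((100, 100), np.float32)  # 40000, matching the example above
```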
> Are the results lossless against the original float32?
Yes, the results are lossless against the original float32.
> relative to the size of the original data
> Yes, the results are lossless against the original float32.
Great! Then the result looks very promising.