European-XFEL / EXtra-data

Access saved EuXFEL data
https://extra-data.rtfd.io
BSD 3-Clause "New" or "Revised" License

Inconsistency between karabo-bridge-serve-files and live data #330

Open FilipeMaia opened 2 years ago

FilipeMaia commented 2 years ago

I'm using karabo-bridge-serve-files on recorded data to prepare for online analysis in an upcoming beamtime. It would be most useful if the data served by karabo-bridge-serve-files were as close as possible to live data. It seems that live data includes timestamps but data from karabo-bridge-serve-files does not.

Are there other differences between live data and that served by karabo-bridge-serve-files?

FilipeMaia commented 2 years ago

For example, previous code seems to suggest that live raw AGIPD data came in two arrays, image.data and image.gain. Is this still the case? This never happens with karabo-bridge-serve-files.
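A minimal sketch of how file-style raw data could be split into the two arrays mentioned above. It assumes the raw per-module array read from files has shape (frames, 2, 512, 128), with index 0 of the second axis holding the analog signal and index 1 the gain; whether the live stream really provides separate image.data and image.gain keys is exactly the open question here, and `split_raw_agipd` is a hypothetical helper, not part of EXtra-data or karabo-bridge.

```python
import numpy as np


def split_raw_agipd(image_data):
    """Split a file-style raw AGIPD array into separate data and gain arrays.

    Assumption: axis 1 has length 2 (index 0 = analog signal, index 1 = gain),
    i.e. a per-module shape of (frames, 2, 512, 128).
    """
    if image_data.ndim != 4 or image_data.shape[1] != 2:
        raise ValueError(f"unexpected raw AGIPD shape {image_data.shape}")
    return {
        "image.data": image_data[:, 0],  # analog signal
        "image.gain": image_data[:, 1],  # gain information
    }


# Small stand-in array: 3 frames, a data/gain pair, 4 x 2 pixels.
raw = np.arange(3 * 2 * 4 * 2, dtype=np.uint16).reshape(3, 2, 4, 2)
live_like = split_raw_agipd(raw)
```

Both returned arrays are views on the input, so no detector data is copied.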

FilipeMaia commented 2 years ago

Also, with live data the fastest-changing dimension in an AGIPD image.data array is the cell number, while that seems to be the slowest with data served by karabo-bridge-serve-files. Is there a better way to simulate live data?

FilipeMaia commented 2 years ago

I found there's a --dummy-timestamps parameter! But the other questions still remain.

philsmt commented 2 years ago

Hi Filipe,

Unfortunately, it is possible that there are inconsistencies between live and recorded data. This mostly comes down to differences in structure between Karabo's Hash protocol and how the DAQ ends up laying it out in HDF files, as well as the functional distinctions between online and offline corrections, the latter creating an entirely new set of files. I'm sorry this is not universally in a good state yet. A complete answer to your question unfortunately depends on a lot of factors, like which kind of input files are used for karabo-bridge-serve-files and where the karabo-bridge was connected, which are of course details you should not need to concern yourself with as a user.

Some initial observations I can probably make:

The fundamental difference in data layout between raw and corrected data, both offline and online, is intentional: it homogenizes the corrected data format as much as possible between detectors, not all of which share details such as gain thresholding. In addition, even when the gain information is carried along, the change in data type motivates moving it to a different key.

Did you consider whether you want to work with either raw data or corrected AGIPD data exclusively, or possibly both? Note that with the new correction software there is no performance difference between the two.

FilipeMaia commented 2 years ago

Hi Philipp,

Thanks for the detailed reply!

I understand that there will be differences between the data saved in hdf5 and the streamed data but it would be very useful if karabo-bridge-serve-files would be able to translate between the hdf5 format and the streamed format so we could use it for testing online analysis codes.

I think you're correct in identifying the reasons for the discrepancies. I'm currently using https://github.com/European-XFEL/EXtra-foam/blob/dev/extra_foam/pipeline/processors/image_assembler.py as a guide for the differences between online and offline.

Given that there are multiple ways that the data can be streamed online (e.g. you mentioned being able to reshape in any way you wish) does the stream contain some information to tell us how the data is being shaped or if it comes from a file or is live? Even some version information about the streamer could be useful.

We'll try out the corrected data, but it would be useful if one could have the option to also access the raw data.

FilipeMaia commented 2 years ago

Also is there any documentation on the new correction software/zmq bridge and is the code available somewhere?

philsmt commented 2 years ago

> I understand that there will be differences between the data saved in hdf5 and the streamed data but it would be very useful if karabo-bridge-serve-files would be able to translate between the hdf5 format and the streamed format so we could use it for testing online analysis codes.

Definitely, I've raised it internally and we should aim to offer options to solve this automatically. That being said, there's also a somewhat less documented tool (which is being addressed right now) called karabo-bridge-recorder (you will need to login, but it should be accessible) that records an actual data stream verbatim. Naturally it does not help after the fact, but can be used to replay an authentic online stream at any later point in time.

> Given that there are multiple ways that the data can be streamed online (e.g. you mentioned being able to reshape in any way you wish) does the stream contain some information to tell us how the data is being shaped or if it comes from a file or is live? Even some version information about the streamer could be useful.

Not really, the karabo-bridge so far was seen only for online analysis and not a universal data streaming format. I understand your motivation as maintainer of a cross-facility tool for versioning. Our current plan is to actually coalesce on a standardized format in the first place, but we will keep versioning it in mind for future changes. Concerning the memory order example, while it is not contained in the stream right now, it could easily be added in its string representation (it's specified by a letter code, e.g. cxy puts y as the fastest axis and cell as the slowest). Ultimately it should be set to whatever works best for you, as memory order can have quite a drastic impact at the data sizes we're speaking of with full-rate detector data.
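To illustrate the letter code, here is a sketch of how a consumer could turn two such codes into an axis permutation. It assumes only what is stated above: each letter names one axis and the letters are listed from slowest to fastest. The function names and the example sizes are hypothetical and are not part of karabo-bridge or the correction software.

```python
def axis_permutation(current, wanted):
    """Return the axes tuple that reorders layout `current` into `wanted`.

    Both arguments are letter codes listing the axes from slowest to
    fastest, e.g. "cxy" means cell is slowest and y is fastest.
    """
    if len(set(current)) != len(current) or sorted(current) != sorted(wanted):
        raise ValueError(f"cannot map {current!r} onto {wanted!r}")
    return tuple(current.index(axis) for axis in wanted)


def reordered_shape(shape, current, wanted):
    """Shape an array of `shape` would have after the reordering."""
    return tuple(shape[i] for i in axis_permutation(current, wanted))


# Cell-fastest layout converted to cell-slowest; sizes are made up.
perm = axis_permutation("xyc", "cyx")
new_shape = reordered_shape((128, 512, 64), "xyc", "cyx")
```

The resulting tuple can be passed directly as the `axes` argument of `numpy.transpose`. The transpose itself is only a view, so making the new order contiguous in memory still costs a copy, which matters at full detector rate.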

> Also is there any documentation on the new correction software/zmq bridge and is the code available somewhere?

We're preparing documentation for the new online correction software as we speak. You can find a build of the latest version here, but please keep in mind things are in flux and links may point to git.xfel.eu repositories. In most cases I would expect your account is able to access it anyway. It is still being expanded, but if you find anything particular missing, don't hesitate to tell us please!

FilipeMaia commented 2 years ago

> Definitely, I've raised it internally and we should aim to offer options to solve this automatically. That being said, there's also a somewhat less documented tool (which is being addressed right now) called karabo-bridge-recorder (you will need to login, but it should be accessible) that records an actual data stream verbatim.

Ah I didn't know about it. Do you guys have some sample stream recorded I could use?

> Not really, the karabo-bridge so far was seen only for online analysis and not a universal data streaming format. I understand your motivation as maintainer of a cross-facility tool for versioning. Our current plan is to actually coalesce on a standardized format in the first place, but we will keep versioning it in mind for future changes.

Standardising is good, but that does not remove the need for some version information (because standards evolve). You could even have version information from the different parts, for example from the bridge itself but also from the calibration pipeline (now calng). It should also contain information about memory order, as you suggest. This would make it much simpler for downstream software to handle the data (at the moment I'm looking at the different dimensions and guessing which ones correspond to the x and y axes of the modules, assuming they are 512x128, which is too fragile).
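A rough sketch of the kind of shape-based guessing described above, and of why it is fragile. It assumes modules of 512x128 pixels as stated; `guess_module_axes` is a hypothetical helper, not the code actually used downstream.

```python
def guess_module_axes(shape, module_shape=(512, 128)):
    """Guess which axes hold the long and short side of a detector module.

    Returns (long_axis, short_axis). Raises ValueError when the shape
    alone does not identify both axes unambiguously.
    """
    long_len, short_len = module_shape
    long_axes = [i for i, n in enumerate(shape) if n == long_len]
    short_axes = [i for i, n in enumerate(shape) if n == short_len]
    if len(long_axes) != 1 or len(short_axes) != 1:
        raise ValueError(f"cannot identify module axes in shape {shape}")
    return long_axes[0], short_axes[0]


# Works while no other axis happens to share a module dimension...
axes = guess_module_axes((16, 512, 128, 64))

# ...but is ambiguous as soon as, say, 128 memory cells are present.
try:
    guess_module_axes((128, 512, 128))
    ambiguous = False
except ValueError:
    ambiguous = True
```

An explicit memory-order field in the stream would make this guessing unnecessary.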

> We're preparing documentation for the new online correction software as we speak. You can find a build of the latest version here, but please keep in mind things are in flux and links may point to git.xfel.eu repositories.

That's great! You should spread this information more widely. I think many people would like to know exactly how the calibration is being done to be able to trust it.