European-XFEL / EXtra-data

Access saved EuXFEL data
https://extra-data.rtfd.io
BSD 3-Clause "New" or "Revised" License
7 stars 14 forks source link

Add data_availability method for multi-module detectors #504

Closed takluyver closed 8 months ago

takluyver commented 8 months ago

@turkot this is something we were discussing yesterday.

If you have 200 trains of data with 40 frames per train recorded by an AGIPD-1M detector (for instance), this would give you a (16, 8000) array, with True for frames that are present, and False for any that are missing.

Open questions:

  1. It could be the opposite, telling you where data is missing, if we think we'll want that more often. Flipping it with the ~ operator is easy in either case.
  2. It could give you results by train (16, 200) with either booleans for present/absent, or counts of how many frames are in the train (the count should be the same across modules, except where it's 0).
  3. We could also expose it on the detector object so agipd.data_availability() is the same as agipd['image.data'].data_availability() (using the _main_data_key).
turkot commented 8 months ago

Hi @takluyver , thank you for adding this functionality, it is very useful for validating data reduction results. Regarding you questions:

  1. It could be the opposite, telling you where data is missing, if we think we'll want that more often. Flipping it with the ~ operator is easy in either case.

I like the current approach more, with True for available data.

  1. It could give you results by train (16, 200) with either booleans for present/absent, or counts of how many frames are in the train (the count should be the same across modules, except where it's 0).

No, I think the current approach with data per frame is better.

  1. We could also expose it on the detector object so agipd.data_availability() is the same as agipd['image.data'].data_availability() (using the _main_data_key).

Yes, such simplification will very convenient - could you please add it? I think it would make sense to keep both interfaces, agipd[key].data_availability() for any data key, and agipd.data_availability() for the main data key.

turkot commented 8 months ago

Actually, would it make sense to flip the dimensions and have the frame index first? E.g. (8000, 16).

takluyver commented 8 months ago

Actually, would it make sense to flip the dimensions and have the frame index first? E.g. (8000, 16).

My reasoning for doing it this way round is that if you do e.g. agipd['image.data'].ndarray(), you would get an array of (16, 8000, 512, 128), so this matches the order for the relevant dimensions. Having dimensions in a predictable order seems like a pretty good idea, and it's easy to transpose them with Numpy if you want the other order. Do you think the (8000, 16) shape will be the one we want so often to override that?

takluyver commented 8 months ago

Thanks Oleksii! I've added the method to the detector classes now as well, along with a test for the behaviour.

turkot commented 8 months ago

Thank you Thomas!

Do you think the (8000, 16) shape will be the one we want so often to override that?

No, I think it is fine. I was just going from the direction that in the VDS files we have e.g.: data [float32: 155955 × 16 × 512 × 128]

LGTM!