bluesky / suitcase-core

data export facilities for NSLS-II
https://blueskyproject.io/suitcase

Suitcase export error at CHX #9

Closed · ericdill closed this 5 years ago

ericdill commented 8 years ago

@ordirules reported this export error to me via email:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-2f4252d32139> in <module>()
----> 1 suitcase.export(cfndb,"/home/lhermitte/test.hd5")

/opt/conda_envs/analysis/lib/python3.4/site-packages/suitcase-0.2.2-py3.4.egg/suitcase.py in export(headers, filename)
     46                 data_group = desc_group.create_group('data')
     47                 ts_group = desc_group.create_group('timestamps')
---> 48                 [fill_event(e) for e in events]
     49                 for key, value in data_keys.items():
     50                     value = dict(value)

/opt/conda_envs/analysis/lib/python3.4/site-packages/suitcase-0.2.2-py3.4.egg/suitcase.py in <listcomp>(.0)
     46                 data_group = desc_group.create_group('data')
     47                 ts_group = desc_group.create_group('timestamps')
---> 48                 [fill_event(e) for e in events]
     49                 for key, value in data_keys.items():
     50                     value = dict(value)

/opt/conda_envs/analysis/lib/python3.4/site-packages/databroker/databroker.py in fill_event(event, handler_registry, handler_overrides)
    280         if is_external.get(data_key, False):
    281             if data_key not in handler_overrides:
--> 282                 event.data[data_key] = fs.retrieve(value, handler_registry)
    283             else:
    284                 mock_registry = mock_registries[data_key]

/opt/conda_envs/analysis/lib/python3.4/site-packages/filestore/api.py in get_data(eid, handle_registry)
    154         handle_registry = {}
    155     with _FS_SINGLETON.handler_context(handle_registry) as fs:
--> 156         return fs.get_datum(eid)
    157
    158

/opt/conda_envs/analysis/lib/python3.4/site-packages/filestore/fs.py in get_datum(self, eid)
    113         return _get_datum(self._datum_col, eid,
    114                           self._datum_cache, self.get_spec_handler,
--> 115                           logger)
    116
    117     def register_handler(self, key, handler, overwrite=False):

/opt/conda_envs/analysis/lib/python3.4/site-packages/filestore/core.py in get_datum(col, eid, datum_cache, get_spec_handler, logger)
     36                         "datum cache can hold.")
     37
---> 38     handler = get_spec_handler(datum['resource'])
     39     return handler(**datum['datum_kwargs'])
     40

/opt/conda_envs/analysis/lib/python3.4/site-packages/filestore/fs.py in get_spec_handler(self, resource)
    205
    206         spec = resource['spec']
--> 207         handler = self.handler_reg[spec]
    208         key = (str(resource['_id']), handler.__name__)
    209

/opt/conda_envs/analysis/lib/python3.4/collections/__init__.py in __getitem__(self, key)
    803             except KeyError:
    804                 pass
--> 805         return self.__missing__(key)            # support subclasses that define __missing__
    806
    807     def get(self, key, default=None):

/opt/conda_envs/analysis/lib/python3.4/collections/__init__.py in __missing__(self, key)
    795
    796     def __missing__(self, key):
--> 797         raise KeyError(key)
    798
    799     def __getitem__(self, key):

KeyError: 'AD_EIGER'
ericdill commented 8 years ago

Suggested fix was:

from eiger_io.fs_handler import LazyEigerHandler
from filestore.api import register_handler
register_handler("AD_EIGER", LazyEigerHandler)
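
For context, filestore looks handlers up by the resource's spec (the 'AD_EIGER' key in the traceback above), constructs the handler from the resource, and then calls the instance with each datum's kwargs to produce the data. A hypothetical minimal handler, just to illustrate the interface (the names here are made up; the real one is LazyEigerHandler):

import numpy as np

class MinimalImageHandler:
    """Hypothetical handler sketch -- not the real Eiger handler."""

    def __init__(self, resource_path, **resource_kwargs):
        # Constructed once per resource by filestore's get_spec_handler.
        self.path = resource_path

    def __call__(self, **datum_kwargs):
        # Called once per datum; must return the actual data.
        return np.zeros((2070, 2167))  # placeholder frame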
ericdill commented 8 years ago

@ordirules also asked about the possibility of not exporting the Eiger data into the hdf5 file. That is a great suggestion, but for now I suggested that he could just copy the export function and comment out https://github.com/NSLS-II/suitcase/blob/master/suitcase.py#L48.

ericdill commented 8 years ago

Now there's a new error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-51-630ca4a4c95a> in <module>()
----> 1 export(cfndb, "/home/lhermitte/test.hd5")

<ipython-input-49-b23001810fc6> in export(headers, filename, savedata)
     56                     data = [e['data'][key] for e in events]
     57                     dataset = data_group.create_dataset(
---> 58                         key, data=data, compression='gzip', fletcher32=True)
     59                     # Put contents of this data key (source, etc.)
     60                     # into an attribute on the associated data set.

/opt/conda_envs/analysis/lib/python3.4/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    101         """
    102         with phil:
--> 103             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    104             dset = dataset.Dataset(dsid)
    105             if name is not None:

/opt/conda_envs/analysis/lib/python3.4/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
     85         else:
     86             dtype = numpy.dtype(dtype)
---> 87         tid = h5t.py_create(dtype, logical=1)
     88 
     89     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create (/home/ilan/minonda/conda-bld/work/h5py/h5t.c:16162)()

h5py/h5t.pyx in h5py.h5t.py_create (/home/ilan/minonda/conda-bld/work/h5py/h5t.c:15993)()

h5py/h5t.pyx in h5py.h5t.py_create (/home/ilan/minonda/conda-bld/work/h5py/h5t.c:15953)()

TypeError: No conversion path for dtype: dtype('<U36')

But I'm not sure if this is a result of registering the eiger handler or copy/pasting the export code and commenting out the fill_event line. Can you show me the code that resulted in this error @ordirules?

jrmlhermitte commented 8 years ago

Here's the relevant piece of the code below. Thanks for the effort in getting this temporary fix to work. As an aside, if I want to share code, is there an easier/recommended way to do it (e.g. pastebin)?

import suitcase
from databroker import DataBroker as db, get_images, get_table, get_events, get_fields
from eiger_io.pims_reader import EigerImages
import datetime
from eiger_io.fs_handler import LazyEigerHandler
from filestore.api import register_handler
register_handler("AD_EIGER", LazyEigerHandler)

# The suitcase object (commented out data saving)
from collections import Mapping
import warnings
import h5py
import json
from metadatastore.commands import find_events
from databroker.databroker import fill_event

__version__ = "0.2.2"

def export(headers, filename, savedata=False):
    """
    Parameters
    ----------
    headers : a Header or a list of Headers
        objects returned by the Data Broker
    filename : string
        path to a new or existing HDF5 file
    savedata : bool, optional
        If True, fill external (filestore) data into the events before
        saving; default is False
    """
    with h5py.File(filename) as f:
        for header in headers:
            header = dict(header)
            try:
                descriptors = header.pop('descriptors')
            except KeyError:
                warnings.warn("Header with uid {header.uid} contains no "
                              "data.".format(header), UserWarning)
                continue
            top_group_name = header['start']['uid']
            group = f.create_group(top_group_name)
            _safe_attrs_assignment(group, header)
            for i, descriptor in enumerate(descriptors):
                # make sure it's a dictionary and trim any spurious keys
                descriptor = dict(descriptor)
                descriptor.pop('_name', None)

                desc_group = group.create_group(descriptor['uid'])

                data_keys = descriptor.pop('data_keys')
                _safe_attrs_assignment(desc_group, descriptor)

                events = list(find_events(descriptor=descriptor))
                event_times = [e['time'] for e in events]
                desc_group.create_dataset('time', data=event_times,
                                          compression='gzip', fletcher32=True)
                data_group = desc_group.create_group('data')
                ts_group = desc_group.create_group('timestamps')
                if savedata:
                    [fill_event(e) for e in events]
                for key, value in data_keys.items():
                    value = dict(value)
                    timestamps = [e['timestamps'][key] for e in events]
                    ts_group.create_dataset(key, data=timestamps,
                                        compression='gzip',
                                        fletcher32=True)
                    data = [e['data'][key] for e in events]
                    dataset = data_group.create_dataset(
                        key, data=data, compression='gzip', fletcher32=True)
                    # Put contents of this data key (source, etc.)
                    # into an attribute on the associated data set.
                    _safe_attrs_assignment(dataset, dict(value))

def _clean_dict(d):
    d = dict(d)
    for k, v in list(d.items()):
        # Store dictionaries as JSON strings.
        if isinstance(v, Mapping):
            d[k] = _clean_dict(d[k])
            continue
        try:
            json.dumps(v)
        except TypeError:
            d[k] = str(v)
    return d

def _safe_attrs_assignment(node, d):
    d = _clean_dict(d)
    for key, value in d.items():
        # Special-case None, which fails too late to catch below.
        if value is None:
            value = 'None'
        # Try storing natively.
        try:
            node.attrs[key] = value
        # Fallback: Save the repr, which in many cases can be used to
        # recreate the object.
        except TypeError:
            node.attrs[key] = json.dumps(value)

cfndb = db(user="CFN", start_time="2016-03-25", stop_time="2016-03-29")
detector = 'eiger4m_single_image'

export(cfndb, "/home/lhermitte/test.hd5")
ericdill commented 8 years ago

what is the type of cfndb?

jrmlhermitte commented 8 years ago

It's a list of headers returned by databroker (databroker.databroker.Header)

ericdill commented 8 years ago

Ok great. I didn't think that was an issue but I did want to check.

So HDF5 doesn't like unicode strings. Can you add a print(key) at the beginning of the for key, value in data_keys.items(): loop?
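
(For reference, a minimal standalone demonstration of why HDF5 rejects unicode arrays, assuming nothing beyond h5py and numpy:)

import h5py
import numpy as np

with h5py.File('/tmp/unicode_demo.h5', 'w') as f:
    # Raises TypeError: No conversion path for dtype('<U3'):
    # f.create_dataset('bad', data=np.array(['abc']))

    # Encoding to fixed-width bytes works:
    f.create_dataset('ok', data=np.array([s.encode('ascii') for s in ['abc']]))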

ericdill commented 8 years ago

(so that we can see which key is failing)

ericdill commented 8 years ago

You can share code on gist.github.com too, if that's any easier. Copy/pasting into a github issue is a pretty common pattern though.

jrmlhermitte commented 8 years ago

it's 'eiger4m_single_image'

ericdill commented 8 years ago

Can you print the header and show me the contents? I am specifically interested in the descriptors.

print(cfndb[0].descriptors)

jrmlhermitte commented 8 years ago

Here is the full header printed, with the raw output of the descriptor further below. Thanks!


======

  EventDescriptor
  ---------------
  configuration   :
          upper_ctrl_limit: 0.0                                     
          source          : PV:XF:11IDB-ES{Det:Eig4M}cam1:DetDist_RBV
          precision       : 3                                       
          units           : m                                       
          shape           : []                                      
          dtype           : number                                  
          lower_ctrl_limit: 0.0                                     
          upper_ctrl_limit: 0.0                                     
          source          : PV:XF:11IDB-ES{Det:Eig4M}cam1:BeamX_RBV 
          precision       : 3                                       
          units           : pixels                                  
          shape           : []                                      
          dtype           : number                                  
          lower_ctrl_limit: 0.0                                     
          upper_ctrl_limit: 0.0                                     
          source          : PV:XF:11IDB-ES{Det:Eig4M}cam1:Wavelength_RBV
          precision       : 4                                       
          units           : Angstro                                 
          shape           : []                                      
          dtype           : number                                  
          lower_ctrl_limit: 0.0                                     
          upper_ctrl_limit: 0.0                                     
          source          : PV:XF:11IDB-ES{Det:Eig4M}cam1:BeamY_RBV 
          precision       : 3                                       
          units           : pixels                                  
          shape           : []                                      
          dtype           : number                                  
          lower_ctrl_limit: 0.0                                     
        eiger4m_single_d: 4.84                                    
        eiger4m_single_b: 1209.0                                  
        eiger4m_single_w: 1.4251056909561157                      
        eiger4m_single_b: 1327.0                                  
        eiger4m_single_d: 1459084275.190846                       
        eiger4m_single_b: 1459084274.753193                       
        eiger4m_single_w: 1459084274.753201                       
        eiger4m_single_b: 1459084274.753195                       
  +-----------------------------+--------+------------+----------------+-----------+-----------------+-------------------------------------------+-------+
  | data keys                   | dtype  |  external  |  object_name   | precision |      shape      |                   source                  | units |
  +-----------------------------+--------+------------+----------------+-----------+-----------------+-------------------------------------------+-------+
  | eiger4m_single_image        | array  | FILESTORE: | eiger4m_single |           | [2070, 2167, 0] |         PV:XF:11IDB-ES{Det:Eig4M}         |       |
  | eiger4m_single_stats1_total | number |            | eiger4m_single |     0     |        []       | PV:XF:11IDB-ES{Det:Eig4M}Stats1:Total_RBV |       |
  | eiger4m_single_stats2_total | number |            | eiger4m_single |     0     |        []       | PV:XF:11IDB-ES{Det:Eig4M}Stats2:Total_RBV |       |
  | eiger4m_single_stats3_total | number |            | eiger4m_single |     0     |        []       | PV:XF:11IDB-ES{Det:Eig4M}Stats3:Total_RBV |       |
  | eiger4m_single_stats4_total | number |            | eiger4m_single |     0     |        []       | PV:XF:11IDB-ES{Det:Eig4M}Stats4:Total_RBV |       |
  | eiger4m_single_stats5_total | number |            | eiger4m_single |     0     |        []       | PV:XF:11IDB-ES{Det:Eig4M}Stats5:Total_RBV |       |
  +-----------------------------+--------+------------+----------------+-----------+-----------------+-------------------------------------------+-------+
  name            : primary                                 
  object_keys     :
    eiger4m_single  : ['eiger4m_single_image', 'eiger4m_single_stats1_total', 'eiger4m_single_stats2_total', 'eiger4m_single_stats3_total', 'eiger4m_single_stats4_total', 'eiger4m_single_stats5_total']
  run_start       : bab70c5b-6f8a-4f2e-9394-491ef449842e    
  time            : 1459160831.191348                       
  uid             : bada393a-c326-41b4-a746-82948ce624fd    

  RunStart
  --------
  beamline_id     : CHX                                     
  config          :
  detectors       : ['eiger4m_single']                      
  energy_keV      : 8.687                                   
  experiment      : XPCS                                    
  exposure_time   : 300.0                                   
  extra           : Aaron Stein remade the ReferenceDot sample (10 um lines making grid on 75 um pitch, with various dot patterns within)
  group           : chx                                     
  holder          : vacuum bar holder                       
  measure_type    : N6 Pattern 015                          
  name            : ReferenceDots03 again2 (air holder)     
  owner           : xf11id                                  
  plan_args       :
    delay           : 0                                       
    num             : 1                                       
    detectors       : [EigerSingleTrigger(prefix='XF:11IDB-ES{Det:Eig4M}', name='eiger4m_single', read_attrs=['file', 'stats1', 'stats2', 'stats3', 'stats4', 'stats5'], configuration_attrs=['beam_center_x', 'beam_center_y', 'wavelength', 'det_distance'], monitor_attrs=[])]
  plan_type       : Count                                   
  project         :                                         
  sample          :
    x               : 0.794786                                
    holder          : vacuum bar holder                       
    y               : -0.1821799999999998                     
    name            : ReferenceDots03 again2 (air holder)     
    extra           : Aaron Stein remade the ReferenceDot sample (10 um lines making grid on 75 um pitch, with various dot patterns within)
  scan_id         : 13858                                   
  sequence_ID     : 2751.0                                  
  time            : 1459160529.530022                       
  uid             : bab70c5b-6f8a-4f2e-9394-491ef449842e    
  user            : CFN                                     
  x               : 0.794786                                
  x_position      : 0.794786                                
  y               : -0.1821799999999998                     
  y_position      : -0.1821799999999998                     

  RunStop
  -------
  exit_status     : success                                 
  reason          :                                         
  run_start       : bab70c5b-6f8a-4f2e-9394-491ef449842e    
  time            : 1459160831.2157428                      
  uid             : 0abf0fa8-c558-455c-bcb3-2daa705c1e1b 

And finally the output of the descriptor:

[{'uid': 'bada393a-c326-41b4-a746-82948ce624fd', 'configuration': {'eiger4m_single': {'data_keys': {'eiger4m_single_det_distance': {'upper_ctrl_limit': 0.0, 'source': 'PV:XF:11IDB-ES{Det:Eig4M}cam1:DetDist_RBV', 'precision': 3, 'units': 'm', 'shape': [], 'dtype': 'number', 'lower_ctrl_limit': 0.0}, 'eiger4m_single_beam_center_x': {'upper_ctrl_limit': 0.0, 'source': 'PV:XF:11IDB-ES{Det:Eig4M}cam1:BeamX_RBV', 'precision': 3, 'units': 'pixels', 'shape': [], 'dtype': 'number', 'lower_ctrl_limit': 0.0}, 'eiger4m_single_wavelength': {'upper_ctrl_limit': 0.0, 'source': 'PV:XF:11IDB-ES{Det:Eig4M}cam1:Wavelength_RBV', 'precision': 4, 'units': 'Angstro', 'shape': [], 'dtype': 'number', 'lower_ctrl_limit': 0.0}, 'eiger4m_single_beam_center_y': {'upper_ctrl_limit': 0.0, 'source': 'PV:XF:11IDB-ES{Det:Eig4M}cam1:BeamY_RBV', 'precision': 3, 'units': 'pixels', 'shape': [], 'dtype': 'number', 'lower_ctrl_limit': 0.0}}, 'data': {'eiger4m_single_det_distance': 4.84, 'eiger4m_single_beam_center_x': 1209.0, 'eiger4m_single_wavelength': 1.4251056909561157, 'eiger4m_single_beam_center_y': 1327.0}, 'timestamps': {'eiger4m_single_det_distance': 1459084275.190846, 'eiger4m_single_beam_center_x': 1459084274.753193, 'eiger4m_single_wavelength': 1459084274.753201, 'eiger4m_single_beam_center_y': 1459084274.753195}}}, 'data_keys': {'eiger4m_single_image': {'shape': [2070, 2167, 0], 'dtype': 'array', 'source': 'PV:XF:11IDB-ES{Det:Eig4M}', 'object_name': 'eiger4m_single', 'external': 'FILESTORE:'}, 'eiger4m_single_stats4_total': {'source': 'PV:XF:11IDB-ES{Det:Eig4M}Stats4:Total_RBV', 'precision': 0, 'object_name': 'eiger4m_single', 'shape': [], 'dtype': 'number', 'units': ''}, 'eiger4m_single_stats2_total': {'source': 'PV:XF:11IDB-ES{Det:Eig4M}Stats2:Total_RBV', 'precision': 0, 'object_name': 'eiger4m_single', 'shape': [], 'dtype': 'number', 'units': ''}, 'eiger4m_single_stats5_total': {'source': 'PV:XF:11IDB-ES{Det:Eig4M}Stats5:Total_RBV', 'precision': 0, 'object_name': 'eiger4m_single', 'shape': [], 'dtype': 'number', 'units': ''}, 'eiger4m_single_stats3_total': {'source': 'PV:XF:11IDB-ES{Det:Eig4M}Stats3:Total_RBV', 'precision': 0, 'object_name': 'eiger4m_single', 'shape': [], 'dtype': 'number', 'units': ''}, 'eiger4m_single_stats1_total': {'source': 'PV:XF:11IDB-ES{Det:Eig4M}Stats1:Total_RBV', 'precision': 0, 'object_name': 'eiger4m_single', 'shape': [], 'dtype': 'number', 'units': ''}}, 'time': 1459160831.191348, '_name': 'EventDescriptor', 'name': 'primary', 'run_start': 'bab70c5b-6f8a-4f2e-9394-491ef449842e', 'object_keys': {'eiger4m_single': ['eiger4m_single_image', 'eiger4m_single_stats1_total', 'eiger4m_single_stats2_total', 'eiger4m_single_stats3_total', 'eiger4m_single_stats4_total', 'eiger4m_single_stats5_total']}}]
jrmlhermitte commented 8 years ago

And thanks for the gist reference. I agree, it sounds simpler to paste here for now. Thanks!

ericdill commented 8 years ago

Ok, so basically what I'm going to have you do is ignore any data that is external. At the top of the data_keys.items() for loop, add this:

for key, value in data_keys.items():
    if descriptor['data_keys'][key].get('external'):
        continue

That will skip adding any keys which are in filestore.

Alternatively, you can safely cast it to a string if you do want the filestore reference:

    data = [e['data'][key] for e in events]
    if descriptor['data_keys'][key].get('external'):
        data = [str(d) for d in data]

Does that make sense? Let me know how this works.

ericdill commented 8 years ago

Or you could also do

if 'external' in descriptor['data_keys'][key]:
jrmlhermitte commented 8 years ago

Thanks for explaining this in detail. As you said, the main bug was that the data strings were unicode; converting them fixed the issue. Ignoring external data as you suggested also let me skip writing out the image data, which is perfect. Here is what finally worked for me:

import unicodedata

data = [e['data'][key] for e in events]
if data_keys[key].get('external'):
    data = [unicodedata.normalize('NFKD', d).encode('ascii', 'ignore')
            for d in data]

When I look at the saved metadata (with hdfview), however, I don't see all the metadata that I know I saved with the files. Here is what I see for example: (http://imgur.com/kAa2ib3)

What I would like is for all the metadata to be saved, including custom keys that have been added. Is this possible? thanks!

edit: I should clarify there were two issues with this modification:

  1. The data needs to be converted (add import unicodedata somewhere): data = [unicodedata.normalize('NFKD', d).encode('ascii','ignore') for d in data]
  2. descriptor['data_keys'] was already popped into data_keys, so data_keys is what needs to be accessed, not descriptor['data_keys']. (Both fixes are combined in the sketch below.)
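
Putting both fixes together, the inner loop of the export code above would read roughly like this (an untested sketch):

import unicodedata

for key, value in data_keys.items():
    value = dict(value)
    timestamps = [e['timestamps'][key] for e in events]
    ts_group.create_dataset(key, data=timestamps,
                            compression='gzip', fletcher32=True)
    data = [e['data'][key] for e in events]
    # data_keys already holds what was popped off the descriptor
    if data_keys[key].get('external'):
        # unfilled external entries are filestore reference strings
        data = [unicodedata.normalize('NFKD', d).encode('ascii', 'ignore')
                for d in data]
    dataset = data_group.create_dataset(
        key, data=data, compression='gzip', fletcher32=True)
    _safe_attrs_assignment(dataset, dict(value))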
ericdill commented 8 years ago

The metadata that you are looking at is stored as attributes. Here is what I see when I select a header. Look at the bottom of the hdfview window.

[screenshot: hdfview window with the header's attributes shown at the bottom]

jrmlhermitte commented 8 years ago

oops my fault, yes it was hidden :-| Issue resolved, this is perfect and works great, thanks! :-)

ericdill commented 8 years ago

Thanks for the patience. I'll get to work on turning what we discussed in this thread into an example in the docs for this project (as time permits!)

jrmlhermitte commented 8 years ago

Great, thanks! For completeness, I also encountered another error: ValueError: Unable to create group (Name already exists). This happens when the file already exists. It might be a good idea to add a check for file existence and possibly an overwrite flag? Anyway, I just fixed it by modifying this line (adding a "w" to the h5py file opener):

    with h5py.File(filename,"w") as f:
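
A sketch of the existence check plus an overwrite flag (a hypothetical helper, not part of suitcase; h5py's 'w-' mode raises an error if the file already exists):

import h5py

def open_export_file(filename, overwrite=False):
    # 'w' truncates an existing file; 'w-' refuses to clobber one.
    mode = 'w' if overwrite else 'w-'
    return h5py.File(filename, mode)

# usage inside export():
# with open_export_file(filename, overwrite=True) as f:
#     ...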
ericdill commented 8 years ago

Ah, ok. I'll have a think about the best way to handle this. Thanks for the bug report and the clear explanation.

jrmlhermitte commented 8 years ago

Ok, thanks. Oh, also: the main idea for this is to have a temporary local database of the metadata. suitcase helps export it, but it might be preferable to convert from hdf5 to something that can be searched more efficiently. Do you have any recommendations? Thanks again for all the quick and efficient support!

ericdill commented 8 years ago

On the readme of this repository we enumerate two things that we want suitcase to do. As you say "suitcase helps export" which maps onto (1) and "the main idea for this is to have a temporary local database of the metadata" is (2). We have implemented (1) but not yet (2). Our goal is to be able to export the headers that you care about into a local databroker-like interface. I am glad to hear that you want something like this. It is validation that we are on the right track. We will get there, but for now I do not have any recommendations regarding how to turn the output of export() into a local database because I have not spent any time trying to solve that problem yet.

jrmlhermitte commented 8 years ago

Ok, sounds good, thanks. I'll search through the hdf5 file for now with some wrapper functions that look like database searches. That way we'll be ready for (2).

If I have time, I might also try to get a local database installed, but that's another beast (I was thinking a local MongoDB?). I mainly asked because I have a feeling a search through the hdf5 file may take a while. We'll see.

But yes, I agree, I think you're on the right track. I never thought of saving data this way before, but when you get used to it, you find that it's quite convenient and, in the long run, more scalable. I hope you can all withstand the nagging in the meantime from users like me :-P. Thanks again!

jrmlhermitte commented 8 years ago

Ok, one more question: which entry in this metadata is the filename? There are quite a few uids, and the ones I thought might be filenames I couldn't find in my file structure. If it's not there, how would I save the filename into this database? Thanks!

ericdill commented 8 years ago

filename of what, exactly?

jrmlhermitte commented 8 years ago

It's the filenames of the saved detector files. Yugang gave us this code to extract them:

from databroker import get_events
from filestore.path_only_handlers import RawHandler

def get_filenames(hdr, detector):
    '''Get the filenames for a header for the EIGER images.
    If not an EIGER data set (no EIGER), return an empty list.'''
    events = get_events(hdr, handler_overrides={detector: RawHandler})
    fns = list()
    for ev in events:
        hh = ev['data']
        if 'eiger4m_single_image' in hh:
            fns.append(hh['eiger4m_single_image'][0])
    return fns
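
Hypothetical usage, assuming db and get_events are imported as in the earlier snippet:

hdr = db[-1]  # most recent header
fns = get_filenames(hdr, 'eiger4m_single_image')
print(fns)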

I tried copying and pasting a subset of this into the suitcase export code and can't get it to work. Is this the right way to go about it? (If this should be another issue, let me know.) I can also paste my attempt and the error. It basically comes from the for ev in events line; it says there's a KeyError looking for 'descriptors'.

sorry this is a little vague

ericdill commented 8 years ago

Yeah I'd like to see your attempt and the error. That would help.

jrmlhermitte commented 8 years ago

ok, here is the code (below the desc_group line):

                desc_group = group.create_group(descriptor['uid'])
                # extra code to fetch filename (if EIGER file)
                events2 = list(get_events(header, handler_overrides={detector: RawHandler}))

                for ev in events2:
                    hh = ev['data']
                    if 'eiger4m_single_image' in hh:
                        filename = hh['eiger4m_single_image'][0]
                        _safe_attrs_assignment(dataset, {'filename' : filename})

and here is the current error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-39-f5b54e8b1574> in <module>()
----> 1 export(cfndb, "/home/lhermitte/notebooks/notebook-B004-mar262016/database-B004.hd5")

<ipython-input-38-87a44f4fd813> in export(headers, filename, savedata)
     42                 desc_group = group.create_group(descriptor['uid'])
     43                 # extra code to fetch filename (if EIGER file)
---> 44                 events2 = list(get_events(header, handler_overrides={detector: RawHandler}))
     45 
     46                 for ev in events2:

/opt/conda_envs/analysis/lib/python3.4/site-packages/databroker/databroker.py in get_events(headers, fields, fill, handler_registry, handler_overrides)
    364         fields = []
    365     fields = set(fields)
--> 366     _check_fields_exist(fields, headers)
    367 
    368     for header in headers:

/opt/conda_envs/analysis/lib/python3.4/site-packages/databroker/databroker.py in _check_fields_exist(fields, headers)
    634         if stop is not None:
    635             all_fields.update(header['stop'])
--> 636         for descriptor in header['descriptors']:
    637             all_fields.update(descriptor['data_keys'])
    638             objs_conf = descriptor.get('configuration', {})

KeyError: 'descriptors'
ericdill commented 8 years ago

Are you in a notebook?

Can you drop into pdb and print the contents of header and report it here?

ericdill commented 8 years ago

Or just wrap line 44 in a try/except KeyError and print the header that is erroring

jrmlhermitte commented 8 years ago

I'm in a notebook. I tried moving to ipython but received some other error (let me know if you prefer that).

All headers are erroring; here is one (I've replaced some names with 'x's):

{'start': {'measure_type': 'xxxxxxxxxxxx', 'sample': {'name': 'xxxxxxxxxxxxxx', 'y': -0.1821799999999998, 'holder': 'xxxxxxx', 'x': 0.794786, 'extra': 'xxxxxxxxxxxxxxxxxxxxx'}, 'sequence_ID': 2751.0, 'config': {}, 'scan_id': 13858, 'project': '', 'experiment': 'xxxxxxx', 'uid': 'bab70c5b-6f8a-4f2e-9394-491ef449842e', 'x': 0.794786, 'x_position': 0.794786, 'y_position': -0.1821799999999998, 'holder': 'xxxxxxxxxxx', 'owner': 'xf11id', 'time': 1459160529.530022, 'plan_args': {'detectors': "[EigerSingleTrigger(prefix='XF:11IDB-ES{Det:Eig4M}', name='eiger4m_single', read_attrs=['file', 'stats1', 'stats2', 'stats3', 'stats4', 'stats5'], configuration_attrs=['beam_center_x', 'beam_center_y', 'wavelength', 'det_distance'], monitor_attrs=[])]", 'num': '1', 'delay': '0'}, 'energy_keV': 8.687, 'y': -0.1821799999999998, 'detectors': ['eiger4m_single'], 'beamline_id': 'CHX', 'plan_type': 'Count', 'user': 'CFN', 'name': 'xxxxxxxxx', 'group': 'chx', '_name': 'RunStart', 'exposure_time': 300.0, 'extra': 'xxxxxxxxxxxxxxxxxxxxxxx'}, '_name': 'header', 'stop': {'time': 1459160831.2157428, '_name': 'RunStop', 'uid': '0abf0fa8-c558-455c-bcb3-2daa705c1e1b', 'reason': '', 'exit_status': 'success', 'run_start': 'bab70c5b-6f8a-4f2e-9394-491ef449842e'}}

ericdill commented 8 years ago

can you pprint the header?

import pprint
pprint.pprint(header)

ericdill commented 8 years ago

I know I'm being sort of fussy here, but it is so much easier to read, sorry :cry:

jrmlhermitte commented 8 years ago

No you're not, and thanks for the lib suggestion, I find it useful! (Sorry for my delay; the notebook crashed from all the output. I probably should have added a return statement after the first print :-P)

here is the output:

{'_name': 'header',
 'start': {'beamline_id': 'CHX',
           'config': {},
           'detectors': ['eiger4m_single'],
           'energy_keV': 8.687,
           'experiment': 'XPCS',
           'exposure_time': 300.0,
           'extra': 'xxxxxx',
           'group': 'chx',
           'holder': 'xxxxx',
           'measure_type': 'xxxxx',
           'name': 'xxxxxxxxx',
           'owner': 'xf11id',
           'plan_args': {'delay': '0',
                         'detectors': "[EigerSingleTrigger(prefix='XF:11IDB-ES{Det:Eig4M}', "
                                      "name='eiger4m_single', "
                                      "read_attrs=['file', 'stats1', "
                                      "'stats2', 'stats3', 'stats4', "
                                      "'stats5'], "
                                      "configuration_attrs=['beam_center_x', "
                                      "'beam_center_y', 'wavelength', "
                                      "'det_distance'], monitor_attrs=[])]",
                         'num': '1'},
           'plan_type': 'Count',
           'project': '',
           'sample': {'extra': 'xxxxxxx',
                      'holder': 'xxxxx',
                      'name': 'xxxxxx',
                      'x': 0.794786,
                      'y': -0.1821799999999998},
           'scan_id': 13858,
           'sequence_ID': 2751.0,
           'time': 1459160529.530022,
           'uid': 'bab70c5b-6f8a-4f2e-9394-491ef449842e',
           'user': 'CFN',
           'x': 0.794786,
           'x_position': 0.794786,
           'y': -0.1821799999999998,
           'y_position': -0.1821799999999998},
 'stop': {'exit_status': 'success',
          'reason': '',
          'run_start': 'bab70c5b-6f8a-4f2e-9394-491ef449842e',
          'time': 1459160831.2157428,
          'uid': '0abf0fa8-c558-455c-bcb3-2daa705c1e1b'}}
ericdill commented 8 years ago

Haha yeah, one print would be good :-D

I use the following pattern:

try:
    something_that_raises()
except SomeException as e:
    pprint(helpful_information)
    raise
ericdill commented 8 years ago

What I find confusing is that there are no descriptors in that header. What the heck? Do all of the headers lack a descriptor?

ericdill commented 8 years ago

Can you share the full code that caused this?

jrmlhermitte commented 8 years ago

I'm not sure; it could be that I've done something wrong. Here it is, thanks!

(https://gist.github.com/ordirules/a0f99e8f5030b8d3f7f45b67b2dd9689)

ericdill commented 8 years ago

Got the code, thanks. I'll try to run this on the CHX kernel and see if I can figure something out. I'll get back to you within about an hour.

ericdill commented 8 years ago

Had to go to the post office. Will start looking at this in about 15 min

jrmlhermitte commented 8 years ago

Thanks, I just noticed something: descriptors is popped off the header. I just replaced that line with: descriptors = header['descriptors']

Seems to work so far. I'll keep playing and let you know if I have any other issues. Sorry to take your time on this with my one special case :-(
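
For reference, the difference between the two (a minimal illustration):

descriptors = header.pop('descriptors')  # removes the key, so a later
                                         # get_events(header, ...) raises
                                         # KeyError: 'descriptors'
descriptors = header['descriptors']      # reads it without mutating the header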

ericdill commented 8 years ago

Ah, are you talking about this line? That would explain why the code is barfing. Good catch :-D

jrmlhermitte commented 8 years ago

Yep, that's what I meant. I just checked and tried it, and was able to read just fine. :-) For my purposes, this will work. Thanks again for your time :-)

Anyway, I do have one request/comment. Often when acquiring data, a 2D image (or some other supplemental data) will be saved for us in a separate file (as with almost every other group). Currently, using bluesky, it is quite difficult to actually find what that file name is supposed to be so that the file containing this data can be found and read. (None of the uids match the file names.)

I think that whenever supplemental data is stored outside the json key, a string referring to its filename should be stored in that key, somewhere. Currently, I had to use a complicated workaround that @yugangzhang wrote to help us (thanks Yugang).

I know it's discouraged, but I think it makes sense. At least for something like suitcase. When packaging the user's data, I think it makes more sense to leave the large supplemental files as is in some file structure, and simply give the relative paths (+ filenames) into that structure. That way, for example, when a user wants to update their local db with more data, they have the option of just downloading the metadata and merging the files into their file structure separately etc.

I think it will very often happen that an experimentalist wants to know the filename of their supplemental data so they can open it with other software, share it, etc.

What do you think? You know more about the plans of the databroker, perhaps there is a better way.

Thanks again!

ericdill commented 8 years ago

tl;dr, the things that you are requesting are on our todo list.


There is not a guaranteed 1:1 mapping of an image to a filename in the databroker stack, especially once we start dumping all data into hdf5 files (for storage reasons). That is one reason why giving a filename back is not guaranteed to be helpful.

I think it makes more sense to leave the large supplemental files as is in some file structure, and simply give the relative paths (+ filenames) into that structure

I am not sure how much that would actually help you here. You would still need to physically move the data from the CHX server to your local drive so that you could access it. It sounds like what you really want is to be able to update the filestore database with a new file location.

I know it's discouraged, but I think it makes sense.

A large part of the reason why it is discouraged is because we cannot reliably support moving data yet. @tacaswell is currently working on adding a move() command into filestore so that we can better support exporting data.

Currently, I had to use a complicated workaround

Part of the reason why it is complicated is that we do not yet support the notion of "moving" files that are already in filestore. Once we have sorted that out (@tacaswell is currently working on it), this should get much simpler.

jrmlhermitte commented 8 years ago

ok great. I just wanted to give feedback to try to help somewhat with meeting users' needs but sounds like you guys have already considered all this and are working on it. :-)

About the file downloading, I can explain. Currently, what I'm doing is extracting the metadata and then rsync'ing the folders with the data we've taken on CHX onto our local servers. When I read the files, I locally set a parent directory in my routines and extract the relative paths from the saved filenames. However, when we're taking data in real time, what I will sometimes do is create a mount point with sshfs directly into CHX's file structure. Running our code from my laptop/workstation/at home or wherever is then a matter of changing the parent directory. What we're doing requires a bit of tweaking and playing around sometimes, so simply using existing notebooks on the nsls2 server might not be enough.

My case might be a little extreme and maybe it's not as common. However, I think it'd be nice for databroker to support it by allowing users to select whether or not to download the large files (like detector files) along with the metadata. If they opt out, then they're responsible for retrieving those larger files from the beamline.

There is just one issue I'm worried about from my point of view. I'm a little bit worried about the idea of dumping all data into the same hdf5 file, even if it ends up saving space. The way I see it is that as a user coming to a beamline, I expect to extract some data that I need. It could be a processed result, or raw files. With this data, there's also metadata (time, location, sample name etc). I think both these quantities should be kept separate and not be abstracted into the same grab bag of data.

Anyway, that's my point of view, but I'm honestly flexible. You've heard my comment and I trust your actions. I definitely like the overall structure of storing metadata in some general database. It helps reduce confusion. :)

thanks for the info :-)