bluesky / suitcase-core

data export facilities for NSLS-II
https://blueskyproject.io/suitcase
Other
2 stars 13 forks source link

Using non-required keys in export templates #71

Open CJ-Wright opened 7 years ago

CJ-Wright commented 7 years ago

What is the best method for including keys into export templates when those keys might not exist? eg

template = '.../{start[sample_name]}/{start[folder_tag]}.../name.tift'
exporter = LiveTiffExporter(template=template)
RE.subscribe(exporter)
RE(count[det], md={'folder_tag': 'after SAXS'})
RE(cound[det], md={})

I presume that the second RE call will cause a crash as format won't know what to do with the missing folder_tag.

I can imagine users wanting this kind of functionality, where certain tags are used to denote deeper files, but the tags aren't always there.

Potential solution: Scrape the template for the templated areas, extract the needed keys then run doc.get(key, '') to get the values. Finally launder everything through pathlib to remove any // issues. Maybe we should move to a more expressive template rendering library? (jinja2?) This way we could also have more expressive for loops, eg for unknown amounts of auxiliary data.

CJ-Wright commented 7 years ago

Relevent: https://stackoverflow.com/questions/20248355/how-to-get-python-to-gracefully-format-none-and-non-existing-fields https://www.python.org/dev/peps/pep-3101/

CJ-Wright commented 7 years ago

also relevant https://stackoverflow.com/questions/17215400/python-format-string-unused-named-arguments

CJ-Wright commented 7 years ago

Potential strategy:

  1. Use SafeDict class (see here) to allow us to format with impunity (just keep trying to squish data in, its fine if we fail since it will just give back the unformatted chunk)
  2. Right before we go to write the file down create a defaultdict of only str, then call to format_map. This will format everything else as ''
  3. Cleanup any __ which were created in the removal process
  4. Cleanup any other path related problems via laundering through Pathlib
CJ-Wright commented 7 years ago

example:

from collections import defaultdict
import re
from pathlib import Path
class SafeDict(dict):
    def __missing__(self, key):
            return '{' + key + '}'

a = '{sample_name}/{folder_tag}/{analysis_stage}/{human_timestamp}_' \
    '{auxiliary}_{ext}_{ext2}.txt'
b = a.format_map(SafeDict(sample_name='hi'))
c = b.format_map(SafeDict(analysis_stage='raw', human_timestamp='time',
                          auxiliary='diff_x=.1'))

d = c.format_map(defaultdict(str))
print(d)
pattern_after_substitution= re.sub(r"\_\_+", "_", d)
e = pattern_after_substitution.replace('_.', '.')
print(e)
f = Path(e).as_posix()
print(f)
hi//raw/time_diff_x=.1__.txt
hi//raw/time_diff_x=.1.txt
hi/raw/time_diff_x=.1.txt

Edit: better regex (this is what happens when you don't go with the highest voted result on stack overflow)

danielballan commented 7 years ago

Thanks for finding all these links. I think custom formatted in PEP 3101 looks like the most promising solution.

CJ-Wright commented 7 years ago

I think the plan going forward (which is somewhat agnostic to how we actually do the subs/cleaning) is:

  1. Take in Eventify(ed) raw data (so we have access to the top level metadata)
  2. Take in raw data (so we have access to event level data (motor positions, temp, etc.))
  3. Take in calculated human readable timestamps from events
  4. Create a stream of rendered templates using map and lambda s, **x: s.format_map(SafeDict(**x))

Now per analyzed data output type (tiff, dark_sub tiff, mask, iq, gr, etc.)

  1. Take in Eventify(ed) analysis data (so we have access to the top level metadata for the analysis)
  2. Take in an extension
  3. Render rendered template with analysis data using map and lambda s, **x: s.format_map(SafeDict(**x))
  4. Clean template with map and a cleaning function
  5. use zip and map to write to file with standard saving functions (save_output, fit2d_save, imsave, etc.)

Side note, on the cleaning side we may want to replace unrendered string chunks with zonk or some other default string. This way we can look for anything which fits the pattern _bla=zonk_ and replace it with _. We can then replace any remaining zonk (like in the filename) with '' and then continue on our normal __ and path based cleaning.

CJ-Wright commented 7 years ago

Here are two classes of note, the first partially renders the template, the other replaces all un-renderable sections with ''

class PartialFormatter(string.Formatter):
    def get_field(self, field_name, args, kwargs):
        # Handle a key not found
        try:
            val = super(PartialFormatter, self).get_field(field_name, args,
                                                          kwargs)
            # Python 3, 'super().get_field(field_name, args, kwargs)' works
        except (KeyError, AttributeError):
            val = '{' + field_name + '}', field_name
        return val

    def format_field(self, value, spec):
        # handle an invalid format
        if value is None:
            return spec
        try:
            return super(PartialFormatter, self).format_field(value, spec)
        except ValueError:
            return value[:-1] + ':' + spec + value[-1]

class PartialFormatterCleaner(string.Formatter):
    def get_field(self, field_name, args, kwargs):
        # Handle a key not found
        try:
            val = super(PartialFormatterCleaner, self).get_field(field_name,
                                                                 args,
                                                                 kwargs)
            # Python 3, 'super().get_field(field_name, args, kwargs)' works
        except (KeyError, AttributeError):
            val = '', field_name
        return val

    def format_field(self, value, spec):
        # handle an invalid format
        if value is None:
            return spec
        try:
            return super(PartialFormatterCleaner, self).format_field(value,
                                                                     spec)
        except ValueError:
            return ''

The basic idea is to do a bunch of formatting, then use the second class with defaultdict(str) to clear out the un-rendered sections. Then we just need to clean up any other parts and we're done.

danielballan commented 7 years ago

Nice. I was literally just working on this to try to squeeze it into this release for you. I have some opinions (bounced off of Tom). I'll work off of this and put up a PR for you to review.

CJ-Wright commented 7 years ago

Cool

CJ-Wright commented 7 years ago

You may find https://github.com/xpdAcq/xpdAn/pull/107/files#diff-3bdf813d65d2df127ea85da3dc3c3364 helpful. Specifically the template, and its janky cleanup