bnlawrence / cfs

Rebooting cfstore, leaner and meaner
0 stars 1 forks source link

Unexpected result with cf.wi and cf.cellwi #30

Closed bnlawrence closed 2 weeks ago

bnlawrence commented 2 weeks ago

I am trying to subset a manifest file using the bounds that are stored in the database.

The particular test that is manifesting the issue will fail with the pytest -s -x tests/realistic in the within branch.

The code which is causing the error can be found in cfs/db/cfa_tools.py

def get_quark_field(instance, start_date, end_date):
    """ Given a CFS manifest field (i.e. an instance of the Manifest class 
    found in the django models.py) and a pair of bounding dates, return
    a CF field instance with the information needed to construct a 
    subspace manifest.
    : instance : A CFS manifest instance
    : start_date : A date in a cf.Data instance
    : end_date : A date in a cf.Data instance.
    : output : A CF field instance with the correct bounds, and 
    a set of fragment ids in the field data.
    """

    fld = cf.Field(properties={'long_name':'fragments'})

    fragments = np.array([f.id for f in instance.fragments.files.all()])
    T_axis= fld.set_construct(cf.DomainAxis(fragments.size))
    fld.set_data(fragments, axes=T_axis)

    binary_stream = io.BytesIO(instance.bounds)
    binary_stream.seek(0)
    bounds = np.load(binary_stream)

    timedata = np.mean(bounds, axis=1)

    dimT = cf.DimensionCoordinate(properties={'standard_name':'time',
                    'units':cf.Units(instance.units,calendar=instance.calendar)},
                data = timedata,
                bounds = cf.Bounds(data=bounds)
                )
    bounds_range(dimT.bounds)
    fld.set_construct(dimT, axes=T_axis)
    nf, nt = len(fragments), len(timedata)
    if nf != nt:
        raise RuntimeError(
            'Number of of manifest fragments ({nf}) not equal to time data length ({nt}).')
    print('Subspacing using cellwi to ',start_date, end_date)
    quark = fld.subspace(time=cf.cellwi(start_date, end_date))
    dimT = quark.dimension_coordinate('T')
    bounds_range(dimT.bounds)
    return quark

This code, when run in the test yields this output (before failing the assertion in the error below):

... 
Bounds range  [[1950-01-01 06:00:00, 1950-02-01 00:00:00]] 360_day [[2014-12-01 06:00:00, 2015-01-01 00:00:00]] 360_day
Subspacing using cellwi to  1950-01-01 06:00:00 360_day 1970-12-10 00:00:00 360_day
Bounds range  [[1950-01-01 06:00:00, 1950-02-01 00:00:00]] 360_day [[1970-11-01 06:00:00, 1970-12-01 00:00:00]] 360_day
...

We are expected it to grab the next cell.

The relevant test failure is this one:

 def test_quarks(django_dependencies):
        test_db, ignore , ignore = django_dependencies
        vars = test_db.variable.all()
        v = vars[0]
        td = v.time_domain

        #check get sensibele results from trying to get the same thing
        starting = cf.Data(td.starting, units=cf.Units(td.units, calendar=td.calendar))
        ending = cf.Data(td.ending, units=cf.Units(td.units, calendar=td.calendar))
        manifest = v.in_manifest
        assert manifest is not None
        quark = test_db.manifest.make_quark(manifest, starting, ending)
        assert quark.id == manifest.id
        assert quark.is_quark is False

        #ok, now see if we can get a subset
        ending = cf.Data(10., units=cf.Units(
            'days since 1970-11-30',calendar='360_day'))
        quark = test_db.manifest.make_quark(manifest, starting, ending)
        assert quark.id != manifest.id
        assert quark.is_quark is True
        print(quark)
        expected = 'cn134a_999_6hr_u_pt_cordex__197012-197012.nc(unavailable)'
        fragments = list(quark.fragments.files.all())
>       assert expected == fragments[-1]
E       AssertionError: assert 'cn134a_999_6hr_u_pt_cordex__197012-197012.nc(unavailable)' == <File: cn134a_999_6hr_u_pt_cordex__197011-197011.nc(unavailable)>

which occurs because it's not getting to the next cell where that fragment file should have been present.

davidhassell commented 2 weeks ago

Hi Bryan,

... 
Bounds range  [[1950-01-01 06:00:00, 1950-02-01 00:00:00]] 360_day [[2014-12-01 06:00:00, 2015-01-01 00:00:00]] 360_day
Subspacing using cellwi to  1950-01-01 06:00:00 360_day 1970-12-10 00:00:00 360_day
Bounds range  [[1950-01-01 06:00:00, 1950-02-01 00:00:00]] 360_day [[1970-11-01 06:00:00, 1970-12-01 00:00:00]] 360_day
...

But this looks right to me ... You asked for cells that lie completely inside 1950-01-01 06:00 to 1970-12-10 00:00, and that's what you got I realize, however, that the "completely" part of this was not made wholly clear :)

If you inspect the query you see that it's just a convenience function for creating a mode complicated Query object:

>>> print(cf.cellwi(start_date, end_date))
<CF Query: [lower_bounds(ge 1950-01-01 06:00:00) & upper_bounds(le 1970-12-10 00:00:00)]>

We can create a new "cell overlaps" query as follows, that should do what you want (edited):

>>> q1 = cf.ge(start_date, attr='upper_bounds')
>>> q2 = cf.le(end_date, attr='lower_bounds')
>>> cell_overlaps = q1 & q2
>>> cell_overlaps
<CF Query: [upper_bounds(ge 1950-01-01 06:00:00) & lower_bounds(le 1970-12-10 00:00:00)]>

and then

quark = fld.subspace(time=cell_overlaps(start_date, end_date))

I shall create this as a new cf.cell_overlaps function.

davidhassell commented 2 weeks ago

cf.cell_overlaps will be in the next version of cf-python: https://github.com/NCAS-CMS/cf-python/issues/824

bnlawrence commented 2 weeks ago

Great. For the moment I have stuffed an explicit version into my code, and will replace it when the new CF comes out. The explicit version passes my tests!