BASIN-3D / basin3d

The core BASIN-3D application that uses a generalized data synthesis model that applies across a variety of earth science observation types (hydrology, geochemistry, climate etc.). https://basin3d.readthedocs.io/
Other
12 stars 4 forks source link

Refactor get_timeseries data to support hi res #132

Closed vchendrix closed 2 years ago

vchendrix commented 2 years ago

Currently get_timeseries_data can only handle dates and not timestamps with utc offsets. In order the support instananeous values we need to remove the timezone information from the timeseries timestamps and assume local time for the monitoring feature. The git diff below shows a proof of concept for this implementation.

Closes #131

Type of change

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Test Configuration

PR Self Evaluation

strikethrough things that don’t make sense for your PR

vchendrix commented 2 years ago

@dschristianson There is one issue with this DRAFT pull request. If I drop the NaNs that means metacata.records for reach variable may be in accurate. I chose to only drop NaNs for aggregation_duration < DAY. However, my guess it that I should update metadata.records for each variable after I drop the NaNs. Do you think that is the right approach. If so, should I just drop NaNs for all cases?

dschristianson commented 2 years ago

@dschristianson There is one issue with this DRAFT pull request. If I drop the NaNs that means metacata.records for reach variable may be in accurate. I chose to only drop NaNs for aggregation_duration < DAY. However, my guess it that I should update metadata.records for each variable after I drop the NaNs. Do you think that is the right approach. If so, should I just drop NaNs for all cases?

@vchendrix: It should be fine. The records are counted from the return so it will reflect the actual data values.

vchendrix commented 2 years ago

@dschristianson There is one issue with this DRAFT pull request. If I drop the NaNs that means metacata.records for reach variable may be in accurate. I chose to only drop NaNs for aggregation_duration < DAY. However, my guess it that I should update metadata.records for each variable after I drop the NaNs. Do you think that is the right approach. If so, should I just drop NaNs for all cases?

@vchendrix: It should be fine. The records are counted from the return so it will reflect the actual data values.

@dschristianson my main concern is that the user of the HDF file will not be able to account for the discrepency if they inspect it.

@vchendrix: it is still the same number of non-nan values. This is what the data users care about. Same situation occurs with DAY as well.

vchendrix commented 2 years ago

Thanks for getting this out so fast, @vchendrix!

To document our conversation yesterday. We currently need to support temporal_frequency / aggregation_duration: MINUTE (which is mapped to instantaneous in our WF-SFA use case), HOUR, DAY, MONTH. We'll need more work to fully think thru supporting aggregation_duration = NONE. Please add anything I missed!

Few items for this PR:

@dschristianson Thanks for capturing out conversation and reviewing this draft. I will follow up on your points above

vchendrix commented 2 years ago

Default day is implemented in the pydantic schema +++++++++++++++++++++++++++++++++ Val Hendrix @.*** Lawrence Berkeley National Lab

Mail Stop: 50B-2239 Room: 50B-2258E Phone: (510) 495-2905 Pronouns: she/hers +++++++++++++++++++++++++++++++++

On Fri, May 27, 2022 at 9:32 AM D.S. Christianson @.***> wrote:

@.**** requested changes on this pull request.

Thx for quick turnaround, @vchendrix https://github.com/vchendrix. One more item.

In basin3d/synthesis.py https://github.com/BASIN-3D/basin3d/pull/132#discussion_r883770063:

@@ -477,7 +477,7 @@ def get_timeseries_data(synthesizer: DataSynthesizer, location_lat_long: bool =

  • start_date Optional parameters for MeasurementTimeseriesTVPObservation:
  • end_date
    • aggregation_duration = resolution = DAY (only DAY is currently supported)
    • aggregation_duration = resolution in (SECOND, MINUTE, HOUR, DAY, MONTH) Default: DAY

To implement default: DAY which I think we should do, I suggest the following be made around new line 490 (sorry, again GitHub won't let me directly suggest).

  • Since temporal_resolution is optional, if it is None (presumably by default), then make the kwargs['aggregation_duration'] = TimeFrequencyEnum.DAY
  • I suggest only permitting SECOND, MINUTE, HOUR, DAY, MONTH for temporal_resolution by adding a check for these values. If check fails, fail to synthesize.

Otherwise, since the hardcoding of DAY in MeasurementTimeseriesTVPObservation in basin3d/core/synthesis.py was removed, we would be supporting all temporal frequencies / duration_aggregations and not explicitly handling the NONE case.

— Reply to this email directly, view it on GitHub https://github.com/BASIN-3D/basin3d/pull/132#pullrequestreview-987745842, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAL4M4MQ3OXETKOBB2AI2YLVMD2KDANCNFSM5XB6Y4UA . You are receiving this because you were mentioned.Message ID: @.***>

vchendrix commented 2 years ago

I think we leave since we are treating it as the same right now in get_timeseries_data. We can address it later. +++++++++++++++++++++++++++++++++ Val Hendrix @.*** Lawrence Berkeley National Lab

Mail Stop: 50B-2239 Room: 50B-2258E Phone: (510) 495-2905 Pronouns: she/hers +++++++++++++++++++++++++++++++++

On Fri, May 27, 2022 at 9:54 AM D.S. Christianson @.***> wrote:

@.**** commented on this pull request.

In basin3d/synthesis.py https://github.com/BASIN-3D/basin3d/pull/132#discussion_r883785651:

@@ -477,7 +477,7 @@ def get_timeseries_data(synthesizer: DataSynthesizer, location_lat_long: bool =

  • start_date Optional parameters for MeasurementTimeseriesTVPObservation:
  • end_date
    • aggregation_duration = resolution = DAY (only DAY is currently supported)
    • aggregation_duration = resolution in (SECOND, MINUTE, HOUR, DAY, MONTH) Default: DAY

Ah, awesome! Thanks for pointing that out. Any thoughts on restricting the temporal_resolution values?

— Reply to this email directly, view it on GitHub https://github.com/BASIN-3D/basin3d/pull/132#discussion_r883785651, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAL4M4NILD4ABGZPVVIHRO3VMD43DANCNFSM5XB6Y4UA . You are receiving this because you were mentioned.Message ID: @.***>

dschristianson commented 2 years ago

I think we leave since we are treating it as the same right now in get_timeseries_data. We can address it later. +++++++++++++++++++++++++++++++++ Val Hendrix @.*** Lawrence Berkeley National Lab Mail Stop: 50B-2239 Room: 50B-2258E Phone: (510) 495-2905 Pronouns: she/hers +++++++++++++++++++++++++++++++++

Apologies that I probably wasn't clear. My suggestion is to restrict temporal_resolution in get_timeseries_data to only the resolutions that we document supporting SECOND, MINUTE, HOUR, DAY, MONTH.

So something like the following around new line 490:

if temporal_resolution not in [TemporalFrequencyEnum.SECOND, TemporalFrequencyEnum.MINUTE, TemporalFrequencyEnum.HOUR, TemporalFrequencyEnum.DAY, TemporalFrequencyEnum.DAY]: raise SynthesisException(f'Specified temporal_resolution {temporal_resolution} is not supported.')

vchendrix commented 2 years ago

I think we leave since we are treating it as the same right now in get_timeseriesdata. We can address it later. +++++++++++++++++++++++++++++++++ Val Hendrix @_.*** Lawrence Berkeley National Lab Mail Stop: 50B-2239 Room: 50B-2258E Phone: (510) 495-2905 Pronouns: she/hers +++++++++++++++++++++++++++++++++

Apologies that I probably wasn't clear. My suggestion is to restrict temporal_resolution in get_timeseries_data to only the resolutions that we document supporting SECOND, MINUTE, HOUR, DAY, MONTH.

So something like the following around new line 490:

if temporal_resolution not in [TemporalFrequencyEnum.SECOND, TemporalFrequencyEnum.MINUTE, TemporalFrequencyEnum.HOUR, TemporalFrequencyEnum.DAY, TemporalFrequencyEnum.DAY]: raise SynthesisException(f'Specified temporal_resolution {temporal_resolution} is not supported.')

The pydantic schema will catch this

dschristianson commented 2 years ago

My understanding is that we're not yet really supportingYEAR and we are not yet supporting NONE. Thus the idea is to explicitly not support these in get_timeseries_data.

vchendrix commented 2 years ago

My understanding is that we're not yet really supportingYEAR and we are not yet supporting NONE. Thus the idea is to explicitly not support these in get_timeseries_data.

I am not aware of any references to 'NONE' in this PR. The TimeFrequencyEnum does not reference it either

As for YEAR, there is not reason to not support YEAR. Some plugins may implement it. I think it is a question for the refactor as to how we communicate to the user which query params are supported by a data source.