NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, I want to understand why data ingest takes so much longer with an unsorted CSV than with a sorted CSV #21

Open HankHerr-NOAA opened 3 months ago

HankHerr-NOAA commented 3 months ago

This relates to VLab User Support ticket #131828. The unsorted and sorted CSV files have been uploaded here:

https://drive.google.com/drive/folders/1-mBAjDUNf9COiw0dzly7mJ2aQg2BDSFD

Using a standalone pointing to a database and running on the NWC ised-dev1 machine, the evaluation took 1h 6m to complete using the unsorted data (where time series are written by time first, and then by feature). Using the sorted data (where time series are written by feature first, and then by time), the evaluation took 2m 21s. Both evaluations were run against a freshly cleaned database. The declaration for the sorted data is below; to reproduce the unsorted run, just modify the predicted source accordingly.
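For reference, the sorted file can be produced from the unsorted one with a stable sort by feature and then by time. A minimal sketch in Python follows, assuming hypothetical column names (feature, datetime) and placeholder file names; the actual headers in the attached CSVs may differ:

import pandas as pd

# Rewrite a time-major CSV as feature-major: group all rows for one
# feature together, ordered by time within each feature.
# Column and file names here are placeholders, not the real headers.
df = pd.read_csv("unsorted_ALL_HEFS.csv")
df = df.sort_values(["feature", "datetime"], kind="stable")
df.to_csv("sorted_ALL_HEFS.csv", index=False)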

Why such a stark difference? If it points to a code change to make, this ticket can be resolved once that change is made. Otherwise, this ticket can be resolved once we understand the underlying cause and decide that no change is needed.

Thanks,

Hank

=====================================

label: HEFS Evaluations RSA
observed:
  label: USGS Streamflow Observations
  sources:
  - interface: usgs nwis
    uri: https://nwis.waterservices.usgs.gov/nwis/iv
  variable:
    name: '00060'
  feature_authority: nws lid
  type: observations
predicted:
  label: HEFS RSA Forecast Test
  sources: [omitted]/sorted_ALL_HEFS.tgz
  variable:
    name: QINE
  feature_authority: nws lid
  type: ensemble forecasts
features:
  - {observed: '11335000', predicted: MHBC1}
reference_dates:
  minimum: 2022-12-01T11:00:00Z
  maximum: 2023-03-31T12:00:00Z
valid_dates:
  minimum: 2022-12-01T11:00:00Z
  maximum: 2023-04-09T12:00:00Z
reference_date_pools:
  period: 1
  frequency: 1
  unit: days
lead_times:
  minimum: 0
  maximum: 72
  unit: hours
time_scale:
  function: mean
  period: 24
  unit: hours
values:
  minimum: 0.0
probability_thresholds: 
  values: [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]
  operator: greater
  apply_to: observed
metrics:
  - name: sample size
  - name: mean error
  - name: box plot of errors
  - name: mean square error
  - name: brier skill score
ensemble_average: mean
duration_format: days
output_formats:
  - format: csv2
  - format: png
  - format: pairs
HankHerr-NOAA commented 3 months ago

For local, NWC access to data, evaluation declarations, and output, see the directory issue131828 in the standard location.

Hank

HankHerr-NOAA commented 3 months ago

My test was run using revision 20240627-b58855f-dev in a repo with the remote just changed to GitHub.

Hank

james-d-brown commented 2 months ago

( The underlying reason is the ingest of one continuous time series versus a very large number of very small (one-event) time series, but there is a question beneath that concerning why this difference in topology makes such a big difference to ingest time. There is some kind of ingest contention, probably related to source locking, but TBD. )
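( A back-of-the-envelope sketch of that hypothesis, not the actual ingest code: if each ingested time series pays a fixed per-series cost, e.g., acquiring a source lock, on top of a per-event cost, then splitting the same events across many one-event series multiplies the fixed cost by the series count. The numbers below are assumptions for illustration only:

# Hypothetical costs; both figures are assumptions, not measurements.
per_series_overhead_s = 0.05  # fixed cost per ingested series (e.g., source locking)
per_event_cost_s = 0.001      # cost per time-series event

events = 100_000

one_continuous_series = per_series_overhead_s + events * per_event_cost_s
many_one_event_series = events * (per_series_overhead_s + per_event_cost_s)

print(f"one continuous series:  {one_continuous_series:,.0f} s")   # ~100 s
print(f"{events:,} one-event series: {many_one_event_series:,.0f} s")  # ~5,100 s

Under these assumed costs the one-event topology is roughly 50x slower, the same order of magnitude as the ~28x difference between 1h 6m and 2m 21s reported above. )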