DOI-USGS / dataretrieval-python

Python package for retrieving water data from USGS or the multi-agency Water Quality Portal
https://doi-usgs.github.io/dataretrieval-python/

missing datetime for gwlevel service pulls #82

Closed — toanderod closed this issue 1 year ago

toanderod commented 1 year ago

Groundwater level data retrievals result in missing datetimes for measurements that have no time associated with the date of measurement. For me, the result was that only a small fraction of the groundwater levels available from NWIS were usable from dataretrieval.

ex. for the following retrieval, 1440 of the 4473 groundwater level measurements are missing datetimes: gw_df = nwis.get_record(statecd='sd',service='gwlevels',localAquiferCd='SD:331MDSN', start='1900-01-01',access='3')
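You can confirm the scale of the problem in a retrieved dataframe by counting rows whose time column is empty. A minimal sketch, using a hypothetical miniature dataframe with the NWIS `lev_dt`/`lev_tm` column names rather than a live query:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a gwlevels retrieval: some rows have a
# measurement date (lev_dt) but no recorded time (lev_tm is NaN).
gw_df = pd.DataFrame({
    "lev_dt": ["1956-09-19", "1957-05-29", "1975-09-26"],
    "lev_tm": [np.nan, "09:00", np.nan],
})

# Count measurements with no time component.
n_missing = int(gw_df["lev_tm"].isna().sum())
print(f"{n_missing} of {len(gw_df)} measurements have no recorded time")
```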

elbeejay commented 1 year ago

Thanks for raising this issue @toanderod. It looks like the date information is still present in the dataframe, but it is not being carried into the datetime column because the time information (as you've noted) is missing; specifically, it is returned as a NaN value. This causes problems when we process the datetime information into the datetime index, as the NaN gets turned into a NaT (Not a Time) value.

Will investigate.
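The NaN-to-NaT behavior described above can be reproduced with plain pandas. This is a sketch with made-up data, not dataretrieval's actual parsing code; it only illustrates what happens when a missing time propagates into the combined datetime:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the gwlevels response: a date column plus a
# time column where one measurement has no recorded time (NaN).
df = pd.DataFrame({
    "lev_dt": ["1956-09-19", "1957-05-29"],
    "lev_tm": ["10:30", np.nan],
})

# String concatenation propagates the NaN, so pd.to_datetime with
# errors="coerce" turns the incomplete row into NaT.
combined = pd.to_datetime(df["lev_dt"] + " " + df["lev_tm"], errors="coerce")
print(combined)  # second entry is NaT because the time was missing
```

A NaT in the index is what makes rows with incomplete datetimes effectively disappear from datetime-indexed queries.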

elbeejay commented 1 year ago

Upon investigation, the function is working as intended for sites with incomplete datetime information. When I run the command you provided as an example @toanderod:

df=nwis.get_record(statecd='sd',service='gwlevels',localAquiferCd='SD:331MDSN', start='1900-01-01',access='3')

I get the following warning:

UserWarning: Warning: 1440 incomplete dates found, consider setting datetime_index to False.

So if you follow the suggestion made in the warning, your query becomes the following:

df=nwis.get_record(statecd='sd',service='gwlevels',localAquiferCd='SD:331MDSN', start='1900-01-01',access='3',datetime_index=False)

And as expected, this switches the indexing for the dataframe and uses integers instead:

>>> df.head()
  agency_cd          site_no site_tp_cd      lev_dt lev_tm lev_tz_cd  lev_va  sl_lev_va  sl_datum_cd lev_status_cd lev_agency_cd lev_dt_acy_cd  lev_acy_cd lev_src_cd lev_meth_cd lev_age_cd
0      USGS  431232103513501         GW  1956-09-19    NaN       UTC   -20.0        NaN          NaN             1           NaN             D         NaN        NaN           O          A
1      USGS  431753103492601         GW  1956-09-19    NaN       UTC  -204.0        NaN          NaN             1           NaN             D         NaN        NaN           G          A
2      USGS  431753103492601         GW  1957-05-29    NaN       UTC  -206.0        NaN          NaN             1           NaN             D         NaN        NaN           G          A
3      USGS  431808103492001         GW  1956-09-19    NaN       UTC  -198.0        NaN          NaN             1           NaN             D         NaN        NaN           G          A
4      USGS  431810103491001         GW  1975-09-26    NaN       UTC  -219.0        NaN          NaN             1           NaN             D         NaN        NaN           G          A

We are receptive to alternative solutions and suggestions for handling this scenario, but as you can see, this situation has been considered. The reasoning, as I understand it, is that we do not want to make assumptions about the time a measurement was collected. Our datetime column requires enough information to populate a field in the format YYYY-MM-DD HH:MM:SS+HH:MM, and when there is no time information we would be forced to guess or assume the time the measurement was taken. Rather than do that, we keep the original columns in the dataframe and give the user the option to index the data by integer in the order it was returned by the webservice.
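If a user is comfortable making the midnight assumption themselves, they can rebuild a datetime column after retrieving with datetime_index=False. This is a user-side post-processing sketch on hypothetical data (it deliberately bakes in the 00:00:00 assumption that the package avoids):

```python
import numpy as np
import pandas as pd

# Hypothetical gwlevels-style dataframe retrieved with datetime_index=False.
df = pd.DataFrame({
    "lev_dt": [np.nan and "1956-09-19" or "1956-09-19", "1957-05-29"],
    "lev_tm": [np.nan, "14:15"],
    "lev_va": [-20.0, -206.0],
})

# Date-only parse: rows without a time default to midnight instead of NaT.
# NOTE: this is the user's assumption, not something NWIS reported.
df["datetime"] = pd.to_datetime(df["lev_dt"])

# Where a time is present, refine the timestamp with it.
has_time = df["lev_tm"].notna()
df.loc[has_time, "datetime"] = pd.to_datetime(
    df.loc[has_time, "lev_dt"] + " " + df.loc[has_time, "lev_tm"]
)
```

With this approach no rows are dropped: every measurement keeps a usable timestamp, at the cost of an explicit assumption for the time-less rows.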

Hope that helps, and again, we are happy to hear alternative ideas or suggestions.