Closed brownworth closed 6 years ago
What does each point stand for?
These codes are detailed in https://en.wikipedia.org/wiki/METAR#Cloud_reporting.
There may be other codes that will require similar formatting. This may require the regex library to parse out all of the codes.
Doing some more research, according to this link: https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/LCD_documentation.pdf
there are single-letter codes embedded in some of the data. In places where we see an "s" after a number (e.g. 0.32s), it may mean that the value is suspect. If this is the case, we can leave it as the number; convert it to np.nan; or, as @smindinvern suggested, use a forward fill (.ffill()) to fill in the gap. I'm OK with any of the above, but I would like to be consistent for documentation purposes.
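For comparison, here is a rough sketch of the three options side by side. The sample values and the column name "precip" are made up for illustration, and the regex assumes a suspect value is a plain decimal followed by a single trailing "s"; other flags in the real data would need their own handling.

```python
import pandas as pd

# Hypothetical sample values, not taken from the real dataset.
raw = pd.Series(["0.10", "0.32s", "0.05"], name="precip")

# Split each entry into its numeric part and an optional trailing "s" flag.
parts = raw.str.extract(r"^(?P<value>\d*\.?\d+)(?P<flag>s?)$")
value = parts["value"].astype(float)
suspect = parts["flag"] == "s"

keep_number = value              # option 1: keep the reported number
as_nan = value.mask(suspect)     # option 2: replace suspect values with NaN
forward_filled = as_nan.ffill()  # option 3: carry the last good value forward
```

Keeping the `suspect` mask around as its own column would let us switch between the options later without re-parsing the raw strings.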
There's an even more extreme option, I guess: we could discard every row that contains a suspect datum, which is equivalent to setting those values to np.nan and then calling e.g. df.dropna(axis='index', how='any'). This would ensure that we don't make inferences based on potentially bad data, but it would cut out a chunk of our dataset.
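A small sketch of what that would look like, and of how to check how much data it costs. The frame below is invented for illustration; the column names are not the real ones.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which the suspect value has already been set to NaN.
df = pd.DataFrame({
    "temp": [51.0, 52.0, 50.0],
    "precip": [0.10, np.nan, 0.05],
})

# Drop every row that has at least one missing value.
cleaned = df.dropna(axis="index", how="any")

# Fraction of rows lost, to judge whether the cut is acceptable.
fraction_dropped = 1 - len(cleaned) / len(df)
```

Note that how='any' also drops rows that are missing values for unrelated reasons, so the fraction lost on the real data could be larger than the number of suspect entries alone suggests.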
I can go either way. If the suspect data shows some significant outliers, then I would recommend dropping them. If not, having the non-interpolated, but still official, data would give authenticity that we wouldn't have to explain away later.
It looks like the original task in this issue has been completed by @mtellis2 and merged into master. Can this issue be closed, then? Do we want to create a new issue for discussing what to do with the suspect data entries?
@smindinvern Yeah, that sounds good; I meant to close this issue. Creating a new issue for the suspect data entries would also be great.
This will require converting strings under the Hourly Sky Conditions that look like this:
SCT:04 14 OVC:08 38,2.50,-RA:02 BR:1
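As a starting point, here is a minimal regex sketch for the sky-conditions part. It assumes the commas in the sample separate CSV fields and that only the first field belongs to the sky-conditions column. It also assumes each layer is a cover code, a two-digit amount in oktas, and an optional cloud-base height; that is my reading of the LCD documentation and should be checked against it. The function name and tuple layout are made up for this example.

```python
import re

# One cloud layer: cover code, two-digit amount, optional height.
LAYER = re.compile(
    r"(?P<cover>[A-Z]{2,3}):(?P<oktas>\d{2})(?: (?P<height>\d+))?"
)

def parse_sky_conditions(text):
    """Return a list of (cover, oktas, height) tuples; height may be None."""
    layers = []
    for m in LAYER.finditer(text):
        height = int(m["height"]) if m["height"] is not None else None
        layers.append((m["cover"], int(m["oktas"]), height))
    return layers

sample = "SCT:04 14 OVC:08 38,2.50,-RA:02 BR:1"
sky_field = sample.split(",")[0]  # keep only the assumed sky-conditions field
layers = parse_sky_conditions(sky_field)
```

The function should only be given the sky-conditions field: weather codes such as RA:02 have the same CODE:NN shape and would otherwise be picked up as cloud layers.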