UNCG-CSE / Library-Computer-Usage-Analysis

The University Libraries at UNCG currently track the state of each computer, determining whether or not it is in use. This data is compiled into a database, and a web app pulls from this database to show a map and the number of available computers. As of Fall 2017, the data had not been used to determine which computers are used more frequently, aside from counting the number of times a computer transitions into or out of the 'in-use' state. This project attempts to correlate the usage of these computers with various factors, including campus scheduling, equipment configuration, placement, population in the library, and area weather. Using this data, it also applies machine learning to determine the best placement of computers for future allocation and possible reconfiguration of equipment and space.

Interpret weather codes in Weather Data #21

Closed: brownworth closed this issue 6 years ago

brownworth commented 6 years ago

This will require converting strings under the Hourly Sky Conditions column that look like this: `SCT:04 14 OVC:08 38,2.50,-RA:02 BR:1`

PatriciaTanzer commented 6 years ago

What does each point stand for?

brownworth commented 6 years ago

These codes are detailed in https://en.wikipedia.org/wiki/METAR#Cloud_reporting.

There may be other codes that will require similar formatting. This may require the regex library to parse out all of the codes.
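For reference, a minimal sketch of what that regex pass might look like; the pattern, function name, and tuple layout here are assumptions for illustration, not anything already in the repo:

```python
import re

# Hypothetical pattern: a coverage code (SCT, BKN, OVC, ...), a colon-separated
# oktas value, and an optional height group, e.g. "SCT:04 14" or "OVC:08 38".
SKY_CODE = re.compile(r'([A-Z]{2,3}):(\d{2})(?:\s+(\d+))?')

def parse_sky_conditions(raw):
    """Return (coverage, oktas, height) tuples found in a raw sky-condition string."""
    return [(cov, oktas, height or None)
            for cov, oktas, height in SKY_CODE.findall(raw)]

print(parse_sky_conditions("SCT:04 14 OVC:08 38"))
# [('SCT', '04', '14'), ('OVC', '08', '38')]
```

Codes without the two-digit oktas value (like the `BR:1` fragment above) would need a looser pattern, so treat this as a starting point only.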

brownworth commented 6 years ago

Doing some more research: according to this link, https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/LCD_documentation.pdf, there are single-letter codes embedded in some of the data. In places where we are seeing an "s" after a number (e.g. 0.32s), it may mean that the data is suspect. If this is the case, we can either leave it as the number, convert it to `np.nan`, or, as @smindinvern suggested, use a forward fill (`.ffill()`) to interpolate. I'm ok doing any of the above, but I would like to be consistent for documentation purposes.
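To make the options concrete, here is a rough pandas sketch; the column name and values are made up, and the real LCD field names may differ:

```python
import pandas as pd

# Made-up hourly precipitation column with two suspect ("s"-flagged) readings.
df = pd.DataFrame({'HourlyPrecipitation': ['0.12', '0.32s', '0.05', '0.18s']})

# Option 1: strip the trailing "s" and keep the value as a plain number.
as_number = pd.to_numeric(df['HourlyPrecipitation'].str.rstrip('s'))

# Option 2: treat the suspect readings as missing.
suspect = df['HourlyPrecipitation'].str.endswith('s')
as_nan = as_number.mask(suspect)

# Option 3: forward-fill the missing values from the previous observation.
filled = as_nan.ffill()
```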

smindinvern commented 6 years ago

There's an even more extreme option, I guess, which is that we could discard each row of data that has a suspect datum in it, equivalent to setting it to `np.nan` and then doing e.g. `df.dropna(axis='index', how='any')`. This would have the advantage of ensuring that we don't make inferences based on potentially bad data, but it would cut out a chunk of our dataset.
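As a sketch of that more extreme option (again with a made-up column name; the `.str.endswith` check assumes the columns are still strings at this point):

```python
import pandas as pd

# Made-up data: drop any row that carries a suspect ("s"-flagged) value.
df = pd.DataFrame({'HourlyPrecipitation': ['0.12', '0.32s', '0.05']})
is_suspect = df.apply(lambda col: col.str.endswith('s', na=False))
df = df.mask(is_suspect).dropna(axis='index', how='any')
```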

brownworth commented 6 years ago

I can go either way. If the suspect data shows some significant outliers, then I would recommend dropping them. If not, keeping the non-interpolated, but still official, data would give us authenticity we wouldn't have to explain away later.

smindinvern commented 6 years ago

It looks like the original task at issue here has been completed by @mtellis2 and merged into master. Can this issue be closed, then? Do we want to create a new issue for discussing what to do with the suspect data entries?

mtellis2 commented 6 years ago

@smindinvern Yeah, that sounds good; I meant to close this issue. Also, creating a new issue for the suspect data entries would be great.