Closed PatriciaTanzer closed 6 years ago
Edit: This is in develop
Can you link the commits you're referring to?
https://github.com/UNCG-CSE/Library-Computer-Usage-Analysis/blob/develop/src/GSOWeather.ipynb
If that doesn't work, tell me and I'll try something else.
That's the last one I did. Do you want the whole list?
Edit: The first of these is here: https://github.com/UNCG-CSE/Library-Computer-Usage-Analysis/commit/7f1c57326189bcc87466613861a94959d1d18bf5 But I had a copy/paste bug in that one, which was finally eradicated in #14, which is also the link at the top.
OK, thanks, just wanted to know what version to diff before looking into this.
I created the branch 'clean-notebook' to help visualize the diffs.
OK, here's the diff I extracted from 7f1c573261..f2e52e3768, comments follow:
--- a.py 2017-09-07 19:20:09.001734000 -0400
+++ b.py 2017-09-07 19:20:16.465464000 -0400
@@ -11,25 +11,18 @@ import pandas as pd
# In[ ]:
-gsoDataAll = pd.read_csv(r'../data/1052640.csv',low_memory=False)
+#Defining variables. hourlyColumns may be altered later, but this is what we are using for now
# In[ ]:
-gsoDataAll.columns.values
-
-
-# This is just a smaller subset of the columns. Daily and Monthly rollups were ignored. Fahrenheit temps used instead of Celcius.
-
-# In[ ]:
-
hourlyColumns = ['DATE',
'REPORTTPYE',
'HOURLYSKYCONDITIONS',
'HOURLYVISIBILITY',
'HOURLYPRSENTWEATHERTYPE',
-'HOURLYDRYBULBTEMPF',
'HOURLYWETBULBTEMPF',
+'HOURLYDRYBULBTEMPF',
'HOURLYDewPointTempF',
'HOURLYRelativeHumidity',
'HOURLYWindSpeed',
@@ -45,19 +38,55 @@ hourlyColumns = ['DATE',
# In[ ]:
-def getWeatherData():
- return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns,low_memory = False)
+#Defining functions - all together so we can see them and know what we have to work with
+# without scrolling through entire program
+
+
+# In[ ]:
+
+def getAllWeatherData():
+ return pd.read_csv(r'../data/1052640.csv',low_memory=False)
+
+
+# In[ ]:
+
+def getHourlyWeatherData():
+ return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns, low_memory = False)
+
+
+# In[ ]:
+
+def displayWeatherData(array):
+ print array.columns.values
# In[ ]:
-gsoData = getWeatherData()
+#delaring variables from the functions - don't need to know exactly what is in them to use them
+gsoDataAll = getAllWeatherData()
+gsoDataHours = getHourlyWeatherData()
+
+
+# In[ ]:
+
+#How to use the display method
+displayWeatherData(gsoDataAll)
+displayWeatherData(gsoDataHours)
+
+
+# This is just a smaller subset of the columns. Daily and Monthly rollups were ignored. Fahrenheit temps used instead of Celcius.
+
+# In[ ]:
+
+
+gsoData = getHourlyWeatherData()
# Verifying the columns.
# In[ ]:
+
gsoData.info()
@@ -65,6 +94,7 @@ gsoData.info()
# In[ ]:
+
gsoData.rename(columns = {'REPORTTPYE':'REPORTTYPE'}, inplace=True)
@@ -72,7 +102,9 @@ gsoData.rename(columns = {'REPORTTPYE':'
# In[ ]:
+
gsoData[gsoData.REPORTTYPE == 'SOD']
+
# Dropping **S**tart **O**f **D**ay
Comments:
-def getWeatherData():
- return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns,low_memory = False)
+#Defining functions - all together so we can see them and know what we have to work with
+# without scrolling through entire program
+
+
+# In[ ]:
+
+def getAllWeatherData():
+ return pd.read_csv(r'../data/1052640.csv',low_memory=False)
+
+
+# In[ ]:
+
+def getHourlyWeatherData():
+ return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns, low_memory = False)
The downside of doing this is that fetching all data and hourly data involves reading the data file twice. It might get cached, but I wouldn't count on it.
Why not just read all the data and filter it down to what we want? In this case just extract the columns of interest, e.g. weatherData[hourlyColumns]. Presumably we'll be doing lots of transformations on the data anyway, so having one canonical copy of the data resident in memory and deriving everything else from that just makes sense to me.
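A minimal sketch of that single-read approach, using a toy frame in place of the CSV (the real notebook would call pd.read_csv(r'../data/1052640.csv', low_memory=False) exactly once; the column names here just echo the diff above):

```python
import pandas as pd

# Toy frame standing in for the full CSV read.
weatherData = pd.DataFrame({
    'DATE': ['2017-01-01', '2017-01-02'],
    'HOURLYDRYBULBTEMPF': [55, 61],
    'DAILYMAXTEMP': [60, 65],  # a rollup column we don't need hourly
})

hourlyColumns = ['DATE', 'HOURLYDRYBULBTEMPF']

# Derive the subset from the in-memory frame instead of re-reading the file.
gsoDataHours = weatherData[hourlyColumns]
```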
+def displayWeatherData(array):
+ print array.columns.values
This seems to just print the column names. If that's the intent, I think the name is a bit misleading. Also, presumably this will work for any DataFrame, so maybe something like showColumns()?
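A sketch of what that rename might look like (showColumns is just the suggested name here, not anything that exists in the repo; note the print is Python 3 style, unlike the Python 2 print statement in the diff):

```python
import pandas as pd

def showColumns(df):
    # Works for any DataFrame, not just the weather data;
    # printing the column names is all the original function did.
    print(df.columns.values)

gsoData = pd.DataFrame({'DATE': [], 'HOURLYWindSpeed': []})
showColumns(gsoData)
```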
These changes do improve readability, but I think that working with an IPython notebook is itself the most severe constraint on readability. Unless I'm vastly underestimating the scope of this project, keeping everything in one file is going to be nightmarish. That being said, for experimenting with a data set (which I presume is precisely what this notebook is for), I think it's fine.
One last thought that I had is that this data needs to be cleaned up a bit before we can use it. I noticed some of the temperature values included non-numeric characters and such, which I believe get converted to NaN before plotting. For cases like that it may be better to just strip the non-numerics out.
I'd written a bit of code to do all that at some point--I'll see if I can dig it up...
One other point:
+gsoDataAll = getAllWeatherData()
+gsoDataHours = getHourlyWeatherData()
snip
+gsoData = getHourlyWeatherData()
we can probably do pd.to_numeric(DataFrame, errors='coerce').dropna(how='any') to get rid of any NaNs.
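For example, on a toy Series with made-up values (one caveat: pd.to_numeric operates on a Series, so for a whole DataFrame you'd apply it column-wise, e.g. df.apply(pd.to_numeric, errors='coerce')):

```python
import pandas as pd

# Sample temperature strings, including the kind of junk values
# mentioned below (e.g. '73s').
temps = pd.Series(['73', '73s', '68', '*'])

# errors='coerce' turns anything non-numeric into NaN ...
numeric = pd.to_numeric(temps, errors='coerce')

# ... and dropna removes those rows entirely.
cleaned = numeric.dropna()
```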
As for readability, my primary goal with both ./src/GSOWeather.ipynb and ./src/LibraryData.ipynb was to get a start on importing the data based upon the files given. I'm sure that the final product will be far more concise.
That said, I wouldn't want us to get too terribly bogged down with whether or not these files look good. It's merely to test functionality, and to see where things might break. For example, the ./src/LibraryData.ipynb file breaks when there aren't unique datestamps for a given machine name. That's no bueno.
One more thing: exporting to Python can also be done from within the notebook while it is running (File > Download as), and clearing the output can also be done from the menu (Cell > All Output > Clear). [screenshots omitted]
@brownworth the problem isn't with NaNs, those get filled over with DataFrame.ffill(). The problem is when you have a temperature value recorded as e.g. 73s. The question is: do we assume that this was supposed to be 73, or discount it as invalid data? NWS doesn't have any guidance on this particular issue in the documentation for this particular data set.
I would hesitate to use .ffill() unless absolutely necessary. It changes data, by replacing every np.NaN with the previous non-null value in the column. Since we are using samples without a uniform or periodic rate, I would advocate for dropping invalid data. That said, if we end up resampling the data to align with the library data, we may end up using .ffill() anyway.
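The difference between the two options in a toy example (values made up; neither is the project's actual cleanup code):

```python
import numpy as np
import pandas as pd

temps = pd.Series([70.0, np.nan, 64.0, np.nan])

# ffill fabricates values: each NaN becomes the previous valid sample.
filled = temps.ffill()

# dropna discards the invalid samples instead of inventing values.
dropped = temps.dropna()
```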
I've just updated the ./src/GSOWeather.ipynb in the 'brown' branch with my take on using pd.to_numeric(errors='coerce'). I left the output for the necessary cells intact.
It may be irrelevant, as I'm not sure whether or not we should be using dry bulb or wet bulb temp.
Since we are using samples without a uniform or periodic rate, I would advocate for dropping invalid data. That said, if we end up resampling the data to align with the library data, we may end up using .ffill() anyway.
Or some other interpolation method. For this use case at least linear would likely be appropriate.
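A sketch of that resample-then-interpolate idea on hypothetical irregular timestamps (none of this is in the repo; it just illustrates time-weighted linear interpolation instead of forward-fill):

```python
import pandas as pd

# Hypothetical irregularly spaced samples indexed by timestamp.
idx = pd.to_datetime(['2017-01-01 00:00', '2017-01-01 00:50', '2017-01-01 02:10'])
temps = pd.Series([60.0, 62.0, 66.0], index=idx)

# Resample onto a uniform hourly grid (the 01:00 bin is empty -> NaN),
# then fill gaps by time-weighted linear interpolation rather than
# copying the last value forward.
hourly = temps.resample('h').mean().interpolate(method='time')
```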
Going to create more specific issues as milestones as suggested in class. Recommend that we put group questions here.
In the interest of making reading easier, I have a suggestion.
I've altered the first part of the GSOWeather.ipynb notebook in the manner I'm suggesting so you can see. In this case I've made a display method which takes an array as an argument, which can be fetched by either of our getWeather methods (one says All, and another says Hourly). The rest of the notebook can be easily altered in a similar manner. This is just to make it easier to maintain in the future. Thoughts?