Closed PatriciaTanzer closed 6 years ago
Edit: This is in develop
Can you link the commits you're referring to?
https://github.com/UNCG-CSE/Library-Computer-Usage-Analysis/blob/develop/src/GSOWeather.ipynb
If that doesn't work, tell me and I'll try something else.
That's the last one I did. Do you want the whole list?
Edit: The first of these is here: https://github.com/UNCG-CSE/Library-Computer-Usage-Analysis/commit/7f1c57326189bcc87466613861a94959d1d18bf5 But I had a copy/paste bug in that one, which was finally eradicated in #14, which is also the link at the top.
OK, thanks, just wanted to know what version to diff before looking into this.
I created the branch 'clean-notebook' to help visualize the diffs.
OK, here's the diff I extracted from 7f1c573261..f2e52e3768, comments follow:
--- a.py 2017-09-07 19:20:09.001734000 -0400
+++ b.py 2017-09-07 19:20:16.465464000 -0400
@@ -11,25 +11,18 @@ import pandas as pd
# In[ ]:
-gsoDataAll = pd.read_csv(r'../data/1052640.csv',low_memory=False)
+#Defining variables. hourlyColumns may be altered later, but this is what we are using for now
# In[ ]:
-gsoDataAll.columns.values
-
-
-# This is just a smaller subset of the columns. Daily and Monthly rollups were ignored. Fahrenheit temps used instead of Celcius.
-
-# In[ ]:
-
hourlyColumns = ['DATE',
'REPORTTPYE',
'HOURLYSKYCONDITIONS',
'HOURLYVISIBILITY',
'HOURLYPRSENTWEATHERTYPE',
-'HOURLYDRYBULBTEMPF',
'HOURLYWETBULBTEMPF',
+'HOURLYDRYBULBTEMPF',
'HOURLYDewPointTempF',
'HOURLYRelativeHumidity',
'HOURLYWindSpeed',
@@ -45,19 +38,55 @@ hourlyColumns = ['DATE',
# In[ ]:
-def getWeatherData():
- return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns,low_memory = False)
+#Defining functions - all together so we can see them and know what we have to work with
+# without scrolling through entire program
+
+
+# In[ ]:
+
+def getAllWeatherData():
+ return pd.read_csv(r'../data/1052640.csv',low_memory=False)
+
+
+# In[ ]:
+
+def getHourlyWeatherData():
+ return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns, low_memory = False)
+
+
+# In[ ]:
+
+def displayWeatherData(array):
+ print array.columns.values
# In[ ]:
-gsoData = getWeatherData()
+#delaring variables from the functions - don't need to know exactly what is in them to use them
+gsoDataAll = getAllWeatherData()
+gsoDataHours = getHourlyWeatherData()
+
+
+# In[ ]:
+
+#How to use the display method
+displayWeatherData(gsoDataAll)
+displayWeatherData(gsoDataHours)
+
+
+# This is just a smaller subset of the columns. Daily and Monthly rollups were ignored. Fahrenheit temps used instead of Celcius.
+
+# In[ ]:
+
+
+gsoData = getHourlyWeatherData()
# Verifying the columns.
# In[ ]:
+
gsoData.info()
@@ -65,6 +94,7 @@ gsoData.info()
# In[ ]:
+
gsoData.rename(columns = {'REPORTTPYE':'REPORTTYPE'}, inplace=True)
@@ -72,7 +102,9 @@ gsoData.rename(columns = {'REPORTTPYE':'
# In[ ]:
+
gsoData[gsoData.REPORTTYPE == 'SOD']
+
# Dropping **S**tart **O**f **D**ay
Comments:
-def getWeatherData():
- return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns,low_memory = False)
+#Defining functions - all together so we can see them and know what we have to work with
+# without scrolling through entire program
+
+
+# In[ ]:
+
+def getAllWeatherData():
+ return pd.read_csv(r'../data/1052640.csv',low_memory=False)
+
+
+# In[ ]:
+
+def getHourlyWeatherData():
+ return pd.read_csv(r'../data/1052640.csv',usecols = hourlyColumns, low_memory = False)
The downside of doing this is that fetching all data and hourly data involves reading the data file twice. It might get cached, but I wouldn't count on it.
Why not just read all the data and filter it down to what we want? In this case just extract the columns of interest, e.g. weatherData[hourlyColumns]. Presumably we'll be doing lots of transformations on the data anyway, so having one canonical copy of the data resident in memory and deriving everything else from that just makes sense to me.
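A minimal sketch of that single-read approach, using a toy frame in place of the CSV (the real notebook would call pd.read_csv(r'../data/1052640.csv', low_memory=False) exactly once; the column names here just echo the diff above):

```python
import pandas as pd

# Toy frame standing in for the full CSV read.
weatherData = pd.DataFrame({
    'DATE': ['2017-01-01', '2017-01-02'],
    'HOURLYDRYBULBTEMPF': [55, 61],
    'DAILYMAXTEMP': [60, 65],  # a rollup column we don't need hourly
})

hourlyColumns = ['DATE', 'HOURLYDRYBULBTEMPF']

# Derive the subset from the in-memory frame instead of re-reading the file.
gsoDataHours = weatherData[hourlyColumns]
```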
+def displayWeatherData(array):
+ print array.columns.values
This seems to just print the column names. If that's the intent, I think the name is a bit misleading. Also, presumably this will work for any DataFrame, so maybe something like showColumns()?
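A sketch of what that rename might look like (showColumns is just the suggested name here, not anything that exists in the repo; note the print is Python 3 style, unlike the Python 2 print statement in the diff):

```python
import pandas as pd

def showColumns(df):
    # Works for any DataFrame, not just the weather data;
    # printing the column names is all the original function did.
    print(df.columns.values)

gsoData = pd.DataFrame({'DATE': [], 'HOURLYWindSpeed': []})
showColumns(gsoData)
```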
These changes do improve readability, but I think that working with an IPython notebook is itself the most severe constraint on readability. Unless I'm vastly underestimating the scope of this project, keeping everything in one file is going to be nightmarish. That being said, for experimenting with a data set (which I presume is precisely what this notebook is for), I think it's fine.
One last thought that I had is that this data needs to be cleaned up a bit before we can use it. I noticed some of the temperature values included non-numeric characters and such, which I believe get converted to NaN before plotting. For cases like that it may be better to just strip the non-numerics out.
I'd written a bit of code to do all that at some point--I'll see if I can dig it up...
One other point:
+gsoDataAll = getAllWeatherData()
+gsoDataHours = getHourlyWeatherData()
snip
+gsoData = getHourlyWeatherData()
we can probably do pd.to_numeric(DataFrame, errors='coerce').dropna(how='any') to get rid of any NaNs.
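For example, on a toy Series with made-up values (one caveat: pd.to_numeric operates on a Series, so for a whole DataFrame you'd apply it column-wise, e.g. df.apply(pd.to_numeric, errors='coerce')):

```python
import pandas as pd

# Sample temperature strings, including the kind of junk values
# mentioned below (e.g. '73s').
temps = pd.Series(['73', '73s', '68', '*'])

# errors='coerce' turns anything non-numeric into NaN ...
numeric = pd.to_numeric(temps, errors='coerce')

# ... and dropna removes those rows entirely.
cleaned = numeric.dropna()
```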
As for readability, my primary goal with both ./src/GSOWeather.ipynb and ./src/LibraryData.ipynb was to get a start on importing the data based upon the files given. I'm sure that the final product will be far more concise.
That said, I wouldn't want us to get too terribly bogged down with whether or not these files look good. It's merely to test functionality, and to see where things might break. For example, the ./src/LibraryData.ipynb file breaks when there aren't unique datestamps for a given machine name. That's no bueno.
One more thing: exporting to Python can also be done from within the notebook while it is running (File > Download as), and clearing the output can also be done from the menu (Cell > All Output > Clear). [screenshots omitted]
@brownworth the problem isn't with NaNs, those get filled over with DataFrame.ffill(). The problem is when you have a temperature value recorded as e.g. 73s. The question is: do we assume that this was supposed to be 73, or discount it as invalid data? NWS doesn't have any guidance on this particular issue in the documentation for this particular data set.
I would hesitate to use .ffill() unless absolutely necessary. It changes data, by replacing every np.NaN with the previous non-null value in the column. Since we are using samples without a uniform or periodic rate, I would advocate for dropping invalid data. That said, if we end up resampling the data to align with the library data, we may end up using .ffill() anyway.
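The difference between the two options in a toy example (values made up; neither is the project's actual cleanup code):

```python
import numpy as np
import pandas as pd

temps = pd.Series([70.0, np.nan, 64.0, np.nan])

# ffill fabricates values: each NaN becomes the previous valid sample.
filled = temps.ffill()

# dropna discards the invalid samples instead of inventing values.
dropped = temps.dropna()
```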
I've just updated the ./src/GSOWeather.ipynb in the 'brown' branch with my take on using pd.to_numeric(errors='coerce'). I left the output for the necessary cells intact.
It may be irrelevant, as I'm not sure whether or not we should be using dry bulb or wet bulb temp.
Since we are using samples without a uniform or periodic rate, I would advocate for dropping invalid data. That said, if we end up resampling the data to align with the library data, we may end up using .ffill() anyway.
Or some other interpolation method. For this use case at least linear would likely be appropriate.
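A sketch of that resample-then-interpolate idea on hypothetical irregular timestamps (none of this is in the repo; it just illustrates time-weighted linear interpolation instead of forward-fill):

```python
import pandas as pd

# Hypothetical irregularly spaced samples indexed by timestamp.
idx = pd.to_datetime(['2017-01-01 00:00', '2017-01-01 00:50', '2017-01-01 02:10'])
temps = pd.Series([60.0, 62.0, 66.0], index=idx)

# Resample onto a uniform hourly grid (the 01:00 bin is empty -> NaN),
# then fill gaps by time-weighted linear interpolation rather than
# copying the last value forward.
hourly = temps.resample('h').mean().interpolate(method='time')
```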
Going to create more specific issues as milestones as suggested in class. Recommend that we put group questions here.
In the interest of making reading easier, I have a suggestion.
I've altered the first part of the GSOWeather.ipynb notebook in the manner I'm suggesting so you can see. In this case I've made a display method which takes an array as an argument, which can be fetched by either of our getWeather methods (one says All, and another says Hourly). The rest of the notebook can be easily altered in a similar manner. This is just to make it easier to maintain in the future. Thoughts?