cmiller8 closed this issue 4 years ago
Is there a notebook in this repository that converts between `raw` and `cleaned`? What do we use the `cleaned` data for again?
If it's in the internal-use-only GitHub repository, then let's remove the raw data set so this repository gets a little smaller.
It looks like this notebook converts the raw files into one big file. We might remove that big file -- users can create that file on their own machines using the notebooks. The size of the repository could be reduced quite a bit I think...
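For reference, a combining step like that can be sketched in a few lines of pandas; the directory layout, file format, and the `combine_raw_files` helper name below are assumptions for illustration, not the repository's actual notebook code:

```python
from pathlib import Path

import pandas as pd


def combine_raw_files(raw_dir: str) -> pd.DataFrame:
    """Concatenate every per-meter CSV in raw_dir into one wide DataFrame."""
    frames = []
    for path in sorted(Path(raw_dir).glob("*.csv")):
        # Assumes each file has a timestamp index column and one or more
        # meter-reading columns (hypothetical layout).
        frames.append(pd.read_csv(path, index_col=0, parse_dates=True))
    # Align on the shared timestamp index; columns become building/meter ids.
    return pd.concat(frames, axis=1)
```

Since the big file is just the deterministic output of this step, keeping only the raw per-meter files plus the notebook lets forks regenerate it locally.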
I think I found a clue about the `cleaned` data folder - it looks like you used to have these files excluded:
And it looks like you use the cleaned files in your model prototypes. It would be good to keep the code that converts the raw files to `cleaned`, but not include those cleaned files or the combined file in the repository. People who fork the repository can create those aggregated files on their own with the notebook. This could reduce the size of the repository significantly, I think.
Is repository space an issue? I think we should include the cleaned aggregate data. A lot of users will be happy to go with the preprocessing and cleaning we have employed and dive straight into the data. I'm not sure it makes sense for users to have to go through the processing and cleaning steps themselves (if this is not something they are interested in).
I have just uploaded those cleaned data sets, and I will write up the differences in the documentation.
The `processed` folder will be empty, and I think we could keep only one of the folders, either `screening` or `cleaning`.
I think we should try to get the GitHub repository as small as possible. Please include the code to convert the raw data to the `cleaned` and `screening` data sets, but don't include those data sets in the repository. People can create them from the raw data if they'd like and then perform the other functions; we just need to document that.
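One simple way to ship the conversion code while keeping the generated data sets out of the repository is a `.gitignore` entry; the folder paths below are assumptions based on the folder names mentioned in this thread:

```gitignore
# Generated data sets -- recreate locally with the preprocessing notebooks
cleaned/
screening/
processed/
```

This also prevents contributors from accidentally re-committing the large files after running the notebooks.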
Let's instead put the cleaned files on the Kaggle data page -- if people don't want to clean the raw data from the GitHub repository, we can guide them to that page (yet to be made).
I'm removing the `cleaned` and `screening` data sets from the repository.
Let's try to get this process documented clearly and then we can close this issue.
I will create the Kaggle data page and post it on the wiki and the readme.md
This wiki page explains what each folder contains. We can link it to the Kaggle data page once that page is created.
As discussed today in the call:
- Create a `raw` version of the data that is most similar to the Kaggle competition data, and also a `cleaned` version that has all the conversions, outlier removal, and removal of buildings with too little data.
- Only have one version of `weather`.
As we concluded in our meeting, these will be the data sets in this repository:

- `raw` - the 2016-2017 meter readings of 1,636 buildings. All energy meters are in kWh and all volume meters are in litres. This is the main data set, and the one the paper focuses on.
- `kaggle` - the public leaderboard data set used for the competition, i.e. the 2017 data of the Kaggle buildings. (Issue #17)
- `cleaned` - a heavily cleaned data set. Outliers, runs of zero readings longer than 24 continuous hours, zero readings in electricity... all data that looks anomalous is removed (or replaced by NaN). This is a pretty aggressive strategy, but if it is too much cleaning, we can always go back to `raw`. This data set is for trying out predictive models.

The data sets were created and are in the repository, and the wiki was updated to reflect these changes.
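As a rough illustration, the zero-run and outlier rules described for `cleaned` could look something like the sketch below, assuming one hourly pandas Series per meter; the thresholds and the `clean_meter` helper name are illustrative, not the exact rules in the repository's notebooks:

```python
import numpy as np
import pandas as pd


def clean_meter(series: pd.Series,
                max_zero_hours: int = 24,
                z_thresh: float = 3.0) -> pd.Series:
    """Aggressively clean one hourly meter series, replacing bad data with NaN."""
    s = series.astype(float).copy()
    # 1. Replace runs of consecutive zero readings longer than
    #    max_zero_hours with NaN (shorter zero runs are kept).
    is_zero = s.eq(0)
    run_id = (~is_zero).cumsum()                        # constant within a zero run
    run_len = is_zero.groupby(run_id).transform("sum")  # length of each zero run
    s[is_zero & (run_len > max_zero_hours)] = np.nan
    # 2. Replace extreme high outliers (z-score above z_thresh) with NaN.
    z = (s - s.mean()) / s.std()
    s[z > z_thresh] = np.nan
    return s
```

Replacing values with NaN rather than dropping rows keeps the index intact, so downstream users can still choose between interpolating, dropping, or imputing the gaps.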
It would be good to add to the documentation in the wiki the differences between the `raw`, `cleaned` and `processed` folders.