cmiller8 closed this issue 4 years ago
Is there a notebook in this repository that converts between `raw` and `cleaned`? What do we use the `cleaned` data for again?
If it's in the internal-use-only GitHub repository, then let's remove the raw data set so this repository gets a little smaller.
It looks like this notebook converts the raw files into one big file. We might remove that big file -- users can create that file on their own machines using the notebooks. The size of the repository could be reduced quite a bit I think...
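For reference, a combining step like that can be sketched in a few lines of pandas; the directory layout, file format, and the `combine_raw_files` helper name below are assumptions for illustration, not the repository's actual notebook code:

```python
from pathlib import Path

import pandas as pd


def combine_raw_files(raw_dir: str) -> pd.DataFrame:
    """Concatenate every per-meter CSV in raw_dir into one wide DataFrame."""
    frames = []
    for path in sorted(Path(raw_dir).glob("*.csv")):
        # Assumes each file has a timestamp index column and one or more
        # meter-reading columns (hypothetical layout).
        frames.append(pd.read_csv(path, index_col=0, parse_dates=True))
    # Align on the shared timestamp index; columns become building/meter ids.
    return pd.concat(frames, axis=1)
```

Since the big file is just the deterministic output of this step, keeping only the raw per-meter files plus the notebook lets forks regenerate it locally.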
I think I found a clue about the `cleaned` data folder - it looks like you used to have these files excluded:
And it looks like you use the cleaned files in your model prototypes. It would be good to keep the code that converts the raw files to `cleaned`, but not include those cleaned files or the combined file in the repository. People who fork the repository can create those aggregated files on their own with the notebook. This could reduce the size of the repository significantly, I think.
Is repository space an issue? I think we should include the cleaned aggregate data. A lot of users will be happy to go with the preprocessing and cleaning we have employed and dive straight into the data. I'm not sure it makes sense for users to have to go through the processing and cleaning steps themselves (if this is not something they are interested in).
I have just uploaded those cleaned data sets, and I will write up the differences in the documentation.
The `processed` folder will be empty, and I think we could keep only one of the folders, either `screening` or `cleaning`.
I think we should try to get the GitHub repository as small as possible. Please include the code to convert the raw data to the `cleaned` and `screening` data sets, but don't include those data sets in the repository. People can create them from the raw data if they'd like and then perform the other functions; we just need to document that.
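One simple way to ship the conversion code while keeping the generated data sets out of the repository is a `.gitignore` entry; the folder paths below are assumptions based on the folder names mentioned in this thread:

```gitignore
# Generated data sets -- recreate locally with the preprocessing notebooks
cleaned/
screening/
processed/
```

This also prevents contributors from accidentally re-committing the large files after running the notebooks.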
Let's instead put the cleaned files on the Kaggle data page -- if people don't want to clean the raw data from the GitHub repository, we can guide them to that page (yet to be made).
I'm removing the `cleaned` and `screening` data sets from the repository.
Let's try to get this process documented clearly and then we can close this issue.
I will create the Kaggle data page and post it on the wiki and the readme.md
This wiki page explains what each folder contains. We can link it to the Kaggle data page once that page is created.
As discussed today in the call:
- Create a `raw` version of the data that is most similar to the Kaggle competition data, and also a `cleaned` version that has all the conversions, outlier removal, and removal of buildings with too little data.
- Only have one version of `weather`.
As we concluded in our meeting, these will be the data sets in this repository:

- `raw` - the 2016-2017 meter readings of 1,636 buildings. All energy meters are in kWh and all volume meters are in litres. This is the main data set, and the one the paper focuses on.
- `kaggle` - the public leaderboard data set used for the competition, i.e. the 2017 data of the Kaggle buildings. (Issue #17)
- `cleaned` - a heavily cleaned data set. Outliers, runs of zero readings longer than 24 continuous hours, zero readings in electricity... all data that looks anomalous is removed (or replaced by NaN). This is a pretty aggressive strategy, but if it is too much cleaning, we can always go back to `raw`. This data set is for trying out predictive models.

The data sets were created and are in the repository, and the wiki was updated to reflect these changes.
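As a rough illustration, the zero-run and outlier rules described for `cleaned` could look something like the sketch below, assuming one hourly pandas Series per meter; the thresholds and the `clean_meter` helper name are illustrative, not the exact rules in the repository's notebooks:

```python
import numpy as np
import pandas as pd


def clean_meter(series: pd.Series,
                max_zero_hours: int = 24,
                z_thresh: float = 3.0) -> pd.Series:
    """Aggressively clean one hourly meter series, replacing bad data with NaN."""
    s = series.astype(float).copy()
    # 1. Replace runs of consecutive zero readings longer than
    #    max_zero_hours with NaN (shorter zero runs are kept).
    is_zero = s.eq(0)
    run_id = (~is_zero).cumsum()                        # constant within a zero run
    run_len = is_zero.groupby(run_id).transform("sum")  # length of each zero run
    s[is_zero & (run_len > max_zero_hours)] = np.nan
    # 2. Replace extreme high outliers (z-score above z_thresh) with NaN.
    z = (s - s.mean()) / s.std()
    s[z > z_thresh] = np.nan
    return s
```

Replacing values with NaN rather than dropping rows keeps the index intact, so downstream users can still choose between interpolating, dropping, or imputing the gaps.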
It would be good to add to the documentation in the wiki the differences between the `raw`, `cleaned` and `processed` folders.