buds-lab / building-data-genome-project-2

Whole building non-residential hourly energy meter data from the Great Energy Predictor III competition
https://www.budslab.org/

Create 'raw' and 'cleaned' version of data #11

Closed cmiller8 closed 4 years ago

cmiller8 commented 4 years ago

It would be good to document in the wiki the differences between the raw, cleaned, and processed folders.


cmiller8 commented 4 years ago

Is there a notebook that converts between raw and cleaned in this repository? What do we use the cleaned data for again?

If it's in the internal-use-only GitHub repository, then let's remove the raw data set so this repository gets a little smaller.

It looks like this notebook converts the raw files into one big file. We might remove that big file -- users can create that file on their own machines using the notebooks. The size of the repository could be reduced quite a bit I think...
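As a rough illustration of how users could rebuild that combined file locally, here is a minimal sketch using only the Python standard library. The file layout (one wide CSV per meter type, with a `timestamp` column followed by one column per building), the function name `combine_raw_files`, and the output column names are all assumptions made for this example, not the notebook's actual code.

```python
import csv
from pathlib import Path


def combine_raw_files(raw_dir, out_path):
    """Merge every CSV in raw_dir into one long-format CSV.

    Assumed (hypothetical) layout: one wide file per meter type, with a
    'timestamp' column followed by one column per building.
    Returns the number of data rows written.
    """
    out_path = Path(out_path)
    # List the inputs before creating the output, and skip the output itself
    # in case it lives in the same folder.
    sources = [
        p for p in sorted(Path(raw_dir).glob("*.csv"))
        if p.resolve() != out_path.resolve()
    ]
    rows_written = 0
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["meter", "timestamp", "building_id", "reading"])
        for path in sources:
            meter = path.stem  # file name without ".csv" used as the meter label
            with open(path, newline="") as in_file:
                for record in csv.DictReader(in_file):
                    timestamp = record.pop("timestamp")
                    # Every remaining column is treated as one building.
                    for building_id, reading in record.items():
                        writer.writerow([meter, timestamp, building_id, reading])
                        rows_written += 1
    return rows_written
```

With something like this documented, the big combined file would not need to be stored in the repository; anyone who wants it can generate it from the raw folder.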

cmiller8 commented 4 years ago

I think I found a clue about the cleaned data folder - it looks like you used to have these files excluded:

image

And it looks like you use the cleaned files in your model prototypes. It would be good to have the code that converts the raw files to cleaned, but let's not include those cleaned files or the combined file in the repository. People who fork the repository can create those aggregated files on their own with the notebook. I think this can reduce the size of the repository significantly.

anjukan commented 4 years ago

Is repository space an issue? I think we should include the cleaned aggregate data. A lot of users will be happy to go with the preprocessing and cleaning we have employed and dive straight into the data. I'm not sure it makes sense for users to have to go through the processing and cleaning steps themselves if that is not something they are interested in.

ponybiam commented 4 years ago

I have just uploaded those cleaned datasets. I will describe the differences in the documentation.

The processed folder will be empty, and I think we could keep only one of the screening or cleaning folders.

cmiller8 commented 4 years ago

I think we should try to get the GitHub repository as small as possible. Please include the code to convert the raw data to the clean and screening data sets, but don't include those data sets in the repository. People can create them from the raw data if they'd like and then perform the other functions; we just need to document that.

Let's instead put the cleaned files on the Kaggle Data page. If people don't want to clean the raw data from the GitHub repository, then we can guide them to that page (yet to be made).

ponybiam commented 4 years ago

I'm removing cleaned and screening data sets from the repository.

cmiller8 commented 4 years ago

Let's try to get this process documented clearly and then we can close this issue.

I will create the Kaggle data page and link to it from the wiki and the readme.md.

ponybiam commented 4 years ago

This wiki page explains what each folder contains. We can link it to the Kaggle data page once it is created.

cmiller8 commented 4 years ago

As discussed today in the call:

Create a raw version of the data that is as similar as possible to the Kaggle competition data, and also a cleaned version that has all the conversions applied, outliers removed, and buildings with too little data removed.
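To make the cleaning steps concrete, here is a minimal sketch of the outlier removal and the low-coverage building filter (the conversions are not covered). It is only an illustration: the function name `clean_readings`, the median-absolute-deviation rule, and both thresholds are assumptions for this example, not the criteria actually used for the project's cleaned data.

```python
from statistics import median


def clean_readings(readings_by_building, min_coverage=0.5, mad_factor=5.0):
    """Return a cleaned copy of {building_id: [hourly readings or None]}.

    Both thresholds are illustrative assumptions, not the project's values.
    """
    cleaned = {}
    for building_id, readings in readings_by_building.items():
        present = [r for r in readings if r is not None]
        if not present:
            continue  # no data at all: drop the building
        centre = median(present)
        # Median absolute deviation: a robust measure of spread.
        mad = median(abs(r - centre) for r in present)
        limit = mad_factor * mad
        # Blank out readings far from the median; if mad is 0, keep everything.
        series = [
            None if r is not None and mad > 0 and abs(r - centre) > limit else r
            for r in readings
        ]
        # Share of hours that still hold a reading after outlier removal.
        coverage = sum(r is not None for r in series) / len(series)
        if coverage >= min_coverage:
            cleaned[building_id] = series
    return cleaned
```

Removing outliers before measuring coverage means a building whose readings are mostly spikes is also dropped, which seems closer to the intent of "too little data" than counting raw rows.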

cmiller8 commented 4 years ago

Keep only one version of the weather data.

ponybiam commented 4 years ago

As we concluded in our meeting, these will be the data sets in this repository:

ponybiam commented 4 years ago

The data sets were created and are in the repository, and the wiki was updated to reflect these changes.