RocketDataScientist / DataHack-2017

DataHack 2017 Challenge

Temporal data format #3

Open ddofer opened 7 years ago

ddofer commented 7 years ago

Is there any chance the data could be uploaded in a cleaner time-stamp format, i.e., one row per point in time per target/entity? The current format has a multitude of messy and missing columns and a ballooning size, and it is not "tidy data".

Thanks! :)

RocketDataScientist commented 7 years ago

We debated the data format at length and decided that the current one is simpler than the suggested 1-row-per-plot format. Consider how simple the data handling is in the provided XGBoost example (readAndSubmit_sample.ipynb). If you still feel strongly that another format is better, we encourage you to write a script for this simple conversion. We will gladly add it to the repo as a community contribution.

RocketDataScientist commented 7 years ago

The data set in its current format was validated in several ways, including an internal hackathon at Rafael. This validation process took significant time and effort. Thus, at this stage we are not willing to make any changes to the official data, to avoid accidentally corrupting it. We will gladly share community contributions, with no warranty.

nathanie commented 7 years ago

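Guys, I guess you wanted something like this:

    # tr_data is the training DataFrame, already loaded
    tr_data.rename(columns={'Unnamed: 0': 'id', 'class': 'target'}, inplace=True)
    # keep 'id' and the feature columns; drop the time and target columns
    train_features = [f for f in tr_data.columns if ('Time' not in f) and ('target' not in f)]
    # fillna returns a new frame, which avoids modifying a slice of tr_data in place
    temp = tr_data.loc[:, train_features].fillna(-999)
    # train_features[0] is 'id', so melt every other column into (id, variable, value) rows
    temp = temp.melt(id_vars='id', value_vars=train_features[1:])

I also added a pull request so that everyone else can use it too.

Enjoy :-)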
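For readers who want the "one row per point in time per entity" layout requested at the top of the thread, the melt above can be taken one step further. The following is only a rough sketch, not the script that was added to the repo: the function name `wide_to_tidy` and the column naming scheme `<feature>_<step>` (for example `posX_0`) are assumptions made for illustration, and the real column names in the data set may differ.

```python
import pandas as pd


def wide_to_tidy(df, id_col="id", sep="_"):
    """Reshape columns named '<feature><sep><step>' into one row per (id, step).

    The naming scheme is hypothetical; adapt the split to the real columns.
    """
    value_cols = [c for c in df.columns if c != id_col]
    long = df.melt(id_vars=id_col, value_vars=value_cols,
                   var_name="column", value_name="value")
    # Split e.g. 'posX_0' into feature 'posX' and time step 0.
    parts = long["column"].str.rsplit(sep, n=1, expand=True)
    long["feature"] = parts[0]
    long["step"] = parts[1].astype(int)
    # Drop missing readings instead of filling them with a sentinel value.
    long = long.dropna(subset=["value"])
    # Turn each feature back into its own column, indexed by (id, step).
    tidy = (long.set_index([id_col, "step", "feature"])["value"]
                .unstack("feature")
                .reset_index())
    tidy.columns.name = None
    return tidy


# Toy data with made-up values: entity 2 has no reading at step 1.
demo = pd.DataFrame({
    "id": [1, 2],
    "posX_0": [0.0, 5.0],
    "posX_1": [1.0, None],
    "posZ_0": [10.0, 50.0],
    "posZ_1": [11.0, None],
})
tidy = wide_to_tidy(demo)
print(tidy)
```

Dropping the missing readings here (rather than filling with -999 as above) keeps the long table from growing with padding rows; whether that is appropriate depends on the model being trained.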

RocketDataScientist commented 7 years ago

Thanks!

We've added your script to the community folder

ddofer commented 7 years ago

Thanks!

(Sent by email in reply to Nathaniel Shimoni's comment of Oct 22, 2017, 10:33 PM.)

nathanie commented 7 years ago

As I wrote in the pull request: while I do think that data preprocessing is part of the hackathon's goals, this can still be resolved quite easily, as shown here.

Nati

RocketDataScientist commented 7 years ago

@nathanie and @shyzaks thanks for your contribution!

ddofer commented 7 years ago

I wrote some code to clean it up into the tidy format. I'll upload it once I see it's all clean and good :)

RocketDataScientist commented 7 years ago

Thanks @ddofer :)

Since the "Issues" section in this project is less for bugs and more a kind of forum, we will reopen the issue so that other people do not miss it.