Open ddofer opened 7 years ago
We debated the data format considerably and decided that the current one is simpler than the suggested 1-row-per-plot format. Consider how simple the data handling is in the provided XGBoost example (readAndSubmit_sample.ipynb). If you still strongly feel that another format is better, we encourage you to write a script for this simple conversion. We will gladly add it to the repo as a community contribution.
The dataset in its current format was validated in several ways, including an internal hackathon at Rafael. This validation process took significant time and effort, so at this stage we are not willing to make any changes to the official data, to avoid accidentally corrupting it. We will gladly share no-warranty community contributions.
Guys, I guess you wanted something like this:

```python
import pandas as pd

# tr_data: the training-set DataFrame loaded from the competition CSV

# Give the unnamed index column and the label column clearer names
tr_data.rename(columns={"Unnamed: 0": "id", "class": "target"}, inplace=True)

# Keep only feature columns (drop the time and label columns)
train_features = [f for f in tr_data.columns if ("Time" not in f) and ("target" not in f)]

temp = tr_data.loc[:, train_features].copy()
temp.fillna(-999, inplace=True)  # mark missing values with a sentinel

# Reshape from wide to long: one row per (id, feature) pair
tidy = temp.melt(id_vars="id", value_vars=train_features[1:])
```

I also added a pull request so that everyone else can use that too.
enjoy :-)
We've added your script to the community folder
Thanks!
As I wrote in the pull request: while I do think that data preprocessing is indeed part of the hackathon's goals, this can still be resolved quite easily, as shown above.
Nati
I wrote some code to clean it up into the tidy format; I'll upload it once I see it's all clean and good :)
Since the "Issues" section of this project serves less as a bug tracker and more as a kind of forum, we will reopen the issue so other people will not miss it.
Is there any chance the data could be uploaded in a cleaner time-stamp format, i.e. one row per point in time per target/entity? (Instead of a multitude of messy, mostly-missing columns and ballooning size, which is also not "tidy data".)
Thanks! :)
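For anyone new to the thread, here is a minimal sketch of the wide-to-long ("tidy") reshaping requested above, using `pandas.DataFrame.melt` on a tiny synthetic frame. The column names (`posX_t0`, `posX_t1`) are hypothetical and not the actual competition schema:

```python
import pandas as pd

# Hypothetical wide-format data: one row per entity, one column per time step
wide = pd.DataFrame({
    "id": [1, 2],
    "posX_t0": [10.0, 20.0],
    "posX_t1": [11.0, 21.0],
})

# Reshape to one row per (entity, measurement) -- the "tidy" layout
tidy = wide.melt(id_vars="id", var_name="feature", value_name="value")
print(tidy)
```

Each of the two entities contributes one row per melted column, so the result has four rows with columns `id`, `feature`, `value`.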