fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
924 stars 196 forks source link

Add check for missing values to GroupByLabels #380

Closed sebhofer closed 5 years ago

sebhofer commented 6 years ago

Throw error when trying to group by column with missing values. This leads to unexpected results (see, e.g., this issue). This is not really a fix, but grouping on missing values seemsill-defined anyway.

zyzhu commented 5 years ago

@sebhofer I checked Pandas' behavior on similar case. It basically drops rows of the group column that has missing value first and then group by it. I would prefer that instead of throwing out an error message. I'll dig a bit to see how to accomplish this efficiently.

sebhofer commented 5 years ago

I'm not sure I agree. If you just drop some rows and move on quietly, the user might never notice that something is off. In an interactive environment, which is, I think, where Deedle is mostly used, I rather I get an error message, and I can fix the problem myself.

tpetricek commented 5 years ago

I think both throwing values away and throwing an exception are sensible design choices - I would have slight preference for exception, because I feel the F# style is to be a bit more explicit than in the Python style (it just means people have to do some extra work to drop before grouping, but that seems OK).

We might make that easier if we added Frame.dropSparseRowsBy "Column" which would drop rows where the value in Column is missing.

zyzhu commented 5 years ago

@tpetricek I'm ok with throwing error if users got the message that some data are missing so that they can check data and clean it up.

@sebhofer Could you submit another commit to throw a little bit more explicit message to suggest user to check missing data in grouping columns?

sebhofer commented 5 years ago

@zyzhu Sure, will do. I can also try to come up with a PR Frame.dropSparseRowsBy "Column" if that's desired. I really like the idea @tpetricek

sebhofer commented 5 years ago

I had a quick look at the grouping function. As frame.GroupByLabels is used by different grouping functions (I guess the relevant ones are GroupRowsBy and GroupRowsUsing), I think it would be most clear and most helpful to the user if there was a different error for each function. Additionally, I will still keep the check in GroupByLabels to be safe. For each case, I will try to find a simple way to correct the problem and give a hint to the user in the error message. @zyzhu please let me know if you don't like this approach.

zyzhu commented 5 years ago

@sebhofer That makes a lot of sense to me. Thanks!

kblohm commented 5 years ago

Hi, i noticed that you already made a PR for Frame.dropSparseRowsBy, but you also add the function in this PR. Was this intended? There are also quite a lot of merge-commits. Would you mind rebasing/squashing this?

sebhofer commented 5 years ago

@kblohm No, that was not intended! I am new to forking projects and creating PRs so I really messed this up yesterday. Also, I didn't realize that everything I push to the master branch gets added to the PR. I'm not quite sure how to proceed with this. I'm guessing it's best to close this and open a new PR on a dedicated branch. Any input is welcome!

kblohm commented 5 years ago

You could clean this one up, but as far as i know you can not change the source branch. I would just close this and create a branch for your changes and then open a new PR. If you need any help with this, let me know. :)

zyzhu commented 5 years ago

@sebhofer You could sync your fork and then create another branch. Then commit just the change of throwing exception https://github.com/fslaborg/Deedle/pull/380/commits/e85244dae559fbd5ed98bea623aecc4ba8d3a930 and submit the pull.

I'll merge it after that. BTW, @kblohm just changed the build using FAKE 5. After syncing with upstream, you need to do the following to get started.

  1. Install FAKE cli dotnet tool install fake-cli -g
  2. To build, run the following command under Deedle folder fake build target allcore
kblohm commented 5 years ago

You do not need to install fake globally, you can just run fake.cmd / fake.sh and that will install it just for this project. I am also using the global version, but it is not a requirement.

sebhofer commented 5 years ago

@kblohm Yeah, that seems like the cleanest solution. I think I'll be able to handle it... but you never know. I'll let you know otherwise. Thanks for your help!

@zyzhu I'll try to do this tonight. On another note, did you see the issue I opened about working with the Deedle src? I'm struggling a bit with this and could use some advice. Or is this a trivial problem and I'm just missing something really obvious?

zyzhu commented 5 years ago

@sebhofer I did see that. I am traveling for work this week. Will write up some drafts next week when I get back to office. But I will need some help from @tpetricek as he knows the original work more inside out than me as new comer too.

Thanks to @kblohm, we are on FAKE 5 already. Just follow what he wrote above and you shall be able to build the source.

As for how to work with fsx and source code. I will try to write up some sample cases next week.

sebhofer commented 5 years ago

@zyzhu Perfect, thanks. I just wasn't sure if it was just me being stupid!

sebhofer commented 5 years ago

@kblohm I might take you up on offer afterall :) How do I best clean this up? As I now have all these merge-commits in both master and origin/master I'm worried that by rebasing I'll make things worse... or is that not a problem?

kblohm commented 5 years ago

Did you mange to fix this? If you still need to cleanup your master branch, you can

git fetch upstream
git checkout master
git reset upstream/master --hard

This will reset all your changes, but it looks like you already have a new branch with the stuff you wanted to keep. To get this up to your origin, you will have to force push.

zyzhu commented 5 years ago

I've just merged the new pull.

sebhofer commented 5 years ago

@kblohm I did manage to create a new branch for PR, but I still didn’t know how to fix my master. So this is exactly what I needed, thx! Will try tomorrow.

@zyzhu Great, thank you!