data-8 / datascience

A Python library for introductory data science
BSD 3-Clause "New" or "Revised" License

Missing value handling #162

Closed stefanv closed 7 years ago

stefanv commented 8 years ago

Speaking to @cboettig today, he mentioned that a lot of datasets he's dealing with have missing values.

What is our approach to handling these? Using numpy as underlying storage, we'll run into problems with any data-type other than floats. Alternatives include: using Pandas DataFrames as the backend, or using NumPy MaskedArrays.
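For context, a minimal sketch of the two behaviors Stefan alludes to: `np.nan` is itself a float, so any NumPy column containing one is forced to a float dtype, whereas a masked array keeps the original dtype and marks entries as missing instead.

```python
import numpy as np

# np.nan is a float, so a missing value forces the whole array to float dtype:
a = np.array([1, 2, np.nan])
print(a.dtype)  # float64

# A masked array preserves the original dtype and simply flags the
# missing entry, excluding it from reductions like sum():
m = np.ma.array([1, 2, 3], mask=[False, False, True], dtype=int)
print(m.dtype)  # int dtype preserved (int64 on most platforms)
print(m.sum())  # 3 -- the masked third entry is excluded
```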

cboettig commented 8 years ago

Certainly a tricky issue -- pandas does appear to have pretty rich support for this, but with that flexibility also comes increased complexity and more gotchas, where intuition from base Python and numpy is harder to apply to pandas objects (which helps me further appreciate why the foundations team chose to create datascience instead of teaching pandas). Perhaps there are clever ways around this, but maybe it is just out of scope for datascience: connectors can either focus on the kinds of things that can be done with the tools of the foundation course, or take on the burden of introducing a new tool to address issues not covered there (e.g. introducing a bit of pandas, maybe more as a preview than a skill to master, just to handle missing data). At least we already have the coercion methods between Tables and DataFrames.

choldgraf commented 8 years ago

My 2 cents is that pandas is super powerful for these kinds of situations, but as you say it's a PITA to really teach. I'd try to steer away from having lots of NaNs in data during the class, and (maybe at the end?) spend a little bit of time showing what "real" data looks like and mentioning some of the tools we have for dealing with it (e.g., pandas, string parsing, etc.). To me that stuff is more like "advanced data munging" and is maybe a step beyond an intro course.

choldgraf commented 8 years ago

and FWIW it's also probably something that most students won't appreciate until they've run into some problems like these on their own.

I remember being totally confused as to why it was such a big deal that dataframes can easily contain different types of data, along with missing values, etc. After doing enough data analysis it made me appreciate it later on, but that took some time.

mijordan3 commented 8 years ago

Carl,

My 2c on the missing data issue: The course design emphasized that every idea should have an inferential side, a computational side and a real world side. Missing data certainly has the latter. Computationally indeed one could get into the representational issues you're mentioning. Inferentially, the problem is that of imputing values for the missing variables, not merely representing them, and doing that right requires something like the multiple regression perspective that we mostly punted on in this course, at least in its first instantiation. I do think that missing values (and multiple regression) should be a major focus of a followup course.

Mike


deculler commented 8 years ago

John and I went over this issue quite a bit last summer. There is no question that handling missing values is a strength of pandas. And it is extremely subtle. Addressing it while these students are in their very first exposure is too complicated and distracting. Recognizing where it arises and dealing with it as part of cleaning and/or curating is advised. You will find that you need to do some of that regardless to bring things down to the freshman level. A follow-on course, probably at the upper division, can reasonably tackle it. I did a huge number of example analyses with all sorts of messy public data - in dsten/demos - to verify that this was reasonable. Every area has its "professional strength" aspects - whether it be arcgis, or pandas, or transys, or any of a number of things. But you cannot have everything first. You have to start somewhere, build a foundation, and then build structure upon that.

I'd be happy to look with you at the data set and analysis where this is arising. My hunch is that there is a natural workaround that makes it better suited to freshmen.

cboettig commented 8 years ago

Thanks @mijordan3 @deculler @papajohn, your perspective and experience here are really helpful (particularly for someone like me with limited experience teaching at this level). As a new user to Python learning both Tables and pandas, I can relate somewhat: Tables avoids quite a few barriers and issues that pandas creates through how it handles indexing, etc.

David, I'd love to look at any of your demos that show a bit of handling of missing data. I don't think dsten/demos exists (??), but I have found the examples you have in deculler/TableDemos very helpful. In particular, the MappingWrangling... notebook seems pretty relevant to this discussion, though my quick skim didn't show where any of the columns that contain NaNs are actually used. Nevertheless, it was interesting and surprising to realize that columns in Tables can have mixed type, with NaN, float, and str in the same column (such as the "Units" column of the Parcels.csv data shown there). Columns like that are the ones I have found tricky to use with apply functions that need to handle missing data, even at the most trivial level appropriate for this course (e.g. just omitting it).

For instance, here is one example that @stefanv originally helped me with that required a bit of explicit NaN handling: https://github.com/dsten/ecology-connector/blob/master/fish/fish.ipynb , where we just need to take care that float < missing returns missing instead of False. Notably, the strategy we take there fails on columns of mixed type, since we can't call np.isnan on data that is sometimes a float (np.nan) and sometimes a string. (The lack of conveniences like df.fillna() or df.dropna() sometimes makes the required steps look more complicated than they are.)
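A minimal illustration of the mixed-type problem Carl describes, using a hypothetical column like the "Units" example above; pandas' `pd.isnull` is one helper that does cope with such data:

```python
import numpy as np
import pandas as pd

# A mixed-type column: float, NaN, and str together in one object array.
col = np.array([1.5, np.nan, "two"], dtype=object)

# np.isnan cannot handle object arrays containing strings:
try:
    np.isnan(col)
except TypeError:
    print("np.isnan fails on mixed-type data")

# pd.isnull checks element-wise and copes with any mix of types:
mask = pd.isnull(col)
print(mask)  # True only at the missing entry
```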

Anyway, thanks again for your thoughts. I will be doing my best to avoid pulling freshmen into needless missing-value complexity, but just trying to see myself clear to the best way of doing so.

stefanv commented 8 years ago

A pragmatic first step may be to offer a few different ways to handle missing values when loading a Table. That should fill the gap described in Carl's second-to-last paragraph, right?

papajohn commented 8 years ago

I think a reasonable solution for missing value handling is to load data as a pandas dataframe, process the missing values there, then convert to a Table.
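A rough sketch of that workflow, with a made-up CSV for illustration (the final conversion is shown commented, assuming the library's `Table.from_df` coercion method mentioned earlier in the thread):

```python
import io
import pandas as pd

# A small made-up dataset with a missing value in the "length" column.
csv = io.StringIO("species,length\nbass,12.0\nperch,\ntrout,9.5\n")

# Load and clean in pandas, where missing-value tools are rich:
df = pd.read_csv(csv)
df = df.dropna(subset=["length"])  # or df.fillna(...) to impute instead

# Then hand the cleaned frame to datascience:
# from datascience import Table
# t = Table.from_df(df)
print(len(df))  # 2 rows remain after dropping the missing one
```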

cboettig commented 8 years ago

@papajohn Thanks, that's good to hear! I've recently been doing just that in my notes so far and it works rather nicely. Sometimes the change back to Table feels pedantic, e.g. when I'm just going to call df.plot() anyway since both pandas and Table have plot methods.

stefanv commented 8 years ago

I wonder if it would make sense, in the long run, to turn Table into a view of a pandas DataFrame. That would have the advantage that advanced features stay available via an easy table.df.api_call.
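A hypothetical sketch of that idea (not the actual datascience implementation): the Table wraps a DataFrame and exposes it through a `.df` attribute, so the full pandas API stays one attribute-access away.

```python
import pandas as pd

class Table:
    """Toy Table that is just a view over a pandas DataFrame."""

    def __init__(self, df):
        self.df = df  # the underlying DataFrame, exposed directly

    def column(self, label):
        # A Table-style accessor returning a plain numpy array.
        return self.df[label].values

t = Table(pd.DataFrame({"x": [1, 2, 3]}))
print(t.column("x").sum())  # 6 -- simple Table-level access
print(t.df.describe())      # advanced features via t.df
```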

papajohn commented 8 years ago

Yep — nice idea. I'm trying out such an implementation now. I should have it working soon.
