datacarpentry / python-ecology-lesson

Data Analysis and Visualization in Python for Ecologists
https://datacarpentry.org/python-ecology-lesson
Other
162 stars 308 forks source link

Issue in episode 4 (data types and formats): replace with means of a df #410

Closed Benfeitas closed 5 years ago

Benfeitas commented 5 years ago

Note the line df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean()) in episode_4. Or the web version

I'm thinking that this should be

df1['weight'] = surveys_df['weight'].copy().fillna(surveys_df['weight'].mean())

Note that in this way we are working on the copy, rather than on a reference for survey_df.

wrightaprilm commented 5 years ago

Is that necessary since we're writing to a copy? Not opposed, just curious.

Benfeitas commented 5 years ago

@wrightaprilm In this case I think it is safest to specify .copy(): not doing so may modify surveys_df which is the original df right?

wrightaprilm commented 5 years ago

You're probably right. I'm not as good as I should be about making copies of dataframes. When you call .copy() without an assignment, is that copy ephemeral? I'm looking at this and thinking that maybe what we should have is a small reminder box:

Reminder

Python uses 0-based indexing. {: .callout}

on copying. When to do it, scope of copied objects, etc. That's not a commitment that you have to make it - I can toss it in our contributing guide for new learners.

Anyway, so, in the immediate term: Let me know if you'd like to submit a PR with the changed line from your initial post, or if you'd rather that I do it. And let me know, too, if you might be interested in a copying call-out box.

Benfeitas commented 5 years ago

@wrightaprilm sorry for the delay in coming back to you.

To be honest, I'm not sure what happens behind the scenes when you do .copy(), but I think this creates a new object in memory, and thus any changes that you do to df1 will not be carried to surveys_df(). I've learned this the hard way in other occasions: sometimes, when I didn't do .copy(), and masked or changed my df1, the changes would be carried over to surveys_df.

I'll make the PR for the line above

maxim-belkin commented 5 years ago

Apologies for joining the discussion so late.

So, I think there is some confusion...

When dealing with DataFrame selections -- yes, we have to be careful as we might end up modifying the original data set (e.g. view or a reference). However, we've got .fillna() method which, unless inplace = True is specified, returns a copy of the DataFrame being processed.

After a brief look at the source code of pandas, I couldn't find where it would not honor that inplace option... So, I don't think we need that .copy because of .fillna.