DaveBackus / Data_Bootcamp

Materials for a course at NYU Stern using Python to study economic and financial data.
MIT License

pandas cleaning notebook #14

Closed sglyon closed 8 years ago

sglyon commented 8 years ago

Below is a list of comments. If you want me to make any of the changes let me know:

DaveBackus commented 8 years ago

Question: what do you make of the rsplit keyword changing from maxsplit=1 in pure Python to n=1 in pandas? Are these the same methods?

On Mon, Apr 4, 2016 at 7:51 AM, Spencer Lyon notifications@github.com wrote:

Below is a list of comments. If you want me to make any of the changes let me know:

  • When you start the section on string methods you include the pure python example of making "$123.45" a float two times. I think just once is enough.
  • I've often used string methods and pd.to_datetime to convert three numeric columns for year, month, day into a single column with a pandas date time type.
  • When you introduce selecting variables and observations I also hear the term indexing often (in addition to subsetting, filtering, and slicing)
  • There is a typo in the 4th bullet point when showing all the ways you can index into df. You wrote df[nlist]] instead of df[nlist]
  • When talking about the boolean selection we might want to introduce the query method. It is very concise, and it compiles the expressions and runs them more efficiently than we do when constructing these Series/DataFrames of booleans by hand. I also like that we don't have to manage the boolean objects ourselves: there's less room for bugs when we handle fewer temporary variables.
  • Formatting preferences: for the exercise at the bottom, I'd include the list of questions after the code snippet that loads the data instead of before it.
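
A minimal sketch of the year/month/day conversion mentioned in the second bullet. The column names and values are made up for illustration and are not taken from the notebook. It shows two routes: building date strings with string methods and parsing them, and passing the three columns to pd.to_datetime directly.

```python
import pandas as pd

# Hypothetical data: three numeric columns holding the parts of a date.
df = pd.DataFrame({"year": [2015, 2016], "month": [12, 4], "day": [31, 4]})

# Route 1: build "YYYY-MM-DD" strings with string methods, then parse them.
as_text = (
    df["year"].astype(str)
    + "-" + df["month"].astype(str).str.zfill(2)
    + "-" + df["day"].astype(str).str.zfill(2)
)
df["date_from_text"] = pd.to_datetime(as_text, format="%Y-%m-%d")

# Route 2: pd.to_datetime also accepts a DataFrame whose columns are
# named year, month, and day, and assembles the dates itself.
df["date_direct"] = pd.to_datetime(df[["year", "month", "day"]])

print(df)
```

Both routes should produce the same datetime column; the second is shorter when the columns already have these names.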

sglyon commented 8 years ago

I don't think they are the same. Look at the different pandas series str method names:

In [8]: s = pd.Series(pd.np.random.randn(10), dtype=object)

In [9]: s.str.
s.str.capitalize     s.str.find           s.str.islower        s.str.lstrip         s.str.rjust          s.str.swapcase
s.str.cat            s.str.findall        s.str.isnumeric      s.str.match          s.str.rpartition     s.str.title
s.str.center         s.str.get            s.str.isspace        s.str.normalize      s.str.rsplit         s.str.translate
s.str.contains       s.str.get_dummies    s.str.istitle        s.str.pad            s.str.rstrip         s.str.upper
s.str.count          s.str.index          s.str.isupper        s.str.partition      s.str.slice          s.str.wrap
s.str.decode         s.str.isalnum        s.str.join           s.str.repeat         s.str.slice_replace  s.str.zfill
s.str.encode         s.str.isalpha        s.str.len            s.str.replace        s.str.split
s.str.endswith       s.str.isdecimal      s.str.ljust          s.str.rfind          s.str.startswith
s.str.extract        s.str.isdigit        s.str.lower          s.str.rindex         s.str.strip

Compared to methods on a python string:

In [9]: s1 = "foo"

In [10]: s1.
s1.capitalize    s1.expandtabs    s1.isalpha       s1.isprintable   s1.lower         s1.rindex        s1.splitlines    s1.upper
s1.casefold      s1.find          s1.isdecimal     s1.isspace       s1.lstrip        s1.rjust         s1.startswith    s1.zfill
s1.center        s1.format        s1.isdigit       s1.istitle       s1.maketrans     s1.rpartition    s1.strip
s1.count         s1.format_map    s1.isidentifier  s1.isupper       s1.partition     s1.rsplit        s1.swapcase
s1.encode        s1.index         s1.islower       s1.join          s1.replace       s1.rstrip        s1.title
s1.endswith      s1.isalnum       s1.isnumeric     s1.ljust         s1.rfind         s1.split         s1.translate

It is unfortunate that they chose the same method name but different argument names. I'm not sure there is much more to say about it.
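
A short side-by-side sketch of the two spellings, using made-up country names rather than anything from the notebook. The builtin str.rsplit takes maxsplit, while the pandas .str.rsplit accessor takes n; with the same separator and limit they split each string the same way.

```python
import pandas as pd

# Builtin str method: the limit on the number of splits is called maxsplit.
name = "United States of America"
plain = name.rsplit(" ", maxsplit=1)

# pandas .str accessor: the same limit is called n, and it applies
# to every element of the Series.
countries = pd.Series(["United States of America", "Korea, Rep. of"])
vectorized = countries.str.rsplit(" ", n=1)

print(plain)
print(vectorized.tolist())
```

Passing expand=True to the pandas version returns the pieces as DataFrame columns instead of lists, which the builtin method has no counterpart for.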

DaveBackus commented 8 years ago

Ah, good to know.

What's curious here is that help generates docs for the second (the pure Python version). E.g., df['var'].str.rsplit?

sglyon commented 8 years ago

Not sure I follow. Continuing from my same example:

In [10]: s.str.rsplit?
Signature: s.str.rsplit(pat=None, n=-1, expand=False)
Docstring:
Split each string in the Series/Index by the given delimiter
string, starting at the end of the string and working to the front.
Equivalent to :meth:`str.rsplit`.

.. versionadded:: 0.16.2

Parameters
----------
pat : string, default None
    Separator to split on. If None, splits on whitespace
n : int, default -1 (all)
    None, 0 and -1 will be interpreted as return all splits
expand : bool, default False
    * If True, return DataFrame/MultiIndex expanding dimensionality.
    * If False, return Series/Index.

Returns
-------
split : Series/Index or DataFrame/MultiIndex of objects
File:      ~/anaconda3/lib/python3.5/site-packages/pandas/core/strings.py
Type:      method

In [11]: s1.rsplit?
Docstring:
S.rsplit(sep=None, maxsplit=-1) -> list of strings

Return a list of the words in S, using sep as the
delimiter string, starting at the end of the string and
working to the front.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified, any whitespace string
is a separator.
Type:      builtin_function_or_method

DaveBackus commented 8 years ago

We're getting different docs for the string methods. If I try

docs['Country'].str.rsplit?

I get the maxsplit version.

sglyon commented 8 years ago

That's a mystery to me. Not sure what's going on there.

DaveBackus commented 8 years ago

I have now made all the changes but one, though I have yet to sync with GitHub. The exception: I didn't add the query method. After playing around, it seemed to me not to apply to the same examples, so I skipped it. Maybe next time?

Thanks again.

DaveBackus commented 8 years ago

Done. Except query, which we can think about next time.
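
For reference, a minimal sketch of the query suggestion that was left open in this thread. The DataFrame, its column names, and its values are invented for illustration; they are not the notebook's examples. It compares a hand-built boolean mask with the equivalent query call.

```python
import pandas as pd

# Hypothetical data, not the notebook's dataset.
df = pd.DataFrame({
    "country": ["USA", "FRA", "JPN", "CHN"],
    "gdp": [18.0, 2.4, 4.4, 11.1],
    "population": [321, 66, 127, 1371],
})

# Boolean selection by hand: build a temporary mask, then index with it.
mask = (df["gdp"] > 3) & (df["population"] < 1000)
by_mask = df[mask]

# The same selection with query: the condition is written as a string
# and no intermediate boolean Series is kept around.
by_query = df.query("gdp > 3 and population < 1000")

print(by_query)
```

The two selections return the same rows. Whether query is actually faster depends on the data size and on optional dependencies such as numexpr, so the sketch makes no claim about speed.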