Closed sglyon closed 8 years ago
Question: what do you make of the rsplit method changing from maxsplit=1 in pure Python to n=1 in Pandas. Are these the same methods?
On Mon, Apr 4, 2016 at 7:51 AM, Spencer Lyon notifications@github.com wrote:
Below is a list of comments. If you want me to make any of the changes let me know:
- When you start the section on string methods you include the pure Python example of converting "$123.45" to a float twice. I think once is enough.
- I've often used string methods and pd.to_datetime to convert three numeric columns for year, month, and day into a single column with a pandas datetime type.
- When you introduce selecting variables and observations: I also often hear the term indexing (in addition to subsetting, filtering, and slicing).
- There is a typo in the 4th bullet point when showing all the ways you can index into df. You wrote df[nlist]] instead of df[nlist].
- When talking about boolean selection we might want to introduce the query method. It is very concise, and it compiles the expression and runs it more efficiently than when we construct these Series/DataFrames of booleans by hand. I also like that we don't have to manage the boolean objects ourselves -- there's less room for bugs when we handle fewer temporary variables.
- Formatting preference: for the exercise at the bottom, I'd put the list of questions after the code snippet that loads the data instead of before it.
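For the year/month/day comment in the list above, here is a minimal sketch of what that conversion might look like. The DataFrame and its column names (`year`, `month`, `day`) are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Hypothetical data: three numeric columns holding the parts of a date.
df = pd.DataFrame({"year": [2015, 2016], "month": [12, 4], "day": [31, 4]})

# String-method route: zero-pad month and day, join with "-", then parse.
as_text = (
    df["year"].astype(str)
    .str.cat(df["month"].astype(str).str.zfill(2), sep="-")
    .str.cat(df["day"].astype(str).str.zfill(2), sep="-")
)
df["date"] = pd.to_datetime(as_text, format="%Y-%m-%d")

# Alternative: to_datetime can also assemble dates directly from
# columns named year, month, and day.
df["date_alt"] = pd.to_datetime(df[["year", "month", "day"]])
```

Both routes should give the same datetime column; the second one avoids the string step entirely, at the cost of requiring those exact column names.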
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/DaveBackus/Data_Bootcamp/issues/14
I don't think they are the same. Look at the different pandas Series str method names:
In [8]: s = pd.Series(pd.np.random.randn(10), dtype=object)
In [9]: s.str.
s.str.capitalize s.str.find s.str.islower s.str.lstrip s.str.rjust s.str.swapcase
s.str.cat s.str.findall s.str.isnumeric s.str.match s.str.rpartition s.str.title
s.str.center s.str.get s.str.isspace s.str.normalize s.str.rsplit s.str.translate
s.str.contains s.str.get_dummies s.str.istitle s.str.pad s.str.rstrip s.str.upper
s.str.count s.str.index s.str.isupper s.str.partition s.str.slice s.str.wrap
s.str.decode s.str.isalnum s.str.join s.str.repeat s.str.slice_replace s.str.zfill
s.str.encode s.str.isalpha s.str.len s.str.replace s.str.split
s.str.endswith s.str.isdecimal s.str.ljust s.str.rfind s.str.startswith
s.str.extract s.str.isdigit s.str.lower s.str.rindex s.str.strip
Compared to methods on a python string:
In [9]: s1 = "foo"
In [10]: s1.
s1.capitalize s1.expandtabs s1.isalpha s1.isprintable s1.lower s1.rindex s1.splitlines s1.upper
s1.casefold s1.find s1.isdecimal s1.isspace s1.lstrip s1.rjust s1.startswith s1.zfill
s1.center s1.format s1.isdigit s1.istitle s1.maketrans s1.rpartition s1.strip
s1.count s1.format_map s1.isidentifier s1.isupper s1.partition s1.rsplit s1.swapcase
s1.encode s1.index s1.islower s1.join s1.replace s1.rstrip s1.title
s1.endswith s1.isalnum s1.isnumeric s1.ljust s1.rfind s1.split s1.translate
It is unfortunate that they chose the same method name but different argument names. I'm not sure there is much more to say about it.
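To make the naming difference concrete, a small side-by-side sketch. The example strings are invented; the parameter names (`maxsplit` for the built-in, `pat` and `n` for pandas) are the ones shown in the two docstrings in this thread.

```python
import pandas as pd

text = "New York City"

# Built-in str: the limit on the number of splits is called maxsplit.
py_result = text.rsplit(" ", maxsplit=1)

# pandas .str accessor: the same limit is called n, and the separator is pat.
s = pd.Series([text, "Los Angeles"])
pd_result = s.str.rsplit(pat=" ", n=1)

print(py_result)           # ['New York', 'City']
print(pd_result.tolist())  # [['New York', 'City'], ['Los', 'Angeles']]
```

So the behavior lines up element by element; only the keyword names differ.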
Ah, good to know.
What's curious here is that help generates docs for the second one (the pure Python version), e.g. for df['var'].str.rsplit?
Not sure I follow. Continuing from my same example:
In [10]: s.str.rsplit?
Signature: s.str.rsplit(pat=None, n=-1, expand=False)
Docstring:
Split each string in the Series/Index by the given delimiter
string, starting at the end of the string and working to the front.
Equivalent to :meth:`str.rsplit`.
.. versionadded:: 0.16.2
Parameters
----------
pat : string, default None
    Separator to split on. If None, splits on whitespace
n : int, default -1 (all)
    None, 0 and -1 will be interpreted as return all splits
expand : bool, default False
    * If True, return DataFrame/MultiIndex expanding dimensionality.
    * If False, return Series/Index.
Returns
-------
split : Series/Index or DataFrame/MultiIndex of objects
File: ~/anaconda3/lib/python3.5/site-packages/pandas/core/strings.py
Type: method
In [11]: s1.rsplit?
Docstring:
S.rsplit(sep=None, maxsplit=-1) -> list of strings
Return a list of the words in S, using sep as the
delimiter string, starting at the end of the string and
working to the front. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified, any whitespace string
is a separator.
Type: builtin_function_or_method
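If the `?` help is giving confusing results, one way to check which signature you are actually looking at is to ask Python directly. This is only a sketch using the standard `inspect` module; the exact parameter lists printed may vary with the pandas and Python versions installed.

```python
import inspect
import pandas as pd

s = pd.Series(["a b c"])

# Parameter names of the pandas accessor method vs. the built-in str method.
pandas_params = list(inspect.signature(s.str.rsplit).parameters)
builtin_params = list(inspect.signature(str.rsplit).parameters)

print(pandas_params)
print(builtin_params)
```

If `maxsplit` shows up in the first list, you are not looking at the pandas accessor method.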
We're getting different docs for the string methods. If I try
docs['Country'].str.rsplit?
I get the maxsplit version.
That's a mystery to me. Not sure what's going on there.
I've now made all the changes but have yet to sync with GitHub. The one exception: I didn't add query. After playing around, it didn't seem to apply to the same examples, so I skipped it. Maybe next time?
Thanks again.
Done. Except query, which we can think about next time.
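For reference when query comes up again, a minimal sketch of the comparison being discussed: the same selection done with hand-built boolean masks and with `DataFrame.query`. The data and column names (`country`, `gdp`, `population`) are invented for illustration and are not from the notebook.

```python
import pandas as pd

# Hypothetical data, not from the notebook.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdp": [1.5, 0.4, 3.2, 2.1],
    "population": [10, 250, 40, 300],
})

# Boolean selection by hand: build the masks, then combine them.
big_gdp = df["gdp"] > 1
small_pop = df["population"] < 100
by_mask = df[big_gdp & small_pop]

# Same selection with query: one expression string, no temporary masks.
by_query = df.query("gdp > 1 and population < 100")
```

Whether query is actually faster likely depends on the size of the data and on whether the optional numexpr package is installed; the main gain in a small example like this is readability.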