Open MarcoGorelli opened 11 months ago
I'm happy to add something like this! I'd prefer if we had a non-trivial set of rules in mind (e.g., at least five or so?) before we started to add them. I want to avoid a situation in which we create a new category, add one rule, then fail to expand it to a meaningful set.
Sure, thanks! For a start, there are all the rewrites from https://github.com/pola-rs/polars/issues/9968, such as
- pl.col('a').map_elements(lambda x: np.sin(x))
+ pl.col('a').sin()
- pl.col('a').map_elements(lambda x: x+1)
+ (pl.col('a') + 1)
- pl.col('a').map_elements(lambda x: json.loads(x))
+ pl.col("a").str.json_extract()
- pl.col('a').map_elements(lambda x: dt.datetime.strptime(x, "%Y-%m-%d"))
+ pl.col('a').str.to_datetime(format='%Y-%m-%d')
- pl.col('a').map_elements(lambda x: x.upper())
+ pl.col("a").str.to_uppercase()
Within Polars, warnings are emitted for some of these by parsing the bytecode of the passed function. Since Ruff works on the AST, I'd expect it to be possible to cover a lot more of that list.
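To make the AST point concrete, here is a minimal sketch using Python's standard `ast` module (Ruff itself is written in Rust, so this is only an illustration of the idea, not how Ruff would implement it). The function name `find_map_elements_rewrites` and the `STR_METHOD_REWRITES` table are hypothetical, and only the `upper` rewrite from the list above is covered.

```python
import ast

# Hypothetical table: Python str method -> Polars expression method.
# Only the pair taken from the rewrite list above is included.
STR_METHOD_REWRITES = {"upper": "str.to_uppercase"}


def find_map_elements_rewrites(source: str) -> list[tuple[int, str]]:
    """Return (line, suggestion) for `.map_elements(lambda x: x.<method>())`."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Match a call shaped like <expr>.map_elements(<lambda>)
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "map_elements"
            and len(node.args) == 1
            and isinstance(node.args[0], ast.Lambda)
        ):
            continue
        lam = node.args[0]
        if len(lam.args.args) != 1:
            continue
        param = lam.args.args[0].arg
        body = lam.body
        # Match a lambda body shaped like <param>.<method>() with no arguments
        if (
            isinstance(body, ast.Call)
            and not body.args
            and not body.keywords
            and isinstance(body.func, ast.Attribute)
            and isinstance(body.func.value, ast.Name)
            and body.func.value.id == param
            and body.func.attr in STR_METHOD_REWRITES
        ):
            receiver = ast.unparse(node.func.value)
            replacement = STR_METHOD_REWRITES[body.func.attr]
            findings.append((node.lineno, f"{receiver}.{replacement}()"))
    return findings
```

Because the match is purely syntactic, it cannot tell whether the receiver is really a Polars expression; a real rule would presumably also check that `polars` is imported.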
The full list of test cases is here, there's quite a few already:
Any read operation followed by a lazy is very fishy. E.g. pl.read_parquet(..).lazy() should suggest pl.scan_parquet(..). The same goes for all the file types we support scanning.
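As a rough sketch of how the read-then-lazy check could look at the AST level, the snippet below uses Python's standard `ast` module. The function name `find_read_then_lazy` and the `READ_TO_SCAN` table are hypothetical; the table lists a few read/scan pairs as an assumption, and a real rule would need the complete set of scannable formats.

```python
import ast

# Assumed read -> scan pairs; not a complete list of scannable formats.
READ_TO_SCAN = {
    "read_parquet": "scan_parquet",
    "read_csv": "scan_csv",
    "read_ipc": "scan_ipc",
    "read_ndjson": "scan_ndjson",
}


def find_read_then_lazy(source: str) -> list[tuple[int, str]]:
    """Return (line, message) for calls shaped like <x>.read_*(...).lazy()."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Outer call: <something>.lazy() with no arguments
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "lazy"
            and not node.args
            and not node.keywords
        ):
            continue
        inner = node.func.value
        # Inner call: <module>.read_xxx(...) for a known read function
        if (
            isinstance(inner, ast.Call)
            and isinstance(inner.func, ast.Attribute)
            and inner.func.attr in READ_TO_SCAN
        ):
            read_name = inner.func.attr
            scan_name = READ_TO_SCAN[read_name]
            findings.append(
                (node.lineno, f"use {scan_name} instead of {read_name}(...).lazy()")
            )
    return findings
```

The check only fires when `.lazy()` is chained directly onto the read call; a frame that is read, assigned to a variable, and made lazy later would need data-flow analysis that this sketch does not attempt.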
One more suggestion in the 'lazy' category:
- DataFrame(...).lazy()
+ LazyFrame(...)
Maybe one for assertions (the equality statements would result in an error):
- assert s1 == s2
+ assert_series_equal(s1, s2)
- assert df1 == df2
+ assert_frame_equal(df1, df2)
- assert lf1 == lf2
+ assert_frame_equal(lf1, lf2)
- assert s1 != s2
+ assert_series_not_equal(s1, s2)
...
One for select/with_columns:
- df.select(pl.all(), ...)
+ df.with_columns(...)
- df.select(pl.col("*"), ...)
+ df.with_columns(...)
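A possible AST-level sketch of this select/with_columns check, again with Python's standard `ast` module. The helper names `is_all_columns` and `find_select_all` are hypothetical, and the match is a syntactic heuristic: it assumes any `<name>.all()` or `<name>.col("*")` refers to Polars.

```python
import ast


def is_all_columns(node: ast.expr) -> bool:
    """True for calls shaped like <name>.all() or <name>.col("*")."""
    if not (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        # Require a bare name as receiver so pl.col('a').all() is not matched
        and isinstance(node.func.value, ast.Name)
    ):
        return False
    if node.func.attr == "all":
        return not node.args and not node.keywords
    if node.func.attr == "col":
        return (
            len(node.args) == 1
            and not node.keywords
            and isinstance(node.args[0], ast.Constant)
            and node.args[0].value == "*"
        )
    return False


def find_select_all(source: str) -> list[int]:
    """Return line numbers of .select(<all columns>, <something else>) calls."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "select"
            and node.args
            and is_all_columns(node.args[0])
            # Only flag when further expressions follow the "all columns" one
            and len(node.args) + len(node.keywords) > 1
        ):
            lines.append(node.lineno)
    return lines
```

A plain `df.select(pl.all())` is deliberately left alone, since there are no extra expressions that `with_columns` would take over.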
Keyword syntax in select/with_columns:
- df.select(pl.col('a').abs().alias('abs'))
+ df.select(abs=pl.col('a').abs())
Keyword syntax in filter:
- df.filter(pl.col('a') == 'foo')
+ df.filter(a='foo')
Using positional args instead of lists where possible:
- df.sort(['a', 'b'])
+ df.sort('a', 'b')
...I'm sure I can come up with more :smile: @MarcoGorelli Is this enough input?
Hello,
I've noticed that ruff has a pandas-vet plugin. Would you be open to adding a Polars-vet one? It could make suggestions such as
or
which can have a real impact on performance
I could try putting something together if you'd be open to it