ankane / ruby-polars

Blazingly fast DataFrames for Ruby

Error when using `contains` dynamically: expected `String`, got `binary` #77

Closed ibrykov-mdsol closed 2 months ago

ibrykov-mdsol commented 2 months ago

Hi!

Obligatory, thank you for bringing polars to ruby! Also, I'm super new to data framing, so I might not have the right vocabulary.

Here's my code sample:

    df = Polars::DataFrame.new({ a: %w[123 456 abc] })
    p df.select([
      Polars.col("a").str.contains("\\d")
    ])
    p df.select([
      Polars.col("a").str.contains(/\d/.source)
    ])

It gives me the following:

shape: (3, 1)
┌───────┐
│ a     │
│ ---   │
│ bool  │
╞═══════╡
│ true  │
│ true  │
│ false │
└───────┘

Minitest::UnexpectedError: Polars::Error: invalid series dtype: expected `String`, got `binary`

As you can see, in the first case I'm using a plain "\\d", and in the second case /\d/.source. IMO, there should be no difference between them, but the second one fails, claiming it got binary instead of String. My assumption is that the first string is created at compile time while the second one is dynamic (created at run time).

Please let me know if there is a workaround or it can be fixed in the library.

ankane commented 2 months ago

Hi @ibrykov-mdsol, the difference is that one is a UTF-8 string and the other is not (its encoding is US-ASCII), so it is treated as a binary string.

    "\\d".encoding        # => #<Encoding:UTF-8>
    /\d/.source.encoding  # => #<Encoding:US-ASCII>

You'll need to pass a UTF-8 string to contains:

    /\d/.source.force_encoding(Encoding::UTF_8)
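
To make the encoding difference concrete, here is a minimal plain-Ruby sketch (no Polars required) that compares the two strings and converts the regex source to UTF-8. The helper name `utf8_pattern` is hypothetical and not part of ruby-polars, and the sketch assumes the file uses Ruby's default UTF-8 source encoding.

```ruby
# A string literal takes the source file's encoding (UTF-8 by default).
literal = "\\d"

# The source of an ASCII-only regex literal is tagged US-ASCII.
source = /\d/.source

puts literal.encoding
puts source.encoding

# Hypothetical helper (an assumption, not a ruby-polars API):
# returns the regex source as a UTF-8 string.
def utf8_pattern(regexp)
  # encode returns a new string, leaving the original untouched
  regexp.source.encode(Encoding::UTF_8)
end

pattern = utf8_pattern(/\d/)
puts pattern.encoding
```

The resulting `pattern` holds the same characters as "\\d" and is tagged UTF-8, so it could be passed to `contains` in place of `/\d/.source`.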