MrPowers / farsante

Fake Pandas / PySpark DataFrame creator

Join datasets generation #8

Closed SemyonSinchenko closed 1 year ago

SemyonSinchenko commented 1 year ago

The main refactoring moves the generation logic into a separate file, generators.rs, which holds the core abstraction: the RowGenerator trait.

This trait has four implementations:

  1. GroupBy generator (mostly old code, just refactored)
  2. Join generator LHS
  3. Join generator RHS medium
  4. Join generator RHS small
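To make the design easier to picture, here is a minimal sketch of a trait with interchangeable implementations. It is not the actual code in generators.rs: the method names (`columns`, `generate_row`), the struct fields, the column names, and the `to_csv` helper are all assumptions made for illustration, and only two of the four generators are shown. Values are derived from the row index instead of a random source so the sketch has no dependencies.

```rust
/// Hypothetical shape of the trait; the real `RowGenerator` may differ.
trait RowGenerator {
    /// Column names of the generated table.
    fn columns(&self) -> Vec<&'static str>;
    /// Build the row at position `index` (0-based) as string fields.
    fn generate_row(&self, index: u64) -> Vec<String>;
}

/// Illustrative group-by table: a cyclic group key plus a value column.
struct GroupByGenerator {
    n_groups: u64,
}

impl RowGenerator for GroupByGenerator {
    fn columns(&self) -> Vec<&'static str> {
        vec!["id", "value"]
    }

    fn generate_row(&self, index: u64) -> Vec<String> {
        // The key cycles through `n_groups` distinct values.
        vec![format!("id{}", index % self.n_groups), index.to_string()]
    }
}

/// Illustrative left-hand side of a join: a join key plus a payload column.
struct JoinLhsGenerator {
    n_keys: u64,
}

impl RowGenerator for JoinLhsGenerator {
    fn columns(&self) -> Vec<&'static str> {
        vec!["key", "lhs_value"]
    }

    fn generate_row(&self, index: u64) -> Vec<String> {
        vec![(index % self.n_keys).to_string(), format!("l{}", index)]
    }
}

/// Render any generator as CSV lines. The writer depends only on the trait,
/// so adding a new dataset means adding a new implementation.
fn to_csv(generator: &dyn RowGenerator, n_rows: u64) -> Vec<String> {
    let mut lines = vec![generator.columns().join(",")];
    for i in 0..n_rows {
        lines.push(generator.generate_row(i).join(","));
    }
    lines
}
```

With this shape, a call such as `to_csv(&GroupByGenerator { n_groups: 3 }, 4)` returns a header line followed by four data lines, and the same function works unchanged for the join generators.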

The CLI argument descriptions were also improved slightly.

SemyonSinchenko commented 1 year ago

@MrPowers @jeffbrennan Hi guys! I'm finished here and would be happy to get any feedback from your reviews!

P.S. @MrPowers If you have an end-to-end pipeline for benchmarks, could you also test the output of the generators with it?

MrPowers commented 1 year ago

@SemyonSinchenko - this is sweet.

The next step is probably to figure out how to expose this Rust code via Python APIs. That's what the delta-rs project does. The code is written in Rust and the Python APIs are exposed via pyo3.

It would be awesome if the Python users of farsante could just access all these functions.