dotnet / corefxlab

This repo is for experimentation and exploring new ideas that may or may not make it into the main corefx repo.
MIT License

Update FromArrowRecordBatches for dotnet-spark #2978

Closed pgovind closed 4 years ago

pgovind commented 4 years ago

Two things are going on in this PR:

  1. Update FromArrowRecordBatch to handle a RecordBatch that contains a StructArray. The StructArray is flattened into regular DataFrame columns. Once this goes in, I'll open another PR to update the version number for MDA (Microsoft.Data.Analysis).
  2. Update the Arrow dependency to the latest version. This will prevent accidental "API not found" errors at runtime in the dotnet-spark repo.
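
To illustrate what "flattening" a struct column means in item 1, here is a minimal language-neutral sketch in Python. It is not the MDA implementation and does not use the Arrow API: the batch is modeled as a plain dict, a struct column as a nested dict of child columns, and the function name `flatten_struct_columns` and the `<struct>_<child>` naming are assumptions made for this example only.

```python
def flatten_struct_columns(batch):
    """Flatten one level of struct columns in a dict-of-columns batch.

    `batch` maps column names to either a list of values (a regular column)
    or a dict of child columns (a stand-in for a struct column).
    """
    flat = {}
    for name, column in batch.items():
        if isinstance(column, dict):
            # Struct column: promote each child to a top-level column.
            # The "<struct>_<child>" naming is an assumption for this sketch.
            for child_name, child_values in column.items():
                flat[f"{name}_{child_name}"] = list(child_values)
        else:
            # Regular column: copy through unchanged.
            flat[name] = list(column)
    return flat


# Hypothetical batch with one regular column and one struct column.
batch = {
    "id": [1, 2, 3],
    "point": {"x": [0.5, 1.5, 2.5], "y": [10, 20, 30]},
}
flat = flatten_struct_columns(batch)
print(list(flat))  # ['id', 'point_x', 'point_y']
```

The sketch only handles one level of nesting; a struct nested inside another struct would need a recursive variant.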

OLD: The following methods on DataFrameColumn are being made public:

  1. GetArrowField
  2. GetMaxRecordBatchLength
  3. ToArrowArray

These 3 methods are the ones we need to support Spark 3.0.

There is an argument to be made that these APIs should remain protected. The alternative is to update only the existing DataFrame.ToArrowRecordBatches() method so that it returns a Spark 3.0 compatible RecordBatch. Because dotnet-spark's dependencies on MDA are specified as exact versions, this should work, and no backend changes would be needed on the dotnet-spark side! Personally, I'm inclined to update DataFrame.ToArrowRecordBatches(), but I don't mind making these 3 methods public either.

pgovind commented 4 years ago

Now that we've determined that Spark is unlikely to need this new API, I think we can keep the Struct_childColumnName. Other than that, this PR should be good to go in.