ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0
251 stars 109 forks

Support parquet input in local stats #783

Open DevinWu opened 4 months ago

DevinWu commented 4 months ago

Issue description: When I tested local stats with Parquet data as the raw input, it failed because Parquet was not supported, so I have added this functionality.

StatsModelProcessor calls AkkaStatsWorker to compute stats locally, and AkkaStatsWorker in turn calls ShifuFileUtils.getDataScanners to obtain Java Scanner objects over the user's raw input data. However, getDataScanners does not support the Parquet format as input, which broke local testing.

How Parquet data is loaded into the scanner:

  1. Read a Parquet Group from the Parquet file.
  2. Convert the Parquet Group to a String in CSV format, using the Shifu output data delimiter as the column separator.
  3. Wrap the String in a ByteArrayInputStream to serve as the input stream.
    These three steps run as the stream is read, so little data is cached in memory.
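
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in this change: a plain `Iterator<List<String>>` stands in for the Parquet Group records, the class and method names (`RecordStreamSketch`, `toDelimitedLine`, `toInputStream`, `toScanner`) are hypothetical, and the `|` delimiter in `main` is an assumed value. Each record is converted to one delimited line and wrapped in its own `ByteArrayInputStream` only when the reader reaches it, so only a record or so is held in memory at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch: records are plain lists of column values,
// standing in for Parquet Group objects read from a file.
final class RecordStreamSketch {

    // Step 2: join one record's column values with the delimiter, one record per line.
    // No quoting or escaping is done; values are assumed not to contain the delimiter.
    static String toDelimitedLine(List<String> record, String delimiter) {
        return String.join(delimiter, record) + "\n";
    }

    // Steps 1 and 3: pull records lazily and expose each one as a small
    // ByteArrayInputStream; SequenceInputStream chains them into one stream.
    static InputStream toInputStream(final Iterator<List<String>> records, final String delimiter) {
        Enumeration<InputStream> chunks = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return records.hasNext();
            }

            @Override
            public InputStream nextElement() {
                String line = toDelimitedLine(records.next(), delimiter);
                return new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8));
            }
        };
        return new SequenceInputStream(chunks);
    }

    // Wrap the chained stream in a Scanner, the type the stats code consumes.
    static Scanner toScanner(Iterator<List<String>> records, String delimiter) {
        return new Scanner(toInputStream(records, delimiter), StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) {
        List<List<String>> records = Arrays.asList(
                Arrays.asList("1", "a"),
                Arrays.asList("2", "b"));
        try (Scanner scanner = toScanner(records.iterator(), "|")) {
            while (scanner.hasNextLine()) {
                System.out.println(scanner.nextLine());
            }
        }
    }
}
```

In a real reader the iterator would be backed by the Parquet file, and the delimiter would come from the Shifu configuration rather than a literal.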

Tested with the stats step in ShifuCLITest; it runs successfully.
