dmlc / dmlc-core

A common bricks library for building scalable and portable distributed machine learning.
Apache License 2.0

[New dependencies required] Support parquet #650

Closed bridgream closed 3 years ago

bridgream commented 3 years ago

I've added Parquet file support to dmlc-core, to enable external-memory support for Parquet files in XGBoost. I tested the code within XGBoost by training two models with identical parameters, one on CSV input and one on Parquet input. Both models produce identical predictions on the same test data. (Unit test code is not included in this pull request.)

However, my implementation depends on Apache Arrow Parquet. I plan to make Parquet support optional, but the parsers are registered in src/data.cc, and because that registry is static, I don't know how to register the new parser conditionally without affecting existing code. Can anybody offer advice?

Thanks in advance!

@PeterPanOnGit @trivialfis

trivialfis commented 3 years ago

?

trivialfis commented 3 years ago

Thanks for working on this. I haven't looked at your code yet, but I've put it on my to-do list for Saturday, if you're still interested in it.

bridgream commented 3 years ago

@trivialfis thank you for your reply! I've re-opened the pull request and made Parquet support optional (disabled by default, so it should not affect users who do not need this feature). Would you please move to that PR?
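
For readers wondering how a statically registered parser can be made optional, here is a minimal sketch of one common approach: guarding the registration with a compile-time flag. It does not use dmlc-core's actual registry code. `Parser`, `ParserRegistry`, `ParserRegistrar`, `CreateParser`, and the `EXAMPLE_USE_PARQUET` macro are all hypothetical names chosen for illustration, and the Parquet parser body is a placeholder rather than real Arrow code.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical parser interface; a stand-in, not dmlc-core's real class.
struct Parser {
  virtual ~Parser() = default;
  virtual std::string Name() const = 0;
};

using ParserFactory = std::function<std::unique_ptr<Parser>()>;

// Function-local static: constructed on first use, so registrars that run
// during static initialization always see a valid map.
inline std::map<std::string, ParserFactory>& ParserRegistry() {
  static std::map<std::string, ParserFactory> registry;
  return registry;
}

// Constructing one of these at namespace scope registers a factory.
struct ParserRegistrar {
  ParserRegistrar(const std::string& format, ParserFactory factory) {
    ParserRegistry()[format] = std::move(factory);
  }
};

// Always-available parser.
struct CSVParser : Parser {
  std::string Name() const override { return "csv"; }
};
static ParserRegistrar csv_registrar("csv", []() {
  return std::unique_ptr<Parser>(new CSVParser());
});

// Optional parser: compiled and registered only when the (hypothetical)
// EXAMPLE_USE_PARQUET flag is defined, e.g. via -DEXAMPLE_USE_PARQUET.
// The headers of the extra dependency would be included inside this guard.
#ifdef EXAMPLE_USE_PARQUET
struct ParquetParser : Parser {
  std::string Name() const override { return "parquet"; }
};
static ParserRegistrar parquet_registrar("parquet", []() {
  return std::unique_ptr<Parser>(new ParquetParser());
});
#endif

// Returns nullptr when the requested format was not compiled in.
inline std::unique_ptr<Parser> CreateParser(const std::string& format) {
  auto it = ParserRegistry().find(format);
  if (it == ParserRegistry().end()) return nullptr;
  return it->second();
}
```

In this sketch, a build without the flag never sees the Parquet code or its dependency, so `CreateParser("parquet")` returns a null pointer while `CreateParser("csv")` behaves as before. A build system option that defaults to off could define the flag for users who opt in.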

trivialfis commented 3 years ago

Yup, also @hcho3