NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through Rate (CTR) estimation training.
Apache License 2.0
905 stars · 196 forks

[Question] Can i read parquet data from HDFS? #443

Closed wangxingda closed 3 months ago

wangxingda commented 4 months ago

I recompiled HugeCTR with -DENABLE_HDFS=ON, and I get this error when I read Parquet data from HDFS:


[HCTR][07:19:46.939][ERROR][RK0][tid #139631458772544]: Runtime error: Library Dependency Error. Rebuild with Arrow::Parquet Library res (next_source @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/file_source_parquet.cpp:119)
[HCTR][07:19:46.939][ERROR][RK0][tid #139631458772544]: Runtime error: failed to read a file Error_t::BrokenFile (read_new_file @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/row_group_reading_thread.cpp:255)
[HCTR][07:19:46.966][ERROR][RK0][tid #139631441987136]: Runtime error: Library Dependency Error. Rebuild with Arrow::Parquet Library res (next_source @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/file_source_parquet.cpp:119)
[HCTR][07:19:46.966][ERROR][RK0][tid #139631441987136]: Runtime error: failed to read a file Error_t::BrokenFile (read_new_file @ /hugectr/build/HugeCTR/HugeCTR/src/data_readers/row_group_reading_thread.cpp:255)

JacoCheung commented 4 months ago

Hi @wangxingda , Thanks for trying HugeCTR with HDFS. We used to have a notebook sample demonstrating the usage of HDFS. Can you confirm that there exists a _metadata.json file in your dataset source folder? (follow the instructions of the notebook sample)

In addition, could you please post your CMake log here? I'd like to confirm whether the macro ENABLE_ARROW_PARQUET is defined.
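As a quick sanity check for the first point, the presence of the _metadata.json file can be verified with a short Python snippet (a minimal sketch; the dataset path below is a placeholder, not from the thread):

```python
import os


def has_hugectr_metadata(dataset_dir: str) -> bool:
    """Return True if the dataset folder contains the _metadata.json
    file that HugeCTR's Parquet reader expects."""
    return os.path.isfile(os.path.join(dataset_dir, "_metadata.json"))


# Placeholder path -- point this at your actual dataset source folder.
print(has_hugectr_metadata("./criteo_parquet/train"))
```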

wangxingda commented 4 months ago

@JacoCheung Thanks for your help. I just used the CMakeLists.txt from the main branch of the HugeCTR repo, and I can confirm that my metadata file exists.

Did you notice this line in CMakeLists.txt: if(Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS)? Does this mean that I cannot use both Parquet and HDFS at the same time?

The notebook seems to be out of date; I cannot run it successfully with both Parquet and HDFS.

JacoCheung commented 4 months ago

Hi @wangxingda , thanks for the reminder! There was a breaking change to remote reading in the v23.02 release, which is where the line if(Parquet_FOUND AND NOT ENABLE_HDFS AND NOT ENABLE_S3 AND NOT ENABLE_GCS) in CMakeLists.txt came into play.

Specifically, to optimize the reading process in HugeCTR, we need to know the row_group_size of all training data files (Parquet) in advance, before any actual data reading. The way we obtain that information is to use the Arrow Parquet reader to read the metadata of each Parquet file from the local filesystem.

Therefore, HDFS has been disabled since the v23.02 release. We should mark this as a known issue. Sorry for the inconvenience.

May I ask why HDFS support matters for your use case? Is it a toy trial or something more serious? If you need the HDFS feature in the short term, could you try a release prior to v23.02?

wangxingda commented 4 months ago

@JacoCheung Thanks. I plan to use HugeCTR in a production environment, and the training data is stored in HDFS. Does the HugeCTR team have a plan to support HDFS with the Parquet format? I would very much like to see this feature supported.

JacoCheung commented 4 months ago

Hi @wangxingda , thanks for your reply. Yes, we will certainly restore the remote I/O (HDFS) feature. As I mentioned, this will be tracked as an issue to be fixed; we're planning to refactor our data reader and resolve the HDFS problem as part of that. Until then, you can use a release prior to v23.02.

Thanks.

wangxingda commented 3 months ago

Thanks very much!