apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

Interpretation of the underlying logic of pyarrow manipulating Hive #9026

Closed: svjack closed this issue 3 years ago

svjack commented 3 years ago

I used pyarrow to handle HDFS files produced by Hive, and I reviewed pyarrow's source code. The main utilities for the HDFS filesystem are the Parquet-related functions, with rich IO and metadata/schema inference. The other option is the plain read functions, i.e. reading a file as text to work with text files on HDFS. As far as I know, when I create a table in Hive the default storage format is text, and when I use HadoopFileSystem to go down to the table's actual path in HDFS, it seems the table's schema and metadata (and the automatic parsing of delimited lines) cannot be retrieved through the internal API.

I don't want to use SQL tools such as PyHive, because that turns this into a "two source" problem (one source being abstract SQL, the other the plain file system), even if it is simple. So at present I have to use pd.read_csv with the file object returned by fs.open, and retrieve the schema info from the TBLS table in MySQL, where the Hive metastore actually keeps the detailed schema. I don't think this design is ideal. So my question is: did I miss some detail of pyarrow's underlying logic related to the HDFS file system and Hive? Please explain it for me. This is all about pyarrow's internal construction, not about other frameworks.

I would also like a brief introduction to the dataset API's support for Hive's Parquet files and text files. Can you give me some examples, mainly for the text storage format in Hive's HDFS layout?

I also took a look at a data transfer toolkit called Sqoop; in its AppendUtils.java it uses some detailed partition-manipulation utilities to perform data appends, and I think all of those functions could be rebuilt with pyarrow. But as I reviewed the pyarrow source code, I could not find any developed logic for "partition" and "warehouse" manipulation. Has anyone built projects on top of pyarrow, or Arrow's other APIs, that implement these functions?
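For reference, a minimal sketch of how the pyarrow dataset API can be pointed at a Hive-style warehouse directory on HDFS. It assumes a recent pyarrow; the namenode host and port, warehouse paths, column names, and partition column below are placeholders, not values taken from this issue:

```python
import pyarrow.dataset as ds
import pyarrow.csv as pcsv
from pyarrow import fs

# Connect to HDFS (host/port are placeholders; requires a configured Hadoop client).
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# A Hive table stored as Parquet: key=value partition directories such as
# year=2020/ are discovered automatically with partitioning="hive".
parquet_ds = ds.dataset(
    "/user/hive/warehouse/mydb.db/parquet_table",
    format="parquet",
    filesystem=hdfs,
    partitioning="hive",
)
table = parquet_ds.to_table(filter=ds.field("year") == 2020)

# A Hive table stored as text: the dataset API can read it as CSV, but the
# column names and delimiter must be supplied by hand because pyarrow does
# not consult the Hive metastore. Hive's default field delimiter is \x01 (Ctrl-A),
# and the data files have no header row.
text_format = ds.CsvFileFormat(
    parse_options=pcsv.ParseOptions(delimiter="\x01"),
    read_options=pcsv.ReadOptions(column_names=["id", "name", "amount"]),
)
text_ds = ds.dataset(
    "/user/hive/warehouse/mydb.db/text_table",
    format=text_format,
    filesystem=hdfs,
    partitioning="hive",
)
text_table = text_ds.to_table()
```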

jorisvandenbossche commented 3 years ago

So at present I have to use pd.read_csv with the file object returned by fs.open, and retrieve the schema info from the TBLS table in MySQL, where the Hive metastore actually keeps the detailed schema. I don't think this design is ideal.

Arrow doesn't have functionality to natively interact with or understand Hive metastores. So if you have a CSV file stored, and you want to read it following the schema stored in the Hive metastore, then at the moment you will always need to do something manual like what you described above.
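A minimal sketch of that manual approach using pyarrow's own CSV reader instead of pd.read_csv, assuming the schema has already been fetched from the metastore out of band; the host, port, file path, column names, and types below are placeholders:

```python
import pyarrow as pa
import pyarrow.csv as pcsv
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)  # placeholder host/port

# Schema obtained out of band, e.g. from the Hive metastore (placeholder columns).
schema = pa.schema([("id", pa.int64()), ("name", pa.string()), ("amount", pa.float64())])

# A Hive text-format data file has no header row and uses \x01 as the field
# delimiter by default, so both column names and types are supplied explicitly.
with hdfs.open_input_stream("/user/hive/warehouse/mydb.db/text_table/000000_0") as f:
    table = pcsv.read_csv(
        f,
        read_options=pcsv.ReadOptions(column_names=schema.names),
        parse_options=pcsv.ParseOptions(delimiter="\x01"),
        convert_options=pcsv.ConvertOptions(
            column_types={field.name: field.type for field in schema}
        ),
    )
```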