blackrock / xml_to_parquet

Convert one or more XML files into Apache Parquet format. Only requires a XSD and XML file to get started.
Apache License 2.0

Block size issue with large files #2

Open varsha1288 opened 3 years ago

varsha1288 commented 3 years ago

I am trying to use the block size option to increase the block size, since I have a 50 MB file.

Can you please provide some input on what needs to be done?

I get the pyarrow error about straddling the block size.

Thanks !!

elibixby commented 3 years ago

FYI, I was getting the same error. It turns out there's a pretty obvious reason when you look at the code. XML is tree-based, whereas Parquet is columnar. The way this code serializes an XML file is essentially as a single row in a Parquet file, even if you restrict it to repeated elements using the XPath argument. This is not at all how I would expect a large XML file to be transferred to a database.

It's pretty easy to write a parser that processes individual elements into rows in a Parquet database though (and doesn't do the unnecessary step of JSON serialization in between).

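import xml.etree.ElementTree as ET

import pandas as pd
import pyarrow as pa


def parse(row_schema, xml_file):
    """Decode every element that matches row_schema into one table row."""
    rows = []
    # iterparse yields each element once its end tag has been read
    for _, element in ET.iterparse(xml_file):
        # row_schema is the schema component describing the repeated element
        if row_schema.is_valid(element):
            rows.append(row_schema.decode(element, validation='skip'))

    # Build the table once, after all rows have been collected
    return pa.Table.from_pandas(pd.DataFrame(rows))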

The trick is just to use the complex type in the schema object that defines the repeated element; that type determines the columns in your Parquet file (look for it in myschema.complex_types).
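One way to pick that complex type out of the schema could look like the sketch below. It assumes the schema object comes from the xmlschema library and exposes `complex_types` (a list of global complex types) with a `local_name` on each entry. The helper name `find_row_type` and the names in the usage comment are hypothetical.

```python
def find_row_type(schema, type_name):
    """Return the global complex type whose local name equals type_name.

    Assumes `schema.complex_types` is iterable and that each entry has a
    `local_name` attribute, as xmlschema schema objects are expected to.
    """
    for complex_type in schema.complex_types:
        if complex_type.local_name == type_name:
            return complex_type
    available = [t.local_name for t in schema.complex_types]
    raise LookupError(f"no complex type {type_name!r}; available: {available}")


# Hypothetical usage; "myschema.xsd", "PriceType" and "prices.xml" are placeholders:
#   import xmlschema
#   myschema = xmlschema.XMLSchema("myschema.xsd")
#   table = parse(find_row_type(myschema, "PriceType"), "prices.xml")
```

Raising with the list of available names makes it easier to spot a misspelled type name than a silent `None` would.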

davlee1972 commented 4 months ago

It's been a while since I've looked at this.

The normal use case is to pass in an XPath, which typically points at a repeating element, and each match gets converted to a Parquet row. -p XPATHS, --xpaths XPATHS

If no XPath is passed in, your entire XML document is parsed into a single Parquet row, which takes up a ton of memory. Storing columnar data in a single row would be a very odd use case.

The XML parser normally tosses XML elements out of memory when their end tag is reached.

If the XPath is set to /prices/price:

<prices>
<date_sent>2024-06-01</date_sent>
<price>123</price> <!-- tossed out of XML memory once converted into a Python row -->
<price>456</price> <!-- tossed out of XML memory once converted into a Python row -->
<price>789</price> <!-- tossed out of XML memory once converted into a Python row -->
</prices>

The Python rows are then converted into a pyarrow table and written to a columnar Parquet file.
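The behaviour described above can be approximated with the standard library alone. This is a minimal sketch, not the project's actual parser: it matches rows by tag name rather than by a full XPath, and `iter_rows` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(xml_source, row_tag):
    """Yield one dict per <row_tag> element, then release that element."""
    # "end" events fire once an element's closing tag has been parsed
    for _, element in ET.iterparse(xml_source, events=("end",)):
        if element.tag != row_tag:
            continue
        # Child elements become columns; a leaf element becomes a single column
        row = {child.tag: child.text for child in element}
        if not row:
            row = {element.tag: element.text}
        yield row
        # Drop the element's text and children now that it has been converted
        element.clear()


SAMPLE = b"""<prices>
<date_sent>2024-06-01</date_sent>
<price>123</price>
<price>456</price>
<price>789</price>
</prices>"""

rows = list(iter_rows(io.BytesIO(SAMPLE), "price"))
```

Only matched elements are cleared, so a row's children are still intact when the row itself is converted.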

Someday I'll probably implement some sort of counter to dump every x Python rows into a Parquet row group and append it to the Parquet file to free up Python memory.
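A rough sketch of what such a counter could look like, assuming rows arrive as an iterable of dicts and a reasonably recent pyarrow (one that provides `Table.from_pylist`). The function names and the default batch size are hypothetical, and this is not code from the project.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_in_row_groups(rows, parquet_path, batch_size=10_000):
    """Write rows to parquet_path, flushing each batch with its own write."""
    # Imported here so the batching helper stays usable without pyarrow
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    schema = None
    try:
        for batch in batched(rows, batch_size):
            # The first batch fixes the schema; later batches must conform to it
            table = pa.Table.from_pylist(batch, schema=schema)
            if writer is None:
                schema = table.schema
                writer = pq.ParquetWriter(parquet_path, schema)
            # Each write_table call adds at least one row group to the file
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
```

Because the schema is inferred from the first batch, a column that is entirely empty in that batch may get the wrong type. Passing an explicit schema would avoid that.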