JuliaIO / Parquet.jl

Julia implementation of Parquet columnar file format reader

Parquet Writer #66

Closed · xiaodaigh closed this 4 years ago

xiaodaigh commented 4 years ago

> Thanks for initiating this. This seems like a good start. Any idea why the CI has failed? I had a quick look and have added a few suggestions. Will be looking at it more in the next few days.

Seems to be failing due to `nonmissingtype`, which is not available in Base prior to Julia 1.3. Fixed now; hope there are no further issues. Missings.jl has been added to the deps for this reason.
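
For context, a minimal illustration of the compatibility gap (assuming nothing beyond Base and Missings.jl):

using Missings  # provides `nonmissingtype` on Julia < 1.3; Base exports it from 1.3 onward

nonmissingtype(Union{Int, Missing})  # Int
nonmissingtype(Int)                  # Int (unchanged when Missing is absent)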

I added CategoricalArrays.jl purely so I can use it to tell users it's not supported; please let me know if you think that's OK.

Update: It's passing all tests now.

xiaodaigh commented 4 years ago

Done with all the changes. Please review. It's passing all tests locally on Julia 1.0.5 and 1.4.1, and the resulting file can be read by R and Python.

xiaodaigh commented 4 years ago

The remaining question is the little-endianness of the write. Since it passes all tests, I'd say merge it, and if it errors on non-little-endian machines people can raise a bug report?

Since it's passing tests on Travis CI, it's useful as it is for the majority of users.

tanmaykm commented 4 years ago

Thanks for the update. I think it is not too difficult to take care of the endianness; I will try and suggest something. This can be merged soon, but I would like to take just a bit more time to mull over what the end-user API should be, and make some changes at this stage if needed.

xiaodaigh commented 4 years ago

Not sure why it's failing on AppVeyor.

Edit

It's passing AppVeyor too now.

Keen to hear your ideas about endianness. I must confess I don't have a great deal of knowledge there; I don't know how Parquet plays with endianness, and I always assume that I am working on little-endian machines.
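
For what it's worth, the Parquet format specifies little-endian byte order for plain-encoded values, so one way to keep the write portable is to byte-swap explicitly. A minimal sketch; `write_le` is a hypothetical helper, not something in this PR:

# Base.htol is a no-op on little-endian hosts and a byte swap on
# big-endian ones, so this writes the same bytes on any machine.
write_le(io::IO, v) = write(io, htol(v))

io = IOBuffer()
write_le(io, Int32(42))   # always writes 0x2a 0x00 0x00 0x00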

I will try to make any API changes you think of as soon as I can.

xiaodaigh commented 4 years ago

Resolved conflicts with master.

xiaodaigh commented 4 years ago

Broken by #81 and #82.

Tried to accommodate #81, but #82 is definitely a bug, as R and Python can read the file OK.

tanmaykm commented 4 years ago

I feel we should have a write API that allows data to be written iteratively. Something like:

parwriter = ParFileWriter(path)

for chunk in table_chunks # chunk being a table itself
    write(parwriter, chunk)
end

Each iteration would add a row group to the file, and the file metadata can be updated at the end of the iteration. Looks like the underlying functions are already there, so it should not be difficult. What do you think?

xiaodaigh commented 4 years ago

> What do you think?

I am OK with that, in addition to an API for writing a full table. The `table_chunks` have to come from somewhere, so for people to use it they would need a way to chunk tables, which is rarely done. I'd say 90% of use cases are ones where the user just provides a table, so it's most straightforward to just write it.

How about this? We merge this one first, if there is no objection to the write API as it is, and then in a future release we write the chunk-based writer (to cover what I guess is just 10% of use cases).
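
As an illustration of how the two could coexist (a sketch only, reusing the hypothetical ParFileWriter from the proposal above and assuming a close that finalizes the footer): the full-table API could be a thin wrapper that writes the whole table as a single row group.

# Hypothetical sketch; `ParFileWriter` and `close` are assumptions from
# the proposed API above, not functions that exist in this PR.
function write_parquet(path, table)
    parwriter = ParFileWriter(path)
    write(parwriter, table)   # the whole table becomes one row group
    close(parwriter)          # finalize and write the file metadata
end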

xiaodaigh commented 4 years ago

A more serious issue is that the tests were broken by #82.

xiaodaigh commented 4 years ago

Merged the latest changes from master, and it's passing all tests. If there are no further changes requested for the writer, can I suggest we merge this and implement the chunk-based writer in the next release?

tanmaykm commented 4 years ago

:+1: will merge in a bit

tanmaykm commented 4 years ago

Oops, I should have squashed and merged. Maybe I should rebase on master, else it will become too unwieldy?

tanmaykm commented 4 years ago

Doesn't look possible now.

xiaodaigh commented 4 years ago

> Maybe I should rebase on master, else it will become too unwieldy?

Feel free. To me it's not that big a deal, but it can make it harder to pin down an issue.