xiaodaigh closed this 4 years ago
Done all the changes. Please review. It's passing all tests locally on Julia 1.0.5 and 1.4.1. The resulting file can be read by both R and Python.
The remaining question is the little-endianness of the write. Since it passes all tests, I'd say merge it, and if it errors on non-little-endian machines, then people can raise a bug report?
Since it's passing tests on Travis CI, it's useful as it is for the majority of users.
Thanks for the update. I think it is not too difficult to take care of the endianness; I will try and suggest something. I think this can be merged soon, but I would like to take just a bit more time to mull over what the end-user API should be, and make some changes at this stage if we should.
Not sure why it's failing AppVeyor.
Edit: it's passing AppVeyor too now.
Keen to hear your ideas about endianness. I must confess I don't have a great deal of knowledge there; I don't know how Parquet plays with endianness. I always assume that I am working with little-endian machines.
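For context, Julia ships byte-order helpers in Base, so one possible way to make the writer endianness-safe (a sketch under the assumption that Parquet's plain encoding stores values little-endian; `write_le` is a made-up helper, not anything in this PR) is to route every fixed-width value through `htol`:

```julia
# Sketch only: make fixed-width integer writes byte-order independent.
# `htol` converts a value from host byte order to little-endian; on a
# little-endian machine it is a no-op, so the common case costs nothing.
write_le(io::IO, v::Base.BitInteger) = write(io, htol(v))

buf = IOBuffer()
write_le(buf, Int32(1))
bytes = take!(buf)
# bytes == UInt8[0x01, 0x00, 0x00, 0x00] on any host
```

On a big-endian host, `htol` byte-swaps the value before `write` emits it in host order, which yields the same little-endian bytes as on a little-endian host.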
I will try to do any changes you think of in the API as soon as I can.
Resolved conflict with master
Broken by #81 and #82.
Tried to accommodate #81, but #82 is definitely a bug, as R and Python can read the file OK.
I feel we should have a write API that allows data to be written iteratively. Something like:
```julia
parwriter = ParFileWriter(path)
for chunk in table_chunks   # chunk being a table itself
    write(parwriter, chunk)
end
```
Each iteration would add a row group to the file, and the file metadata can be updated at the end of the iteration. Looks like the underlying functions are already there here, so it should not be difficult. What do you think?
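To make the shape of that proposed API concrete, here is a minimal sketch; `ParFileWriter` and its fields are hypothetical names taken from the comment above, and the actual row-group encoding is elided:

```julia
# Hypothetical sketch of the proposed iterative writer (not the package's
# actual types): each `write` call appends one row group, and `close`
# writes the accumulated file metadata at the end.
mutable struct ParFileWriter
    io::IO
    nrowgroups::Int
end

ParFileWriter(path::AbstractString) = ParFileWriter(open(path, "w"), 0)

function Base.write(w::ParFileWriter, chunk)
    # encode `chunk` (a table) as one row group here -- elided in this sketch
    w.nrowgroups += 1
    return w
end

function Base.close(w::ParFileWriter)
    # write the footer / file metadata covering all row groups -- elided
    close(w.io)
end

path, io = mktemp(); close(io)   # temp file for the demo
w = ParFileWriter(path)
foreach(chunk -> write(w, chunk), [(a = [1, 2],), (a = [3, 4],)])
close(w)
# w.nrowgroups == 2: one row group per chunk
```

The nice property of this shape is that a "write the whole table" API falls out for free: it is just the loop above with a single chunk.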
I am OK with that, in addition to an API for writing a full table. Because the table_chunks
have to come from somewhere, for people to use it they would need a way to chunk tables, which is rarely done. I'd say 90% of use cases are ones where the user just provides a Table, so it's most straightforward to just write it.
How about this? We merge this one first, if there is no objection to the write API as it is, and then in a future release we write this chunk-based writer (to cover what I guess is just 10% of use cases).
A more serious issue is that the tests were broken by #82.
Merged the latest changes from master and passing all tests. If there are no further changes requested for the writer, can I suggest we merge this and implement the chunk-based writer in the next release?
:+1: will merge in a bit
Oops, should have squashed and merged. Maybe I should rebase on the master, else it will become too unwieldy?
Doesn't look possible now.
> Maybe I should rebase on the master, else it will become too unwieldy?
Feel free. To me it's not that big a deal, but it can make it harder to pin down an issue.
Seems to be failing due to `nonmissingtype`, which is not available in Base prior to Julia 1.3. Missings.jl is added to the deps for this reason. I added CategoricalArrays.jl purely so that I can use it to tell users it's not supported; please let me know if you think that's OK. Hope there are no further issues.
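For reference, this is the `nonmissingtype` behaviour the writer relies on (a minimal illustration, assuming Julia >= 1.3 where the function lives in Base; Missings.jl supplies the same function on older versions):

```julia
# nonmissingtype strips Missing from a type union, which is what the
# writer needs to pick a Parquet physical type for a column that may
# contain missing values.
@assert nonmissingtype(Union{Float64, Missing}) === Float64
@assert nonmissingtype(Float64) === Float64  # no-op when Missing is absent
```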
Update: It's passing all tests now.