Closed: xiaodaigh closed this 4 years ago
Hmm, I wonder if it needs to be merged with #51 as well. Let me try it once I get back.
I see, the files can't be read by the ParquetFiles implementation, which relies on some row-wise iteration cursor (not 100% sure). The codebase will take some time to digest, and I am not sure it's worth it: Parquet is a columnar format, so reading the data by row-based iteration seems to defeat the purpose a little. Perhaps the original author can comment.
Bottom line, I will not spend time and energy fixing the row iteration approach. Instead, I will work on putting the column-based approach that works in Diban.jl back into Parquet.jl, but I will also release Diban.jl if the PR to Parquet.jl takes too long.
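To make the columnar-versus-row point concrete, here is a toy sketch in plain Python. It is only an illustration under simplifying assumptions: the `columns` table and the helpers `read_column` and `iterate_rows` are hypothetical names invented for this example, and nothing here reflects the actual Parquet.jl, ParquetFiles.jl, or Diban.jl code.

```python
# Toy illustration only: plain Python containers, not any real Parquet reader API.
# All names below (columns, read_column, iterate_rows) are hypothetical.

# A columnar layout keeps each column's values stored together.
columns = {
    "id": [1, 2, 3],
    "price": [9.5, 12.0, 7.25],
    "label": ["a", "b", "c"],
}


def read_column(table, name):
    """Column-wise read: touch only the requested column."""
    return list(table[name])


def iterate_rows(table):
    """Row-wise read: every column is consulted to assemble each row."""
    names = list(table)
    n_rows = len(table[names[0]])
    for i in range(n_rows):
        yield {name: table[name][i] for name in names}


# Both approaches recover the same values, but the row iterator builds a
# full record per row even when only one column is wanted.
prices_columnar = read_column(columns, "price")
prices_via_rows = [row["price"] for row in iterate_rows(columns)]
```

In this sketch the column-wise read never looks at `id` or `label`, while the row iterator has to visit all three columns for every row, which is roughly the overhead the comment above is worried about.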
I do think we should only merge PRs here that don't break downstream packages, so keeping the row streaming stuff intact would be good.
Yeah, I will try not to touch the row iterator thing and try to update the column-wise read.
The thing is, if it's broken for 70% of the files I tested, are there many people depending on it, and if so, why?
Thanks @xiaodaigh. I can confirm the changes look fine.
It would be good to have two clean, separate commits: one with the changes to update the thrift specs, and the other with the fix to reader.jl. Also, add a test for this condition, maybe using the same synthetic_data.parquet. I will wait a bit, in case you would like to make those changes.
Ok. I will break it into two
:+1: Just some rebase to get two separate commits in this PR would be fine too.
Done
see #64 #63
This doesn't seem to work using the synthetic data file, which I think should be added to the test suite and tested against (either here or downstream in ParquetFiles.jl).
My code:
The error: