constellation-rs / amadeus

Harmonious distributed data analysis in Rust.
https://constellation.rs/amadeus
Apache License 2.0
473 stars 26 forks source link

Examples under amadeus-parquet have some problems #22

Open vertexclique opened 4 years ago

vertexclique commented 4 years ago

Hi!

I was gazing to the parquet parser example. Examples under README.md https://github.com/constellation-rs/amadeus/blob/master/amadeus-parquet/src/README.md has some problems. I couldn't make them run.

I have a couple of suggestions for the parquet crate:

Where I simply want a structure in an array and java's arrow serializer serialized the structs with the name of array and didn't add yet another nested level. If you add a guide for this nested structure that would be nice.

alecmocatta commented 4 years ago

Hi Mahmut and thanks for the issue!

Apologies I’m on my phone on holiday this week but I’ll try my best to respond.

  • There is no standalone parquet reader at the outside, would be really nice to have this parquet reader as a separate non-dependent package.

Agreed; I think standalone parquet reader interfaces might just about exist in the amadeus-parquet crate - I’ll fix its docs.rs build this weekend, and adding examples has been on my TODO list for a while! https://github.com/constellation-rs/amadeus/blob/master/amadeus-parquet/src/README.md is in dire need of updating, I’ll get to that when I can.

The List data type should be usable to read any of the various back-compat variations of lists, which I believe includes what you’re describing. It should work for List<MyStruct> where MyStruct is a struct that derives Data. If that doesn’t work then it’s most likely a bug in the parse_list method here. Printing the expected schema (which I guess is happening for you as part of an error message?) will, IIRC, print the “normal” parquet list schema rather than the also valid potential “compat” schemas - which is potentially confusing matters here?

Are you able to include the full error you’re seeing, and if possible a code sample, and I’ll do my best to resolve!

On Fri, 13 Dec 2019 at 17:53, Mahmut Bulut notifications@github.com wrote:

Hi!

I was gazing to the parquet parser example. Examples under README.md https://github.com/constellation-rs/amadeus/blob/master/amadeus-parquet/src/README.md has some problems. I couldn't make them run.

I have a couple of suggestions for the parquet crate:

  • There is no standalone parquet reader at the outside, would be really nice to have this parquet reader as a separate non-dependent package.
  • some serdes are creating different names for List. For example with my structures Data derivation proc-macro does this: I got this:

    REQUIRED group events (LIST) { REPEATED group list { REQUIRED group element {

where I expect this:

REQUIRED group events (LIST) { REPEATED group array {

Where I simply want a structure in an array and java's arrow serializer serialized the structs with the name of array and didn't add yet another nested level. If you add a guide for this nested structure that would be nice.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/constellation-rs/amadeus/issues/22?email_source=notifications&email_token=AAIVM5XJFQTMWEIQWKETZX3QYO4YDA5CNFSM4J2QIJX2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IAL4BHQ, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIVM5W2ALSF7OXF6AT3J6DQYO4YDANCNFSM4J2QIJXQ .

vertexclique commented 4 years ago

Hi Alec;

Thanks for your response. That is very helpful. Let me add my observations here:

Printing the expected schema (which I guess is happening for you as part of an error message?) will, IIRC, print the “normal” parquet list schema rather than the also valid potential “compat” schemas - which is potentially confusing matters here?

That is solved. Even the field names are different at the nested level like I've showed in my first comment. If a person comes with having this nested issue. Here is the solution because the current macro is generating flat levels(which is good don't worry):

#[derive(Data, Debug, Clone)]
pub struct Event {
    pub event_id: Option<String>
}

#[derive(Data, Debug, Clone)]
pub struct Events {
    pub array: List<Event>
}

#[derive(Data, Debug, Clone)]
pub struct EventStream {
    pub events: Events,
}

This solved my problems and you can discard my comment over it.

BUT I have one more important thing to comment on because that is very important in some analytical workloads. The columnar format readers are sometimes meant to not read the whole schema. I saw that you have implemented the argument taking for the subset (it is called projection in code):

    /// Creates row iterator for all row groups in a file.
    pub fn from_file(_proj: Option<Predicate>, reader: R) -> Result<Self> {
        let file_schema = reader.metadata().file_metadata().schema_descr_ptr();
        let file_schema = file_schema.root_schema();
        let schema = <Root<T> as ParquetData>::parse(file_schema, None)?.1;

        Ok(Self::new(Some(reader), None, schema))
    }

Is it possible to have this predicate pushdown? If this is implemented (or I can help in implementation) I will abandon the original implementation of parquet.

alecmocatta commented 4 years ago

I’m glad you’ve got it working! FWIW for the schema you wrote before I would also expect the following to work:

[derive(Data, Debug, Clone)]

pub struct Event { pub event_id: Option }

[derive(Data, Debug, Clone)]

pub struct EventStream { pub events: List, }

I’ll have a look into it on my return.

Predicate pushdown would be really cool. IIRC the statistics used for it are being read, though not used. I think my intention was possibly adding an associated type to the ParquetData trait that can store predicates. This would be replace Predicate in the API, and be tested against the statistics before reading each column chunk and page.

I’ll have a look into it this weekend and perhaps we can hash out the design on chat.

On Tue, 17 Dec 2019 at 12:01, Mahmut Bulut notifications@github.com wrote:

Hi Alec;

Thanks for your response. That is very helpful. Let me add my observations here:

Printing the expected schema (which I guess is happening for you as part of an error message?) will, IIRC, print the “normal” parquet list schema rather than the also valid potential “compat” schemas - which is potentially confusing matters here?

That is solved. Even the field names are different at the nested level like I've showed in my first comment. If a person comes with this nested issue here is the solution because current macro is generating flat levels:

[derive(Data, Debug, Clone)]

pub struct Event {

pub event_id: Option<String>

}

[derive(Data, Debug, Clone)]

pub struct Events {

pub array: List<Event>

}

[derive(Data, Debug, Clone)]

pub struct EventStream {

pub events: Events,

}

This solved my problems and you can discard my comment over it.

BUT I have one more important thing to comment on because that is very important in some analytical workloads. The columnar format readers are sometimes meant to not read the whole schema. I saw that you have implemented the argument taking for the subset (it is called projection in code):

/// Creates row iterator for all row groups in a file.

pub fn from_file(_proj: Option, reader: R) -> Result {

  let file_schema = reader.metadata().file_metadata().schema_descr_ptr();

  let file_schema = file_schema.root_schema();

  let schema = <Root<T> as ParquetData>::parse(file_schema, None)?.1;

  Ok(Self::new(Some(reader), None, schema))

}

Is it possible to have this predicate pushdown? If this is implemented (or I can help in implementation) I will abandon the original implementation.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/constellation-rs/amadeus/issues/22?email_source=notifications&email_token=AAIVM5VFUMU3UYLHKLGEGQTQZCWSBA5CNFSM4J2QIJX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHCAA7A#issuecomment-566493308, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIVM5VR6UPSTYXNB5CUD6LQZCWSBANCNFSM4J2QIJXQ .

vertexclique commented 4 years ago

Would be cool, I will jump to chat soon.