apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Problem with a cat #2836

Open asfimport opened 10 months ago

asfimport commented 10 months ago

$ parquet cat train-00000-of-00001-15a05aeec7726f9d.parquet

Unknown error
shaded.parquet.org.apache.avro.SchemaParseException: Illegal character in: original-instruction
    at shaded.parquet.org.apache.avro.Schema.validateName(Schema.java:1607)
    at shaded.parquet.org.apache.avro.Schema.access$400(Schema.java:92)
    at shaded.parquet.org.apache.avro.Schema$Field.<init>(Schema.java:556)
    at shaded.parquet.org.apache.avro.Schema$Field.<init>(Schema.java:595)
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:295)
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:279)
    at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89)
    at org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405)
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66)
    at org.apache.parquet.cli.Main.run(Main.java:163)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.parquet.cli.Main.main(Main.java:193)

The data set in question is: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-en/tree/main/data

Reporter: Rémy Léone / @remyleone

Note: This issue was originally created as PARQUET-2378. Please see the migration documentation for further details.

asfimport commented 10 months ago

Gang Wu / @wgtmac: Thanks for reporting the issue! I can reproduce it on my end. Let me investigate it.

asfimport commented 10 months ago

Jiashen Zhang / @zhangjiashen:

This error is expected: '-' is not a valid character in an Avro name, so schema field names such as original-instruction are rejected. The exception is thrown from validateName in Schema.java (https://github.com/apache/avro/blob/branch-1.11/lang/java/avro/src/main/java/org/apache/avro/Schema.java). Could you please double-check?
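To illustrate the name rule referred to in the comment above, here is a minimal sketch that checks column names against the pattern given in the Avro specification (first character in [A-Za-z_], remaining characters in [A-Za-z0-9_]). The class AvroNameCheck and its method are hypothetical helpers written for this illustration; they are not part of Avro or parquet-java, and the real validateName may differ in detail between Avro versions.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an Avro or parquet-java class.
class AvroNameCheck {
    // Name rule as stated in the Avro specification:
    // starts with [A-Za-z_], followed by any number of [A-Za-z0-9_].
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    static boolean isValidName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Column names taken from the data set discussed in this issue.
        String[] columns = {"id", "category", "original-instruction", "new-context"};
        for (String column : columns) {
            System.out.println(column + " -> " + (isValidName(column) ? "valid" : "invalid"));
        }
    }
}
```

Under this rule, id and category pass, while every column containing '-' fails, which matches the "Illegal character in: original-instruction" message in the stack trace.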

asfimport commented 10 months ago

Gang Wu / @wgtmac: Can we get rid of the schema conversion via AvroSchemaConverter? This file is created by the C++ parquet writer from Apache Arrow. So it does not have to do the conversion.

asfimport commented 10 months ago

Jiashen Zhang / @zhangjiashen: The cat command supports multiple file formats (Parquet, Avro, text). For a Parquet file it internally needs to convert the MessageType to an Avro Schema, and Avro does not support '-' in names. What do you think about adding a new command that cats a Parquet file without converting to an Avro Schema, so that we can print the content directly? Below is a code sample:


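// Sketch: print Parquet records directly, without converting the schema to Avro.
// "input" is the path to the Parquet file; the method name is illustrative.
void catParquet(String input) throws IOException {
  ParquetReader<SimpleRecord> reader = null;
  try {
    PrintWriter writer = new PrintWriter(Main.out, true);
    reader = ParquetReader.builder(new SimpleReadSupport(), new Path(input)).build();
    // The footer schema and formatter are only needed for the JSON variant of the output.
    ParquetMetadata metadata = ParquetFileReader.readFooter(new Configuration(), new Path(input));
    JsonRecordFormatter.JsonGroupFormatter formatter = JsonRecordFormatter.fromSchema(metadata.getFileMetaData().getSchema());

    // Read records one by one until the reader returns null (end of file).
    for (SimpleRecord value = reader.read(); value != null; value = reader.read()) {
      value.prettyPrint(writer);
      writer.println();
    }
  } finally {
    if (reader != null) {
      try {
        reader.close();
      } catch (Exception ex) {
        // Ignore failures while closing the reader.
      }
    }
  }
}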

Output sample:


.......   

id = 15012
category = open_qa
original-instruction = What is the difference between a road bike and a mountain bike?
original-context = 
original-response = Road bikes are built to be ridden on asphalt and cement surfaces and have thin tires, whereas mountain bikes are built to be ridden on dirt and have wider tires. Road bikes also have more aerodynamic handle bars while mountain bike handle bars a built for less responsive steering while bouncing around off the road.
new-instruction:
.user_id:
..list:
.value:
..list:
...item = What is the difference between a road bike and a mountain bike?
.status:
..list:
...item = submitted
new-context:
.user_id:
..list:
.value:
..list:
...item = 
.status:
..list:
...item = submitted
new-response:
.user_id:
..list:
.value:
..list:
...item = Road bikes are built to be ridden on asphalt and cement surfaces and have thin tires, whereas mountain bikes are built to be ridden on dirt and have wider tires. Road bikes also have more aerodynamic handle bars while mountain bike handle bars a built for less responsive steering while bouncing around off the road.
.status:
..list:
...item = submitted

id = 15013
category = general_qa
original-instruction = How does GIS help in the real estate investment industry?
original-context = 
original-response = Real estate investors depend on precise, accurate location intelligence for competitive insights about the markets and locations where they do business. Real estate investment teams use GIS to bring together location-specific data, mapping, and visualization technology. This enables them to provide the latest insights about real estate markets and their investments, now and in the future. Using thousands of global datasets, investors can quickly understand how their real estate investments are performing across town or around the world, quickly access precise local data about real estate assets, on any device, anywhere, anytime, including information on occupancy, building maintenance, property valuation, and more.Real estate companies and investors use GIS to research markets, identify new opportunities for growth and expansion, and manage their investments at the market and neighborhood levels. They can also use GIS to create professional digital and printed materials—such as 3D renderings and virtual walk-throughs—to help market investments across platforms. Real estate investors can use mobile data collection tools to gather property information directly from the field and analyze and share insights across their organizations in real time. Investors can leverage precise local knowledge about their assets across geographies. GIS maps and dashboards help investors see, in real-time, relevant data that can affect properties, and streamline investment management with access to all relevant data about every asset in any portfolio.
new-instruction:
.user_id:
..list:
.value:
..list:
...item = How does GIS help in the real estate investment industry?
.status:
..list:
...item = submitted
new-context:
.user_id:
..list:
.value:
..list:
...item = 
.status:
..list:
...item = submitted
new-response:
.user_id:
..list:
.value:
..list:
...item = Real estate investors depend on precise, accurate location intelligence for competitive insights about the markets and locations where they do business. Real estate investment teams use GIS to bring together location-specific data, mapping, and visualization technology. This enables them to provide the latest insights about real estate markets and their investments, now and in the future. Using thousands of global datasets, investors can quickly understand how their real estate investments are performing across town or around the world, quickly access precise local data about real estate assets, on any device, anywhere, anytime, including information on occupancy, building maintenance, property valuation, and more.Real estate companies and investors use GIS to research markets, identify new opportunities for growth and expansion, and manage their investments at the market and neighborhood levels. They can also use GIS to create professional digital and printed materials—such as 3D renderings and virtual walk-throughs—to help market investments across platforms. Real estate investors can use mobile data collection tools to gather property information directly from the field and analyze and share insights across their organizations in real time. Investors can leverage precise local knowledge about their assets across geographies. GIS maps and dashboards help investors see, in real-time, relevant data that can affect properties, and streamline investment management with access to all relevant data about every asset in any portfolio.
.status:
..list:
...item = submitted

id = 15014
category = general_qa
original-instruction = What is the Masters?
original-context = 
original-response = The Masters Tournament is a golf tournament held annually in the first week of April at Augusta National Golf Club in Augusta, Georgia.  The Masters is one of four Major golf tournaments and the only one to be played at the same course every year.  The course is renowned for its iconic holes, impeccable groundskeeping, and colorful flowers that are typically in bloom.  The winner earns a coveted Green Jacket and a lifetime invitation back to compete.  Many players and fans consider The Masters to be their favorite tournament given these traditions and the historical moments that have occurred in past tournaments.
new-instruction:
.user_id:
..list:
.value:
..list:
...item = What is the Masters?
.status:
..list:
...item = submitted
new-context:
.user_id:
..list:
.value:
..list:
...item = 
.status:
..list:
...item = submitted
new-response:
.user_id:
..list:
.value:
..list:
...item = The Masters Tournament is a golf tournament held annually in the first week of April at Augusta National Golf Club in Augusta, Georgia.  The Masters is one of four Major golf tournaments and the only one to be played at the same course every year.  The course is renowned for its iconic holes, impeccable groundskeeping, and colorful flowers that are typically in bloom.  The winner earns a coveted Green Jacket and a lifetime invitation back to compete.  Many players and fans consider The Masters to be their favorite tournament given these traditions and the historical moments that have occurred in past tournaments.
.status:
..list:
...item = submitted

Or print in JSON format, for example:


{"id":"15012","category":"open_qa","original-instruction":"What is the difference between a road bike and a mountain bike?","original-context":"","original-response":"Road bikes are built to be ridden on asphalt and cement surfaces and have thin tires, whereas mountain bikes are built to be ridden on dirt and have wider tires. Road bikes also have more aerodynamic handle bars while mountain bike handle bars a built for less responsive steering while bouncing around off the road.","new-instruction":{"user_id":{"list":[{}]},"value":{"list":[{"item":"What is the difference between a road bike and a mountain bike?"}]},"status":{"list":[{"item":"submitted"}]}},"new-context":{"user_id":{"list":[{}]},"value":{"list":[{"item":""}]},"status":{"list":[{"item":"submitted"}]}},"new-response":{"user_id":{"list":[{}]},"value":{"list":[{"item":"Road bikes are built to be ridden on asphalt and cement surfaces and have thin tires, whereas mountain bikes are built to be ridden on dirt and have wider tires. Road bikes also have more aerodynamic handle bars while mountain bike handle bars a built for less responsive steering while bouncing around off the road."}]},"status":{"list":[{"item":"submitted"}]}}}
{"id":"15013","category":"general_qa","original-instruction":"How does GIS help in the real estate investment industry?","original-context":"","original-response":"Real estate investors depend on precise, accurate location intelligence for competitive insights about the markets and locations where they do business. Real estate investment teams use GIS to bring together location-specific data, mapping, and visualization technology. This enables them to provide the latest insights about real estate markets and their investments, now and in the future. Using thousands of global datasets, investors can quickly understand how their real estate investments are performing across town or around the world, quickly access precise local data about real estate assets, on any device, anywhere, anytime, including information on occupancy, building maintenance, property valuation, and more.\n\nReal estate companies and investors use GIS to research markets, identify new opportunities for growth and expansion, and manage their investments at the market and neighborhood levels. They can also use GIS to create professional digital and printed materials—such as 3D renderings and virtual walk-throughs—to help market investments across platforms. Real estate investors can use mobile data collection tools to gather property information directly from the field and analyze and share insights across their organizations in real time. Investors can leverage precise local knowledge about their assets across geographies. 
GIS maps and dashboards help investors see, in real-time, relevant data that can affect properties, and streamline investment management with access to all relevant data about every asset in any portfolio.","new-instruction":{"user_id":{"list":[{}]},"value":{"list":[{"item":"How does GIS help in the real estate investment industry?"}]},"status":{"list":[{"item":"submitted"}]}},"new-context":{"user_id":{"list":[{}]},"value":{"list":[{"item":""}]},"status":{"list":[{"item":"submitted"}]}},"new-response":{"user_id":{"list":[{}]},"value":{"list":[{"item":"Real estate investors depend on precise, accurate location intelligence for competitive insights about the markets and locations where they do business. Real estate investment teams use GIS to bring together location-specific data, mapping, and visualization technology. This enables them to provide the latest insights about real estate markets and their investments, now and in the future. Using thousands of global datasets, investors can quickly understand how their real estate investments are performing across town or around the world, quickly access precise local data about real estate assets, on any device, anywhere, anytime, including information on occupancy, building maintenance, property valuation, and more.\n\nReal estate companies and investors use GIS to research markets, identify new opportunities for growth and expansion, and manage their investments at the market and neighborhood levels. They can also use GIS to create professional digital and printed materials—such as 3D renderings and virtual walk-throughs—to help market investments across platforms. Real estate investors can use mobile data collection tools to gather property information directly from the field and analyze and share insights across their organizations in real time. Investors can leverage precise local knowledge about their assets across geographies. 
GIS maps and dashboards help investors see, in real-time, relevant data that can affect properties, and streamline investment management with access to all relevant data about every asset in any portfolio."}]},"status":{"list":[{"item":"submitted"}]}}}
{"id":"15014","category":"general_qa","original-instruction":"What is the Masters?","original-context":"","original-response":"The Masters Tournament is a golf tournament held annually in the first week of April at Augusta National Golf Club in Augusta, Georgia.  The Masters is one of four Major golf tournaments and the only one to be played at the same course every year.  The course is renowned for its iconic holes, impeccable groundskeeping, and colorful flowers that are typically in bloom.  The winner earns a coveted Green Jacket and a lifetime invitation back to compete.  Many players and fans consider The Masters to be their favorite tournament given these traditions and the historical moments that have occurred in past tournaments.","new-instruction":{"user_id":{"list":[{}]},"value":{"list":[{"item":"What is the Masters?"}]},"status":{"list":[{"item":"submitted"}]}},"new-context":{"user_id":{"list":[{}]},"value":{"list":[{"item":""}]},"status":{"list":[{"item":"submitted"}]}},"new-response":{"user_id":{"list":[{}]},"value":{"list":[{"item":"The Masters Tournament is a golf tournament held annually in the first week of April at Augusta National Golf Club in Augusta, Georgia.  The Masters is one of four Major golf tournaments and the only one to be played at the same course every year.  The course is renowned for its iconic holes, impeccable groundskeeping, and colorful flowers that are typically in bloom.  The winner earns a coveted Green Jacket and a lifetime invitation back to compete.  Many players and fans consider The Masters to be their favorite tournament given these traditions and the historical moments that have occurred in past tournaments."}]},"status":{"list":[{"item":"submitted"}]}}} 

asfimport commented 9 months ago

Gang Wu / @wgtmac: Sorry for the late reply. I'm not sure it is a good idea to add a new command. Since both the head and cat commands have the same issue, can we try/catch the exception and use the JsonRecordFormatter you proposed as a fallback if the Avro schema conversion fails?
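
A rough sketch of the try/catch fallback idea from the comment above, using stand-in methods rather than the real parquet-cli code. Everything here (FallbackFormatter, avroStyle, rawStyle, format) is hypothetical and exists only to show the control flow. In the actual commands the exception to catch would presumably be the SchemaParseException from the stack trace, and the fallback would be the JsonRecordFormatter path.

```java
// Hypothetical illustration of "try the Avro path, fall back on failure".
// None of these names exist in parquet-cli.
class FallbackFormatter {
    // Stand-in for the Avro-based path: rejects names Avro would not accept.
    static String avroStyle(String fieldName, String value) {
        if (!fieldName.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Illegal character in: " + fieldName);
        }
        return "{\"" + fieldName + "\": \"" + value + "\"}";
    }

    // Stand-in for a schema-agnostic printer that accepts any field name.
    static String rawStyle(String fieldName, String value) {
        return fieldName + " = " + value;
    }

    // Try the primary path first; use the fallback only when conversion fails.
    static String format(String fieldName, String value) {
        try {
            return avroStyle(fieldName, value);
        } catch (IllegalArgumentException e) {
            return rawStyle(fieldName, value);
        }
    }

    public static void main(String[] args) {
        System.out.println(format("id", "15012"));
        System.out.println(format("original-instruction", "What is the Masters?"));
    }
}
```

The appeal of this design is that files with Avro-compatible schemas keep their current output, and only files that fail conversion take the alternative path, so no new command is needed.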