datafuselabs / databend

๐——๐—ฎ๐˜๐—ฎ, ๐—”๐—ป๐—ฎ๐—น๐˜†๐˜๐—ถ๐—ฐ๐˜€ & ๐—”๐—œ. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com
Other
7.31k stars 704 forks

refactor: migrate TSV input format to new framework. #15506

Closed youngsofun closed 2 weeks ago

youngsofun commented 2 weeks ago

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

  1. No longer tries to read a large TSV file in parallel. This keeps the reader simple and guarantees that a row id is always available; rows can still be deserialized in parallel once the file is cut into RowBatches.
  2. Simplifies the handling of skipped header lines.
  3. Minimizes reallocation of file data.
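
The approach in the summary can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: the `RowBatch` struct, the function names, and the `skip_header` / `rows_per_batch` parameters are all hypothetical, and it assumes every raw newline ends a row (no escape handling). The file is scanned sequentially, so each batch knows the id of its first row, and only deserialization runs in parallel.

```rust
use std::thread;

/// Hypothetical batch type: a run of complete lines plus the id of its first row.
#[derive(Debug, Clone, PartialEq)]
struct RowBatch {
    start_row_id: usize,
    data: Vec<u8>,
}

/// Return the offset just past the line that starts at `pos`.
fn next_line_end(input: &[u8], pos: usize) -> usize {
    match input[pos..].iter().position(|&b| b == b'\n') {
        Some(i) => pos + i + 1,
        None => input.len(),
    }
}

/// Sequentially cut `input` into batches of at most `rows_per_batch` rows,
/// dropping the first `skip_header` lines. Row ids start at 0 after the header.
fn cut_batches(input: &[u8], skip_header: usize, rows_per_batch: usize) -> Vec<RowBatch> {
    assert!(rows_per_batch > 0, "rows_per_batch must be positive");
    let mut pos = 0;

    // Header lines are skipped once, up front, before any batch is cut.
    for _ in 0..skip_header {
        if pos >= input.len() {
            break;
        }
        pos = next_line_end(input, pos);
    }

    let mut batches = Vec::new();
    let mut row_id = 0;
    let mut batch_start = pos;
    let mut rows_in_batch = 0;
    while pos < input.len() {
        pos = next_line_end(input, pos);
        rows_in_batch += 1;
        if rows_in_batch == rows_per_batch {
            // One copy per batch; the batch owns its bytes from here on.
            batches.push(RowBatch { start_row_id: row_id, data: input[batch_start..pos].to_vec() });
            row_id += rows_in_batch;
            rows_in_batch = 0;
            batch_start = pos;
        }
    }
    if rows_in_batch > 0 {
        batches.push(RowBatch { start_row_id: row_id, data: input[batch_start..pos].to_vec() });
    }
    batches
}

/// Deserialize one batch into (row id, fields) pairs.
fn deserialize(batch: &RowBatch) -> Vec<(usize, Vec<String>)> {
    let text = String::from_utf8_lossy(&batch.data);
    text.lines()
        .enumerate()
        .map(|(i, line)| {
            let fields = line.split('\t').map(str::to_string).collect();
            (batch.start_row_id + i, fields)
        })
        .collect()
}

/// Cut sequentially, then deserialize each batch on its own thread.
/// Results are joined in batch order, so row ids come out ascending.
fn load_tsv(input: &[u8], skip_header: usize, rows_per_batch: usize) -> Vec<(usize, Vec<String>)> {
    let handles: Vec<_> = cut_batches(input, skip_header, rows_per_batch)
        .into_iter()
        .map(|batch| thread::spawn(move || deserialize(&batch)))
        .collect();
    handles
        .into_iter()
        .flat_map(|h| h.join().expect("deserializer thread panicked"))
        .collect()
}
```

For example, `load_tsv(b"id\tname\n1\ta\n2\tb\n3\tc\n", 1, 2)` skips the header line, cuts the remaining three rows into two batches, and yields rows with ids 0, 1, and 2. A real reader would use a bounded worker pool rather than one thread per batch.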



github-actions[bot] commented 2 weeks ago

Docker Image for PR

note: this image tag is only available for internal use, please check the internal doc for more details.

github-actions[bot] commented 2 weeks ago

ClickBench Report

youngsofun commented 2 weeks ago

image

Q2 loads a large TSV file and is slightly slower; I think the difference is negligible.