cube2222 / octosql

OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
Mozilla Public License 2.0

bufio.Scanner: token too long #299

Closed: chapmanjacobd closed this issue 1 year ago

chapmanjacobd commented 1 year ago

I'm running out of vespene gas or something

$ wget https://files.pushshift.io/reddit/submissions/RS_2022-08.zst
$ unzstd --memory=2048MB --stdout RS_2022-08.zst | octosql "SELECT count(*) FROM stdin.json" -o csv
...
Error: couldn't run query: couldn't run source: couldn't run source: bufio.Scanner: token too long

sad :'(

The great octopus god is able to work with this other, smaller file in 110.6s:

$ unzstd --memory=2048MB --stdout RS_2021-08.zst | octosql "SELECT count(*) FROM stdin.json" -o csv
count
28384220

It doesn't use much RAM with either file, so I'm not sure what's up :? Both are similar-ish in size: 7.8GB vs 10GB compressed, maybe ~200GB uncompressed.

cube2222 commented 1 year ago

Hey, it's actually the line size that's the problem (it's limited to 1MB right now), but I'm happy to add a config option for this.
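Under the hood this is Go's bufio.Scanner behavior. A minimal standalone sketch (not OctoSQL's actual code) that reproduces the error and shows how sc.Buffer raises the cap:

package main

import (
    "bufio"
    "fmt"
    "strings"
)

func main() {
    // A single "line" bigger than the scanner's cap reproduces the error.
    longLine := strings.Repeat("x", 2<<20) // 2MB, no trailing newline

    sc := bufio.NewScanner(strings.NewReader(longLine))
    sc.Buffer(nil, 1<<20) // 1MB cap, matching OctoSQL's current limit
    for sc.Scan() {
    }
    fmt.Println(sc.Err()) // bufio.Scanner: token too long

    // Raising the cap lets the same input through.
    sc = bufio.NewScanner(strings.NewReader(longLine))
    sc.Buffer(nil, 32<<20) // 32MB cap
    for sc.Scan() {
    }
    fmt.Println(sc.Err()) // <nil>
}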

cube2222 commented 1 year ago

This has now been added in https://github.com/cube2222/octosql/commit/6644557f8cce2e7231b201581c98a9519d2ae132 and released in 0.11.1.

You are now able to configure the maximum line size in your ~/.octosql/octosql.yml file:

databases:
  # ...
files:
  json:
    max_line_size_bytes: 33554432
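(33554432 bytes is 32 MiB; any single JSON line longer than the configured value will still fail with the same error, so size it to your longest expected line.)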

Thanks for the report!

DeluxeOwl commented 1 month ago


Hi @cube2222, this doesn't actually work (I don't think the context is passed properly).

I've added some printing in cmd/root.go:

fmt.Printf("%+v\n", cfg)
ctx = config.ContextWithConfig(ctx, cfg)
fmt.Printf("%+v\n", ctx)

And in datasources/json/execution.go:

func (d *DatasourceExecuting) Run(ctx ExecutionContext, produce ProduceFn, metaSend MetaSendFn) error {
    fmt.Printf("from json.Run, ctx: %+v\n", ctx)

    f, err := files.OpenLocalFile(ctx, d.path, files.WithTail(d.tail))
    if err != nil {
        return fmt.Errorf("couldn't open local file: %w", err)
    }
    defer f.Close()

    sc := bufio.NewScanner(f)

    sc.Buffer(nil, config.FromContext(ctx).Files.JSON.MaxLineSizeBytes)
    fmt.Printf("from json.Run, config from context: %+v\n", config.FromContext(ctx))

And it doesn't seem like it's doing anything:

$ ./octosql/main "select * from nat_rules.json"  --describe
from root.go, config: &{Databases:[] Files:{JSON:{MaxLineSizeBytes:33554432} BufferSizeBytes:4194304}}
from root.go, context: context.Background.WithCancel.WithValue(config.contextKey, *config.Config)
Usage:
  octosql <query> [flags]
  octosql [command]

Examples:
octosql "SELECT * FROM myfile.json"
octosql "SELECT * FROM mydir/myfile.csv"
octosql "SELECT * FROM plugins.plugins"

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  plugin      

Flags:
      --describe         Describe query output schema.
      --explain int      Describe query output schema.
  -h, --help             help for octosql
      --optimize         Whether OctoSQL should optimize the query. (default true)
  -o, --output string    Output format to use. Available options are live_table, batch_table, csv, json and stream_native. (default "live_table")
      --profile string   Enable profiling of the given type: cpu, memory, trace.
  -v, --version          version for octosql

Use "octosql [command] --help" for more information about a command.

Error: typecheck error: couldn't create datasource: couldn't scan lines: bufio.Scanner: token too long
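Note that the json.Run prints never appear above, and the error is prefixed with typecheck error: the oversized line is hit during schema inference, before Run is ever called. My guess at what that path roughly looks like (illustrative names only, not the actual OctoSQL code); it presumably needs the same sc.Buffer treatment:

import (
    "bufio"
    "context"
    "fmt"
    "os"
)

// Hypothetical sketch of the schema-inference path.
func inferSchema(ctx context.Context, path string) error {
    f, err := os.Open(path)
    if err != nil {
        return fmt.Errorf("couldn't open file: %w", err)
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    // Without a sc.Buffer call here the scanner keeps its default cap,
    // so a long line fails with "bufio.Scanner: token too long":
    // sc.Buffer(nil, config.FromContext(ctx).Files.JSON.MaxLineSizeBytes)
    for sc.Scan() {
        // ...inspect initial lines to infer column types...
    }
    if err := sc.Err(); err != nil {
        return fmt.Errorf("couldn't scan lines: %w", err)
    }
    return nil
}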

I'll take a more in-depth look later and open a PR.

edit: added PR (2 loc) here https://github.com/cube2222/octosql/pull/336

chapmanjacobd commented 1 month ago

For what it's worth, I think this was working at one point, or maybe I just filtered out the long line; I don't really remember. But here is my octosql config:

$ cat ~/.octosql/octosql.yml
files:
  buffer_size_bytes: 33554432
  json:
    max_line_size_bytes: 33554432
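Both knobs are set to 32 MiB here; judging by the config dump in the earlier comment, buffer_size_bytes otherwise defaults to 4194304 (4 MiB).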