lawik opened this issue 1 year ago
Are you able to draw a waterfall graph of each file that needed to be fetched or provide some details as to the size of files and the relative time to first byte?
There are many ways to pre-compute parts, yes; this would include using an intermediary layer to concurrently prepare some of the responses, etc.
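For example, something along these lines (an untested sketch; `fetch_to_tmp` and the URLs are placeholders for the actual download code): download everything concurrently into temp files up front, then hand Packmatic ordinary `:file` entries:

```elixir
# Untested sketch: download every source concurrently into a temp file first,
# then build the archive from plain :file entries. fetch_to_tmp/1 stands in
# for whatever HTTP client is already in use.
fetch_to_tmp = fn url ->
  path = Path.join(System.tmp_dir!(), "prefetch-#{:erlang.unique_integer([:positive])}")
  # ... download `url` into `path` here ...
  File.write!(path, "contents of #{url}")
  path
end

urls = Enum.map(1..200, fn n -> "https://example.com/#{n}.txt" end)

entries =
  urls
  # fetch concurrently, results come back in order
  |> Task.async_stream(fetch_to_tmp, max_concurrency: 16, timeout: :infinity)
  |> Enum.with_index(1)
  |> Enum.map(fn {{:ok, path}, index} ->
    [source: {:file, path}, path: "#{index}.txt"]
  end)

entries
|> Packmatic.build_stream()
|> Stream.run()
```

The trade-off is temp-file disk usage and waiting for the prefetch before the first byte goes out, but the per-file latency stops dominating throughput.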
I think this reproduces the problem in a reasonable experimental way; it's a livebook:
```elixir
Mix.install([:packmatic, :kino, :kino_vega_lite])

alias VegaLite, as: Vl

delay = Kino.Input.number("Delay, ms", default: 300) |> Kino.render()
file_size = Kino.Input.number("File size, kb", default: 512) |> Kino.render()
entry_count = Kino.Input.number("Files", default: 200)

chart =
  Vl.new(width: 400, height: 400)
  |> Vl.mark(:line)
  |> Vl.encode_field(:x, "x", type: :quantitative)
  |> Vl.encode_field(:y, "y", type: :quantitative)
  |> Kino.VegaLite.new()

t1 = System.os_time(:millisecond)

log_event = fn event ->
  t2 = System.os_time(:millisecond)
  offset = t2 - t1

  case event do
    %Packmatic.Event.EntryUpdated{stream_bytes_emitted: bytes} ->
      seconds = offset / 1000

      if seconds > 0 do
        kb_per_second = bytes / 1024 / seconds
        IO.inspect(kb_per_second, label: "kb/s")
        point = %{x: seconds, y: kb_per_second}
        Kino.VegaLite.push(chart, point)
      end

    %Packmatic.Event.EntryCompleted{} ->
      IO.inspect(event)

    _ ->
      nil
  end

  :ok
end

latency = Kino.Input.read(delay)
size = Kino.Input.read(file_size)

small_remote_file = fn ->
  # Overhead latency for request
  :timer.sleep(latency)
  # size 512 kb
  {:ok, {:random, size * 1024}}
end

count = Kino.Input.read(entry_count)

entries =
  1..count
  |> Enum.map(fn num ->
    [
      source: {:dynamic, small_remote_file},
      path: "#{num}.txt"
    ]
  end)

{t, _} =
  :timer.tc(fn ->
    entries
    |> Packmatic.build_stream(on_event: log_event)
    |> Stream.run()
  end)

IO.inspect(t / 1000, label: "took ms")
IO.inspect(count * latency, label: "entries * delay, ms")
```
Pasting a livebook in GitHub is kinda weird :D
@lawik Revisiting the problem, there are some solutions around this:

- Make the Encoder concurrent by default with some adjustable prefetching logic, but make it tunable
- Write a ConcurrentEncoder and keep both
- Optimise connection setup: by using Hackney and pooling the connections we should be fast-ish to a certain degree (sketched below)

It would depend on:

- Whether the sources are on different hosts that resolve to different IP/port pairs, which would require separate connections
- Whether the individual files are large or small
- Etc.
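For the connection-setup angle, the Hackney side would look roughly like this (just a sketch; the pool name and limits are arbitrary, and wiring it into the URL source is the actual work):

```elixir
# Sketch only: a shared Hackney pool so repeated requests to the same host
# reuse connections instead of paying connection setup every time.
:hackney_pool.start_pool(:zip_sources, timeout: 15_000, max_connections: 50)

{:ok, _status, _headers, _body} =
  :hackney.request(:get, "https://example.com/file.txt", [], "", [
    :with_body,
    {:pool, :zip_sources}
  ])
```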
There is also another solution, which is to keep the encoding entries hot-addable: you have a producer, and the consumer just goes on and on until it gets an end message. Then the intermediary layer can be added.
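A very rough sketch of what that producer side could look like, using a plain Agent as the queue and `Stream.resource` as the consumer (untested, and Packmatic would still need to learn to accept a stream of entries rather than a finalized list):

```elixir
# Untested sketch of the "hot-addable entries" idea. An Agent holds the queue
# of pending entries plus a done flag; Stream.resource polls it and only halts
# once the queue is drained after the producer says it is finished.
{:ok, queue} = Agent.start_link(fn -> {[], false} end)

add_entry = fn entry ->
  Agent.update(queue, fn {pending, done} -> {pending ++ [entry], done} end)
end

finish = fn ->
  Agent.update(queue, fn {pending, _done} -> {pending, true} end)
end

entry_stream =
  Stream.resource(
    fn -> :ok end,
    fn acc ->
      case Agent.get_and_update(queue, fn
             {[entry | rest], done} -> {{:entry, entry}, {rest, done}}
             {[], true} -> {:done, {[], true}}
             {[], false} -> {:empty, {[], false}}
           end) do
        {:entry, entry} ->
          {[entry], acc}

        :done ->
          {:halt, acc}

        :empty ->
          # nothing queued yet; wait a bit and poll again
          :timer.sleep(50)
          {[], acc}
      end
    end,
    fn _acc -> :ok end
  )

# The producer keeps calling add_entry.(...) while the archive is being built
# and calls finish.() when there is nothing more to add. Having Packmatic
# consume `entry_stream` lazily is the hypothetical part.
```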
I no longer have the problem because we optimized away the need for about 7000 files and suddenly things are quite snappy.
The last option you mention would let the developer determine their own level of look-ahead. I assume this could be modeled as a stream of entries instead of a finalized list?
I have an archive that has a lot of small files, and while I haven't measured to confirm, I'm pretty sure the streaming is slowing down a lot (the download drops from multiple Mb/s to a few Kb/s) because the overhead/latency of each file fetch is more significant than the transfer time.
Thousands of files in this case. It does complete eventually, but it would be neat to be able to ask Packmatic to buffer at least X bytes forward beyond the current need, or something similar.
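For illustration, I'm imagining something along these lines (the option name here is made up; it does not exist in Packmatic today):

```elixir
entries = [[source: {:url, "https://example.com/1.txt"}, path: "1.txt"]]

# Hypothetical option, not part of Packmatic today: keep fetching ahead until
# at least this many bytes are buffered beyond what has already been emitted.
entries
|> Packmatic.build_stream(prefetch_bytes: 5 * 1024 * 1024)
|> Stream.run()
```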
Is there a way to do it that I've missed or would this be a good addition in your eyes?