http0 making a lot of connections where I think it should only make one

phrrngtn commented 1 year ago

Hi Alex, I don't understand the behavior of http0 in an query I am developing: I am writing out the response to a HTTP get directly to the file-system and using the SQLite CROSS JOIN to explicitly force nested loop join order. My expectation is that I would see at most one HTTP connection i.e. no concurrency. However the observed behavior is many concurrent connections. Is there a way to configure the library to limit the number of connections. I know of the http_rate_limit function.

Here is the (edited) SQL snippet and a screengrab of the procexp output on Windows.

 T1 AS (
SELECT template_render(T.socrata_template,
                           JSON_object('domain', I.domain, 'resource_id', I.resource_id)) as url,
           template_render(T.local_path_template,
                           json_object('workspace_root', @socrata_data_root, 'domain', I.domain, 'resource_id', I.resource_id)) as path
FROM T, INFRASTRUCTURE AS I
)
SELECT fileio_write(T1.path, H.response_body)
FROM T1 CROSS JOIN http_get(T1.url) AS H;

phrrngtn commented 1 year ago

I was convinced that I did not have that behavior in other parts of my code and I checked how I was doing the JOIN and when I use LEFT OUTER JOIN with the http_get on the RHS, everything works just as expected. Perhaps I am just reading the SQLite docs incorrectly. I don't have much experience debugging query plans on SQLite because in general my datasets are quite small and I have never encountered any queries that take more than a few hundred milliseconds to run.

SELECT fileio_write(T1.path, H.response_body)
FROM T1 LEFT OUTER JOIN http_get(T1.url) AS H;

asg017 commented 1 year ago

Hey Paul - first thing, I just released v0.1.1 which fixed a few bugs that might help with this, I'd try it with that to see how it is.

Secondly, I'm surprised about the multiple concurrent connections. SQLite queries are single-threaded, and sqlite-http shouldn't have multiple connections ran at the same time. On every new row of http_get, a new HTTP request is made, the response_body is read in full (when the response_body column is referenced in SQL), then the connection is closed when the next row is read in.

Here's a small sample of what I typically do when I want a new HTTP request for every row in a table:

.load ./http0
.mode box
.header on

create table demo as select value from json_each('[1, 444, 999]');

select 
  value,
  response_body ->> 'args',
  timings ->> 'start' as started_at,
  timings ->> 'body_end' as ended_at
from demo
join http_get(printf('https://httpbin.org/get?value=%d', value));

┌───────┬──────────────────────────┬─────────────────────────┬─────────────────────────┐
│ value │ response_body ->> 'args' │       started_at        │        ended_at         │
├───────┼──────────────────────────┼─────────────────────────┼─────────────────────────┤
│ 1     │ {"value":"1"}            │ 2023-08-06 01:58:04.638 │ 2023-08-06 01:58:05.045 │
│ 444   │ {"value":"444"}          │ 2023-08-06 01:58:05.046 │ 2023-08-06 01:58:05.132 │
│ 999   │ {"value":"999"}          │ 2023-08-06 01:58:05.132 │ 2023-08-06 01:58:05.256 │
└───────┴──────────────────────────┴─────────────────────────┴─────────────────────────┘

As you can see in the started_at and ended_at columns, there's only 1 request at a time. The first request ended at 2023-08-06 01:58:05.045, the next one started 1ms second later, etc.

I recommended try it with just a regular JOIN, I'm guessing the other types of joins are doing some wonky things. I don't know too much about JOINs, but to debug further, I recommended using the timings columns to figure out when exactly each requests starts/ends. If you do find some concurrency bugs, let me know!

phrrngtn commented 1 year ago

I will update to the latest revision.

I read through https://www.sqlite.org/optoverview.html#order_of_tables_in_a_join more closely and I am actually happy with the LEFT OUTER JOIN style as it kind of makes sense to me. I would not be at all surprised if the funny behavior is due to one of the virtual table methods called more often than you expect it should be.

For my non-experimental work, I record all the HTTP requests in a logging table so will see if there are any instances of overlapping (in time) connections.

phrrngtn commented 1 year ago

This is fixed by the LEFT OUTER JOIN idiom.

asg017 / sqlite-http

http0 making a lot of connections where I think it should only make one #29