Closed phrrngtn closed 1 year ago
I was convinced that I did not have that behavior in other parts of my code and I checked how I was doing the JOIN and when I use LEFT OUTER JOIN
with the http_get
on the RHS, everything works just as expected. Perhaps I am just reading the SQLite docs incorrectly. I don't have much experience debugging query plans on SQLite because in general my datasets are quite small and I have never encountered any queries that take more than a few hundred milliseconds to run.
SELECT fileio_write(T1.path, H.response_body)
FROM T1 LEFT OUTER JOIN http_get(T1.url) AS H;
Hey Paul - first thing, I just released v0.1.1 which fixed a few bugs that might help with this, I'd try it with that to see how it is.
Secondly, I'm surprised about the multiple concurrent connections. SQLite queries are single-threaded, and sqlite-http
shouldn't have multiple connections ran at the same time. On every new row of http_get
, a new HTTP request is made, the response_body
is read in full (when the response_body
column is referenced in SQL), then the connection is closed when the next row is read in.
Here's a small sample of what I typically do when I want a new HTTP request for every row in a table:
.load ./http0
.mode box
.header on
create table demo as select value from json_each('[1, 444, 999]');
select
value,
response_body ->> 'args',
timings ->> 'start' as started_at,
timings ->> 'body_end' as ended_at
from demo
join http_get(printf('https://httpbin.org/get?value=%d', value));
┌───────┬──────────────────────────┬─────────────────────────┬─────────────────────────┐
│ value │ response_body ->> 'args' │ started_at │ ended_at │
├───────┼──────────────────────────┼─────────────────────────┼─────────────────────────┤
│ 1 │ {"value":"1"} │ 2023-08-06 01:58:04.638 │ 2023-08-06 01:58:05.045 │
│ 444 │ {"value":"444"} │ 2023-08-06 01:58:05.046 │ 2023-08-06 01:58:05.132 │
│ 999 │ {"value":"999"} │ 2023-08-06 01:58:05.132 │ 2023-08-06 01:58:05.256 │
└───────┴──────────────────────────┴─────────────────────────┴─────────────────────────┘
As you can see in the started_at
and ended_at
columns, there's only 1 request at a time. The first request ended at 2023-08-06 01:58:05.045
, the next one started 1ms second later, etc.
I recommended try it with just a regular JOIN
, I'm guessing the other types of joins are doing some wonky things. I don't know too much about JOINs, but to debug further, I recommended using the timings
columns to figure out when exactly each requests starts/ends. If you do find some concurrency bugs, let me know!
I will update to the latest revision.
I read through https://www.sqlite.org/optoverview.html#order_of_tables_in_a_join more closely and I am actually happy with the LEFT OUTER JOIN
style as it kind of makes sense to me. I would not be at all surprised if the funny behavior is due to one of the virtual table methods called more often than you expect it should be.
For my non-experimental work, I record all the HTTP requests in a logging table so will see if there are any instances of overlapping (in time) connections.
This is fixed by the LEFT OUTER JOIN
idiom.
Hi Alex, I don't understand the behavior of http0 in an query I am developing: I am writing out the response to a HTTP get directly to the file-system and using the SQLite CROSS JOIN to explicitly force nested loop join order. My expectation is that I would see at most one HTTP connection i.e. no concurrency. However the observed behavior is many concurrent connections. Is there a way to configure the library to limit the number of connections. I know of the
http_rate_limit
function.Here is the (edited) SQL snippet and a screengrab of the procexp output on Windows.