apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.93k stars 979 forks source link

HTTP plugin timeout error #2869

Closed RayetSena closed 8 months ago

RayetSena commented 9 months ago

Hello, When I make a query, I sometimes get the result I expect, but I usually get a timeout error. The result contains only 3000 rows and 3 columns. I use the http plugin for my query; you can see its configuration below. I checked for network issues and timeout settings, but I keep getting the error. When I use a more complex query, I almost always get an error. I added both queries to the below.

Configuration

{
  "type": "http",
  "connections": {
    "feed": {
      "url": "https://opendata.cbs.nl/ODataFeed/odata/",
      "requireTail": true,
      "method": "GET",
      "authType": "none",
      "inputType": "json",
      "xmlDataLevel": 1,
      "postParameterLocation": "QUERY_STRING",
      "verifySSLCert": true
    }
  },
  "retryDelay": 1000,
  "proxyType": "direct",
  "authMode": "USER_TRANSLATION",
  "enabled": true
}

Expected behavior Query

select s.flatdata.ID as ID, s.flatdata.WijkenEnBuurten as WijkenEnBuurten, s.flatdata.AantalInwoners_5 as AantalInwoners_5
from (select flatten(value) as flatdata from http.feed.`85039NED/UntypedDataSet?$top=3000&$format=json`) as s

image

Error detail, log output or screenshots image image

Drill version 1.21.1

Additional context Complex Query

select result.AantalInwoners_5 as AantalInwoners_5, result.WerkzameBeroepsbevolking_2 as WerkzameBeroepsbevolking_2,
CAST((CAST(result.AantalInwoners_5 as double) / CAST(result.WerkzameBeroepsbevolking_2 as double)) as double)*100 as Percentage from (select * from 
(select s.flatdata.ID as ID, s.flatdata.WijkenEnBuurten as WijkenEnBuurten, s.flatdata.AantalInwoners_5 as AantalInwoners_5
from (select flatten(value) as flatdata from http.feed.`85039NED/UntypedDataSet?$top=3000&$format=json`) as s union all 
select a.flatdata.ID as ID, a.flatdata.WijkenEnBuurten as WijkenEnBuurten, a.flatdata.AantalInwoners_5 as AantalInwoners_5
from (select flatten(value) as flatdata from http.feed.`85039NED/UntypedDataSet?$top=3000&$skip=3000&$format=json`) as a  union all 
select d.flatdata.ID as ID, d.flatdata.WijkenEnBuurten as WijkenEnBuurten, d.flatdata.AantalInwoners_5 as AantalInwoners_5
from (select flatten(value) as flatdata from http.feed.`85039NED/UntypedDataSet?$top=3000&$skip=6000&$format=json`) as d union all 
select f.flatdata.ID as ID, f.flatdata.WijkenEnBuurten as WijkenEnBuurten, f.flatdata.AantalInwoners_5 as AantalInwoners_5
from (select flatten(value) as flatdata from http.feed.`85039NED/UntypedDataSet?$top=3000&$skip=9000&$format=json`) as f union all 
select b.flatdata.ID as ID, b.flatdata.WijkenEnBuurten as WijkenEnBuurten, b.flatdata.AantalInwoners_5 as AantalInwoners_5
from (select flatten(value) as flatdata from http.feed.`85039NED/UntypedDataSet?$top=3000&$skip=12000&$format=json`) as b union all 
select c.flatdata.ID as ID, c.flatdata.WijkenEnBuurten as WijkenEnBuurten, c.flatdata.AantalInwoners_5 as AantalInwoners_5
from (select flatten(value) as flatdata from http.feed.`85039NED/UntypedDataSet?$top=3000&$skip=15000&$format=json`) as c union all 
select v.flatdata.ID as ID, v.flatdata.WijkenEnBuurten as WijkenEnBuurten, v.flatdata.AantalInwoners_5 as AantalInwoners_5
from (select flatten(value) as flatdata from http.feed.`85039NED/UntypedDataSet?$top=3000&$skip=18000&$format=json`) as v 
order by ID) as test INNER JOIN (select s.flatdata.WerkzameBeroepsbevolking_2 as WerkzameBeroepsbevolking_2, s.flatdata.WijkenEnBuurten as WijkenEnBuurten
from (select flatten(value) as flatdata from http.feed.`85485NED/TypedDataSet?$format=json`) as s where s.flatdata.WijkenEnBuurten like 'BU188%') as a
on test.WijkenEnBuurten=a.WijkenEnBuurten) as result
jnturton commented 9 months ago

I think this is likely to be due to rate limiting on the API you're querying. Each FROM http.feed.____ clause you've got is going to result in a separate HTTP request to the API, and probably too many of those are being sent in too short a space of time for the server's liking. I can't remember if @cgivre added a throttling feature to the HTTP plugin but another possibility might be to split the query into a few pieces, which you could separate by delays, using CTAS to keep retrieved data around locally for later queries that rely on it.

cgivre commented 9 months ago

One more thing... I was wrong. There are some throttling parameters. You can set the timeout parameter as well as a retryDelay parameter.
Best, -- C

On Jan 10, 2024, at 02:54, James Turton @.***> wrote:

I think this is likely to be due to rate limiting on the API you're querying. Each FROM http.feed.____ clause you've got is going to result in a separate HTTP request to the API, and probably too many of those are being sent in too short a space of time for the server's liking. I can't remember if @cgivre https://github.com/cgivre added a throttling feature to the HTTP plugin but another possibility might be to split the query into a few pieces using CTAS to keep retrieved data around locally for later queries that rely on it.

— Reply to this email directly, view it on GitHub https://github.com/apache/drill/issues/2869#issuecomment-1884350260, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKB7PU2W4CRC676LRBAE23YNZCMPAVCNFSM6AAAAABBSY4EASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBUGM2TAMRWGA. You are receiving this because you were mentioned.