S3 Lookups might be doing full GET requests to S3 instead of just looking at metadata

apache / druid

S3 Lookups might be doing full GET requests to S3 instead of just looking at metadata #2894

Closed by drcrallen 8 years ago

drcrallen commented 8 years ago

As per https://github.com/druid-io/druid/issues/2523#issuecomment-215495808, lookups regularly call org.jets3t.service.S3Service#listObjects when checking for new values.

This needs to be investigated to see whether the check can read only object metadata instead of issuing a full GET call.

pdeva commented 8 years ago

It seems I am running into this issue too: https://groups.google.com/forum/#!topic/druid-user/RUc8BNQ_6Ys

gianm commented 8 years ago

@drcrallen URIExtractionNamespaceFunctionFactory's cache populator calls puller.getVersion(uri) on the uri returned from getLatestVersion. puller.getVersion does a full getObject (a GET). It doesn't need to do that; it could get away with getObjectDetails (a HEAD).

Even a getObjectDetails doesn't seem necessary, since the objects that come back from listObjects have the modified dates in them.
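To make the difference concrete, here is a minimal sketch of deriving the version from the listing alone. It assumes the version is simply the last-modified time. CountingStore, ObjectSummary, and versionFromListing are hypothetical stand-ins invented for illustration; they are not the jets3t or Druid API. The fake store only counts how many LIST, HEAD, and GET style requests a version check would issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these types are jets3t or Druid classes.
class VersionCheckSketch {

    /** The metadata a listing already carries for each object. */
    static final class ObjectSummary {
        final String key;
        final long lastModifiedMillis;

        ObjectSummary(String key, long lastModifiedMillis) {
            this.key = key;
            this.lastModifiedMillis = lastModifiedMillis;
        }
    }

    /** In-memory fake store that counts each kind of request it receives. */
    static final class CountingStore {
        private final Map<String, Long> objects = new TreeMap<>();
        int listCalls;
        int headCalls;
        int getCalls;

        void put(String key, long lastModifiedMillis) {
            objects.put(key, lastModifiedMillis);
        }

        // Stand-in for a LIST request: keys and modified times, no object bodies.
        List<ObjectSummary> list(String prefix) {
            listCalls++;
            List<ObjectSummary> out = new ArrayList<>();
            for (Map.Entry<String, Long> e : objects.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    out.add(new ObjectSummary(e.getKey(), e.getValue()));
                }
            }
            return out;
        }

        // Stand-in for a HEAD request: metadata for one key, no body.
        long head(String key) {
            headCalls++;
            return objects.get(key);
        }

        // Stand-in for a GET request: a real store would also transfer the body.
        long get(String key) {
            getCalls++;
            return objects.get(key);
        }
    }

    /** Returns the most recently modified object under the prefix, or null if none. */
    static ObjectSummary latest(CountingStore store, String prefix) {
        ObjectSummary best = null;
        for (ObjectSummary s : store.list(prefix)) {
            if (best == null || s.lastModifiedMillis > best.lastModifiedMillis) {
                best = s;
            }
        }
        return best;
    }

    /** Derives the version from listing metadata only: one LIST, no HEAD, no GET. */
    static String versionFromListing(CountingStore store, String prefix) {
        ObjectSummary s = latest(store, prefix);
        return s == null ? null : Long.toString(s.lastModifiedMillis);
    }
}
```

With this shape, a periodic check costs one listing per poll, and the object body only needs to be fetched when the version actually changes.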

drcrallen commented 8 years ago

@gianm thanks, I'll see if I can get it updated with that fix for 0.9.1