Integrate S3 source into idomaar flume source

crowdrec / idomaar

CrowdRec reference framework

Apache License 2.0


Integrate S3 source into idomaar flume source #39

Closed davidemalagoli closed 8 years ago

davidemalagoli commented 9 years ago

I think that this is already implemented, right?

andras-sereny commented 9 years ago

No, I think currently we only use IdomaarSource in the Flume configs, either with an HTTPStream or FileStream reader. Also, S3 as source in Flume seems to be bleeding edge: https://issues.apache.org/jira/browse/FLUME-2437

davidemalagoli commented 8 years ago

@andras-sereny I've created a very simple implementation (using the AWS SDK and the HTTP protocol) to integrate the S3 source (see commit). Could you please give me feedback on:

implementation & test
how to pass credentials to the orchestrator (it currently uses the default .aws/credentials file, but we should pass them to the orchestrator in some way)
how to modify the "--data-source" parameter to support s3:// URIs
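On the last point, one possible approach is to detect the s3:// scheme in the "--data-source" value and split it into a bucket and an object key. The following is only a minimal sketch in Python, not the code from the commit: the function names `is_s3_source` and `parse_s3_uri` and the example bucket are hypothetical.

```python
from urllib.parse import urlparse


def is_s3_source(value):
    """Return True if a --data-source value looks like an s3:// URI."""
    return value.startswith("s3://")


def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper for illustration; raises ValueError for
    anything that is not an s3:// URI with a bucket name.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an s3:// URI: %r" % uri)
    # The path starts with "/", which is not part of the object key.
    key = parsed.path.lstrip("/")
    return parsed.netloc, key


if __name__ == "__main__":
    # "example-bucket" and the key are made-up values.
    print(parse_s3_uri("s3://example-bucket/data/sample.dat"))
```

Any other value (a local path or an http:// URL) would fall through to the existing FileStream or HTTPStream handling.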

thanks

andras-sereny commented 8 years ago

Hi @davidemalagoli, I've introduced an option in newsreel-test.sh to get data from S3: if the env variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set, the test script will try to fetch the example file from S3.

The idomaar.sh start script passes the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the original environment to the orchestrator environment, which then passes them on to any process it starts.
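The same pattern can be illustrated in Python. This is a rough sketch of the idea only, not the actual shell code in idomaar.sh or newsreel-test.sh: the helper names `s3_enabled`, `forwarded_env`, and `run_with_credentials` are hypothetical, and only the two variable names come from the scripts described above.

```python
import os
import subprocess
import sys

AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def s3_enabled(env):
    """S3 is used only when both credential variables are set and non-empty."""
    return all(env.get(name) for name in AWS_VARS)


def forwarded_env(source_env, base_env):
    """Copy the AWS variables from source_env on top of base_env."""
    env = dict(base_env)
    for name in AWS_VARS:
        if name in source_env:
            env[name] = source_env[name]
    return env


def run_with_credentials(cmd, source_env):
    """Start a child process that sees the forwarded credentials.

    The child passes its environment on to any process it starts
    in turn, which is the behaviour described above.
    """
    env = forwarded_env(source_env, os.environ)
    return subprocess.run(cmd, env=env, check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    if s3_enabled(os.environ):
        child = [sys.executable, "-c",
                 "import os; print('AWS_ACCESS_KEY_ID' in os.environ)"]
        print(run_with_credentials(child, os.environ).stdout.strip())
    else:
        print("AWS credentials not set; falling back to the local file")
```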

I'll do some work on this to make sure idomaar-demo.sh works with S3.

davidemalagoli commented 8 years ago

Perfect, let me know when you've finished so I can update the wiki and documentation.

Thanks!

davidemalagoli commented 8 years ago

Hi @andras-sereny, did you have time to complete the tests?

andras-sereny commented 8 years ago

Sorry, I was sick and out of the office for quite a while. I'll finish it this week.

andras-sereny commented 8 years ago

Hi @davidemalagoli, as of 88f324def3c04296ae856415654f0a8282721c72, idomaar-demo.sh can read data from S3 (if the env variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set), but on the aws branch it fails at the Spark evaluation step:

INFO [datastream] File "/vagrant/evaluator/eval.py", line 34, in evalRecall
INFO [datastream] GTList = set([k['object']['id'] for k in x['GT']['expected']['evidences']])
ERROR [datastream] TypeError: 'int' object has no attribute '__getitem__'
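The message in this traceback is what Python 2 reports when an int is indexed with square brackets, so one of the values being subscripted on that line (x, x['GT'], x['GT']['expected'], an element of evidences, or its 'object') is an int rather than a dict. Which one is not visible from the log. A hedged debugging sketch, assuming the record layout implied by the expression above, could walk the same path and report the first level that is not a dict; `ground_truth_ids` is a hypothetical helper, not part of eval.py.

```python
def ground_truth_ids(record):
    """Collect ground-truth object ids, naming the level that is not a dict."""

    def descend(value, key, path):
        # Fail with the offending path instead of a bare TypeError.
        if not isinstance(value, dict):
            raise TypeError("expected a dict at %s, got %s"
                            % (path, type(value).__name__))
        return value[key]

    gt = descend(record, "GT", "record")
    expected = descend(gt, "expected", "record['GT']")
    evidences = descend(expected, "evidences", "record['GT']['expected']")

    ids = set()
    for i, evidence in enumerate(evidences):
        obj = descend(evidence, "object", "evidences[%d]" % i)
        ids.add(descend(obj, "id", "evidences[%d]['object']" % i))
    return ids
```

Calling this helper on a failing record in place of the set comprehension would show whether the S3 path delivers records in a different shape from the file-based path.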