crowdrec / idomaar

CrowdRec reference framework
Apache License 2.0

Investigating how to send input file "directly" through a serialization framework, and not using a shared folder #10

Closed by robertoturrin 9 years ago

robertoturrin commented 10 years ago

Currently input files are shared from the reference-framework to the computing environment by means of a shared folder. This task consists of investigating alternative approaches (e.g., FileMQ?), especially to support streaming of data.

vigsterkr commented 10 years ago

So I've been experimenting with and reviewing FileMQ (https://github.com/zeromq/filemq). It would be a good way to distribute the datasets, but it currently supports only a handful of programming languages.

I've implemented a simple file transfer using 0MQ building blocks, but I feel this actually adds overhead to the system, since it basically streams the whole file through the TCP connection once. Hence seeking within the file is not supported until the whole file has been read once. That said, I wonder whether seeking in the data set file is important at all in our use case: we are training on the whole data set, so the algorithm will need to read the whole file once anyway.

Anyhow, I like the folder/file sharing approach better than transferring. It supports seeking, and it avoids reading the file at least twice: with 0MQ, or any transfer-based approach, the file is read once to transfer it and once more to process it.
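
To illustrate the trade-off, here is a minimal sketch that uses Python's standard `socket` module in place of 0MQ. It is not the experiment described above: the local socket pair stands in for the TCP connection, and the file names, row format, and `CHUNK_SIZE` are hypothetical. It shows that a file in a shared folder can be seeked into directly, while a streamed file must first be spooled to a local copy (one full pass) before any seek is possible.

```python
import os
import socket
import tempfile
import threading

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size for the stream


def send_file(path, sock):
    """Stream the file over the socket in chunks, then signal end of stream."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            sock.sendall(chunk)
    sock.shutdown(socket.SHUT_WR)


def receive_to_local_copy(sock, dest_path):
    """First pass over the data: spool the stream to disk. Returns bytes received."""
    total = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = sock.recv(CHUNK_SIZE)
            if not chunk:  # sender closed its side
                break
            out.write(chunk)
            total += len(chunk)
    return total


def read_at(path, offset, length):
    """Random access into a regular file; only works once the file is local."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo():
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "dataset.tsv")  # hypothetical data set
    with open(source, "wb") as f:
        for i in range(1000):
            f.write(b"row-%04d\n" % i)  # fixed-width rows of 9 bytes each

    # Shared-folder approach: seek straight to row 500, no transfer pass.
    shared = read_at(source, 9 * 500, 9)

    # Transfer approach: the whole file crosses the socket before any seek.
    sender_sock, receiver_sock = socket.socketpair()
    sender = threading.Thread(target=send_file, args=(source, sender_sock))
    sender.start()
    copy = os.path.join(workdir, "copy.tsv")
    received = receive_to_local_copy(receiver_sock, copy)
    sender.join()
    sender_sock.close()
    receiver_sock.close()

    # Second pass: only now can the algorithm seek in (or read) the copy.
    transferred = read_at(copy, 9 * 500, 9)
    return shared, transferred, received


if __name__ == "__main__":
    print(demo())
```

Both paths return the same row, but the transfer path touches every byte once before the first seek, which is the extra pass discussed above.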

Any comments/insights about this?

davidemalagoli commented 9 years ago

We moved the architecture to Kafka.