MusicConnectionMachine / UnstructuredData

In this project we will be scanning unstructured online resources such as the common crawl data set
GNU General Public License v3.0

Progress indication #179

Closed sacdallago closed 7 years ago

sacdallago commented 7 years ago

Have a means to check progress, something along the lines of:

Worker1: /////////////////50%____
Worker2: //////////////////////70%
Worker3: /////////20%___
Worker4: ////////////////////////////90%____

Total: /////////////////50%_____
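A display like the one above is easy to generate. A minimal sketch (the worker names, percentages, and bar width are illustrative, not from the actual pipeline):

```javascript
// Hypothetical sketch: render per-worker progress bars like the mockup above
// from a map of completion percentages.
function renderBar(name, percent, width = 20) {
  const filled = Math.round(width * percent / 100);
  return name + ': ' + '/'.repeat(filled) + percent + '%' + '_'.repeat(width - filled);
}

const workers = { Worker1: 50, Worker2: 70, Worker3: 20, Worker4: 90 };
for (const [name, percent] of Object.entries(workers)) {
  console.log(renderBar(name, percent));
}

// Total is just the average of the per-worker percentages here
const total = Math.round(Object.values(workers).reduce((a, b) => a + b, 0) / 4);
console.log(renderBar('Total', total));
```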

felixschorer commented 7 years ago

How does console.log() work when randomly connecting via SSH? Will it actually show up in the terminal?

sacdallago commented 7 years ago

You can pipe the output to a file on the remote machine (since you are running the scripts in the background anyway?), or better yet to a file on the calling machine (random first hit on Google).

On the master: you should know when an SSH connection is dropped --> the process on the remote has completed --> x WET files have been parsed.

pfent commented 7 years ago

You could also go the whole way and set up a small web server (e.g. ~7 LoC with express) to which we POST the current progress 🤔

vviro commented 7 years ago

I would suggest you take a look at queues; they would make the whole pipeline much less brittle and more efficient.
http://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/sqs-examples-using-queues.html
http://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/sqs-examples-send-receive-messages.html
https://docs.microsoft.com/en-us/azure/storage/storage-nodejs-how-to-use-queues

The code generating the tasks would be no more complex than:

var wetUrlList = getWetUrls(); // a function that returns an array of WET file URLs to be fetched from Amazon
for (const url of wetUrlList) {
  queueSvc.createMessage('taskQueue', url, function(error, result, response) {
    if (!error) {
      // Message was inserted; otherwise, retry the insert later
    }
  });
}

The code consuming the tasks from the queue would look like:

var azure = require('azure-storage');
var queueSvc = azure.createQueueService();

function pullTask() {
  queueSvc.getMessages('taskQueue', {visibilityTimeout: 5 * 60}, function(error, result, response) {
    if (!error) {
      if (result.length > 0) {
        // Message text is in result[0].messageText
        var wetUrl = result[0].messageText;
        // process(wetUrl)
        // after the processing successfully finishes, remove the task from the queue
        queueSvc.deleteMessage('taskQueue', result[0].messageId, result[0].popReceipt, function() {
          pullTask(); // pull the next task
        });
      } else {
        setTimeout(pullTask, 30 * 1000); // queue empty, check again in 30 seconds
      }
    } else {
      setTimeout(pullTask, 10 * 1000); // transient error, retry in 10 seconds
    }
  });
}

pullTask();

And that's it. You just push all the tasks into the queue and watch the number of messages decrease while your slaves process the tasks (there is your progress indicator). No explicit partitioning of tasks is required, no extra complexity, no processing the same message twice, no caring for idle nodes (unless the code crashes, of course), no SSHing into the nodes, and many additional benefits. All with the simple code above. Why not do this instead of building on top of the current brittle process?

lukasstreit commented 7 years ago

I like @vviro's approach; my only worry is the additional debugging that implementing a new solution would bring at this point. We could use a web service that starts the whole processing on a POST request and returns information on the current progress on a GET. Obviously some kind of authentication would be required here.

I don't quite see yet how this would take care of SSHing into the machines to deploy and start Node, though.

Auto-shutdown might be possible by sending a queue item called 'shutdown' from the web service to the machine, which triggers the machine to run the shutdown command in the shell.

vviro commented 7 years ago

@LukasS96 you cannot send a message to a particular machine over a queue; that's not how queues are meant to be used. Yes, auto-killing dead machines is something that needs to be taken care of by other means. But the point is that, barring the NodeJS process dying, there is no need to kill the machines: they will always have tasks until everything is done, at which point we can kill all of them.

vviro commented 7 years ago

There is also no need to SSH into the machines, because the node process can be started without any custom parameters when the machine is initialized.

felixschorer commented 7 years ago

@vviro where does messages come from in the above code? Also, use for (let item of array); in is for object properties and would only return the indexes of an array 😄

vviro commented 7 years ago

I know, I know; that was just for brevity. Well done for catching it :+1:

felixschorer commented 7 years ago

Queue implemented. Progress can be monitored by the number of items in the queue.
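Turning the remaining queue length into a percentage is then a one-liner, if the master remembers how many tasks it enqueued. A minimal sketch; the wiring to the queue service (shown as a comment) assumes the same azure-storage queueSvc as in the snippets above:

```javascript
// Hypothetical sketch: derive a progress indicator from the queue length.
// totalTasks is the number of messages the master originally enqueued,
// remaining is how many are still in the queue.
function formatProgress(totalTasks, remaining) {
  var done = totalTasks - remaining;
  return 'Progress: ' + done + '/' + totalTasks +
         ' (' + Math.round(100 * done / totalTasks) + '%)';
}

// Illustrative wiring, assuming the queueSvc from the earlier snippets:
// queueSvc.getQueueMetadata('taskQueue', function(error, result) {
//   if (!error) console.log(formatProgress(totalTasks, result.approximateMessageCount));
// });
```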