MusicConnectionMachine / Relationships


Distributed processing #64

Open vviro opened 7 years ago

vviro commented 7 years ago

high priority

Guys, how are you preparing the tasks for the VMs to work on? Is it possible to start processing in a streaming fashion, i.e. can you start processing as soon as the first results from group 2 land in the database / blob store, and continue processing as new results come into group 2's database? The idea behind this is that it will take group 2 considerable time to finish updating the database. It would therefore be great if you could process group 2's results during the entire time their database update is running. Is this possible?

simonzachau commented 7 years ago

Originally we fetched all entities and took their sources as inputs for our algorithms. The better plan (I don't know if we have implemented it already) is to only look at the sources (which are assigned to an entity), so that we don't process any source twice. We then take a batch of them and wait via a promise for them to finish before taking the next batch.

In order to take batches while they're flowing in from group 2, there needs to be an ordering in the database so we know which sources we already have and whether there are new ones (remember the index). After we have finished processing the sources we find, we could keep checking the database (and process the new sources) until the number of sources stops increasing. What do you all think? Is there an easier way?
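
For illustration, a minimal sketch of that polling/batch loop. The helpers `fetchSourcesAfter` and `processSource` are placeholders, not actual project code, and the batch size / idle delay are made-up values:

```typescript
interface Source {
  id: number;   // monotonically increasing index, so we can remember where we left off
  url: string;
}

// Placeholder: in the real project this would query the group-2 database,
// e.g. SELECT ... WHERE id > $lastId ORDER BY id LIMIT $limit
async function fetchSourcesAfter(lastId: number, limit: number): Promise<Source[]> {
  return [];
}

// Placeholder: run the relationship algorithms on one source.
async function processSource(source: Source): Promise<void> {}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processWhileGroup2IsWriting(batchSize = 100, idleMs = 30_000): Promise<void> {
  let lastId = 0;
  while (true) {
    const batch = await fetchSourcesAfter(lastId, batchSize);
    if (batch.length === 0) {
      // Nothing new right now: wait once and re-check. If the count still
      // hasn't increased, assume group 2 is done and stop.
      await sleep(idleMs);
      if ((await fetchSourcesAfter(lastId, 1)).length === 0) break;
      continue;
    }
    await Promise.all(batch.map(processSource)); // wait for the whole batch via promises
    lastId = batch[batch.length - 1].id;         // remember the index
  }
}
```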

kordianbruck commented 7 years ago

We have two options for how to do this:

The simpler solution would be the second one.

kordianbruck commented 7 years ago

Guys, so following on from the Azure issue #18: is it possible to actually do your mining on a distributed set? Can we scale this?

@MusicConnectionMachine/group-3 @MusicConnectionMachine/group-4

ansjin commented 7 years ago

Done with the implementation of distributed processing using Azure queues (https://github.com/MusicConnectionMachine/Relationships/pull/80). Here is a brief overview of how it runs:

1. The first step is the creation of the queues (queue size limit: 5 GB, message size limit: 256 KB, maximum time data can be kept in a queue: 7 days). I have tried creating 50 queues, but more can also be created.

2. Populating the queues with messages: the main application => queries the DB => gets the websites => gets the file from blob storage => parses the file into individual web pages => passes each web page's content as a message to the queues.

   * Populating the queues is done round-robin (the first message is pushed to queue 0, the next to queue 1, and so on), so that each queue gets approximately the same number of messages and the queues can be processed in parallel. (See the producer sketch after this list.)

3. Now that the content is in the queues, the individual algorithm containers get the messages from the queues and find the relationships. After finding the relationships, a container pushes the resulting relationships into a different queue.

   * The important part here is that a message is not deleted from its queue until it has been completely processed (so an in-between crash does not lose it). Also, while an algorithm container is working on a message, the message is locked so that no other container can access it; the other containers can continue processing from the other queues. (See the worker sketch below.)

4. The other part gets the messages from the result queues (which hold the relationships/events) and saves them in the DB.
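
For illustration, roughly what steps 1 and 2 could look like with the `azure-storage` Node package. The queue naming scheme, queue count constant, and promisified helpers here are placeholders, not the actual code from #80:

```typescript
import * as azure from 'azure-storage';

const queueService = azure.createQueueService(); // reads AZURE_STORAGE_CONNECTION_STRING
const QUEUE_COUNT = 50;                           // 50 queues, as described above
const queueName = (i: number) => `webpages-${i}`; // hypothetical naming scheme

// Promisified wrappers around the callback-style azure-storage API.
const createQueue = (name: string) =>
  new Promise<void>((resolve, reject) =>
    queueService.createQueueIfNotExists(name, (err) => (err ? reject(err) : resolve())));

const pushMessage = (name: string, text: string) =>
  new Promise<void>((resolve, reject) =>
    queueService.createMessage(name, text, (err) => (err ? reject(err) : resolve())));

// Step 1: create the queues (a no-op for queues that already exist).
async function createQueues(): Promise<void> {
  for (let i = 0; i < QUEUE_COUNT; i++) await createQueue(queueName(i));
}

// Step 2: round-robin population -- message k goes to queue k mod QUEUE_COUNT,
// so all queues end up with roughly the same number of messages.
// Each page's content must stay under the 256 KB message size limit.
async function populate(pages: string[]): Promise<void> {
  for (let k = 0; k < pages.length; k++) {
    await pushMessage(queueName(k % QUEUE_COUNT), pages[k]);
  }
}
```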

The nice part is that data can stay in the queues for a week, so we don't have to run everything again to get the relationships. Once it has run, the data will be there, and we can fetch it and store it in the DB at any time. Also, getting and storing messages from the queues is a lot faster than from the DB.
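
And a sketch of the worker side from step 3, again with placeholder names (`RESULT_QUEUE`, `findRelationships`). With Azure storage queues, the visibility timeout on `getMessages` is what implements the locking described above:

```typescript
import * as azure from 'azure-storage';

const queueService = azure.createQueueService();
const RESULT_QUEUE = 'relationships'; // hypothetical name for the result queue

// Placeholder for the actual relationship-extraction algorithms.
async function findRelationships(pageContent: string): Promise<string> {
  return JSON.stringify({ relationships: [] });
}

async function drainQueue(inputQueue: string): Promise<void> {
  while (true) {
    // getMessages hides the returned messages from all other containers for
    // the visibility timeout (in seconds) -- this is the "lock" mentioned above.
    const messages = await new Promise<azure.QueueService.QueueMessageResult[]>(
      (resolve, reject) =>
        queueService.getMessages(inputQueue, { visibilityTimeout: 300 }, (err, result) =>
          err ? reject(err) : resolve(result)));
    if (messages.length === 0) break; // this queue is drained; move on

    for (const msg of messages) {
      const result = await findRelationships(msg.messageText!);

      // Push the result into the result queue first ...
      await new Promise<void>((resolve, reject) =>
        queueService.createMessage(RESULT_QUEUE, result, (err) =>
          err ? reject(err) : resolve()));

      // ... and only then delete the input message, so an in-between crash
      // makes it reappear after the timeout and get processed again.
      await new Promise<void>((resolve, reject) =>
        queueService.deleteMessage(inputQueue, msg.messageId!, msg.popReceipt!, (err) =>
          err ? reject(err) : resolve()));
    }
  }
}
```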

kordianbruck commented 7 years ago

@ansjin we need your help with this! Please respond on Gitter, or let us know when you are free this weekend to run this.

ansjin commented 7 years ago

@kordianbruck yes, I will try to run this by the weekend. And you tagged the wrong person here as well :)