Kaliahh closed this issue 2 years ago
Suggested possible classes:
Simulation - Christian (Done)
Rendering - Jakob (Done)
Clustering (data analysis) - Mikael (Done)
AI training - Jakob (Done)
Modelling
Mathematical computations (medicine, chemistry) - Mikael (Working)
Hosting of a distributed web service - Jakob (Done)
Is parallelization the solution?
Video transcoding
Image and video rendering
According to Wikipedia: Rendering is an embarrassingly parallel workload in multiple domains (e.g., pixels, objects, frames) and thus has been the subject of much research.
Distributed rendering: https://www.awsthinkbox.com/blog/distributed-rendering-a-guide Distributed rendering is a rendering technique where multiple machines across a network render a single frame of a scene or image. The frame is divided into smaller regions, and each machine receives some of them to render. Once each region has been rendered, it is returned to the client machine and combined with other rendered regions to form the final image.
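The split-and-combine step described above can be sketched in pure Python; `render_region` is just a placeholder for the real render engine each machine would run, and all names are illustrative:

```python
# Hypothetical sketch of distributed rendering's split/merge step.
# The frame is modeled as a dict of pixel coordinates -> RGB tuples.

def split_into_regions(width, height, n_cols, n_rows):
    """Divide a width x height frame into rectangular regions (x, y, w, h)."""
    regions = []
    tile_w, tile_h = width // n_cols, height // n_rows
    for row in range(n_rows):
        for col in range(n_cols):
            regions.append((col * tile_w, row * tile_h, tile_w, tile_h))
    return regions

def render_region(region):
    """Placeholder: a real worker would invoke a render engine here."""
    x, y, w, h = region
    return {(x + dx, y + dy): (0, 0, 0) for dy in range(h) for dx in range(w)}

def combine(rendered_regions):
    """Merge the per-region results back into one frame."""
    frame = {}
    for pixels in rendered_regions:
        frame.update(pixels)
    return frame

regions = split_into_regions(640, 480, 4, 4)        # 16 regions for 16 workers
frame = combine(render_region(r) for r in regions)  # reassembled 640x480 frame
```

Each region is independent, which is exactly what makes the workload embarrassingly parallel.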
Input: The scene file, with details including but not limited to geometry, viewpoint, texture, lighting, and shading information describing the virtual scene.
Output: An image.
Have to solve "the rendering equation" https://dl.acm.org/doi/abs/10.1145/15922.15902, which is a mathematical formula. There are also rendering programs/frameworks/engines that can be used to do this. So basically: run a rendering engine with the scene data.
Rendering is when the computer calculates the light in our scene to create the final image or animation. To calculate the lighting the render engine needs information from our scene. This includes, but is not limited to, things like:
https://artisticrender.com/how-to-render-in-blender/
Ray tracing vs rasterization
Can run Blender from the command line: https://docs.blender.org/manual/en/latest/advanced/command_line/render.html Would need to have Blender installed.
Dependent on the input data. Also needs a rendering application/engine installed to carry out the actual rendering. If the rendering of a single scene is split, the results need to be combined.
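One simple way to distribute an animation is to split the frame range across providers, each running Blender headless. A sketch of building the commands; the flags (`-b`, `-o`, `-s`, `-e`, `-a`) are from the linked Blender docs, while the file names and the chunking helper are illustrative:

```python
# Sketch: split an animation's frame range across n_workers providers,
# each of which would run Blender without a UI on its chunk.
import subprocess

def blender_command(blend_file, out_pattern, start, end):
    """Build a headless Blender render command for frames start..end."""
    return [
        "blender", "-b", blend_file,   # -b: run in background (no UI)
        "-o", out_pattern,             # output path pattern, e.g. //out/frame_####
        "-s", str(start), "-e", str(end),
        "-a",                          # render the animation range
    ]

def split_frames(first, last, n_workers):
    """Divide the frame range into roughly equal contiguous chunks."""
    total = last - first + 1
    chunks, pos = [], first
    for i in range(n_workers):
        size = total // n_workers + (1 if i < total % n_workers else 0)
        chunks.append((pos, pos + size - 1))
        pos += size
    return chunks

jobs = [blender_command("scene.blend", "//out/frame_####", s, e)
        for s, e in split_frames(1, 250, 4)]
# Each provider would then run its command, e.g.:
# subprocess.run(jobs[0], check=True)
```

Frame-level splitting avoids the recombination step entirely; only per-region splitting of a single frame needs merging afterwards.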
In summary: there is potential; distributing the training of Deep Neural Networks is an active research area.
A provider would need to install a program that can train a given DNN, but this seems feasible.
In some of the approaches, there seems to be a need for a lot of communication between the nodes/providers.
Data parallelism
Task parallelism
Hybrid parallelism
Model parallelism
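Of these, data parallelism is the simplest to sketch: every worker holds a full copy of the model, computes gradients on its own data shard, and the averaged gradient is applied everywhere. A toy illustration with a one-parameter linear model and made-up data:

```python
# Toy data parallelism: per-worker gradients on local shards, averaged
# ("all-reduce") before a single shared parameter update.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w*x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    grads = [local_gradient(w, s) for s in shards]  # would run in parallel
    avg = sum(grads) / len(grads)                   # all-reduce: average
    return w - lr * avg                             # identical update everywhere

# Data for the true relation y = 3x, split across two "workers"
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
# w converges towards 3.0
```

The gradient exchange in step two is where the heavy inter-node communication mentioned above comes from: it happens once per training step.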
The article seems to have the same starting point that we do: distribution is good/needed for training DNNs, but it is difficult for researchers to set this up in a way that is efficient, as it is not their area of expertise.
Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud.
It works like this: your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.
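The server-side averaging step in that quote can be sketched as follows. This follows the federated averaging idea (updates weighted by how many local examples each device used), with a scalar "model" standing in for the real weight vector; the details are simplified:

```python
# Sketch of federated averaging on the server: clients send only
# (num_examples, model_delta) pairs, never their raw training data.

def server_average(current_model, client_updates):
    """client_updates: list of (num_examples, delta) pairs.
    Returns the new global model after the weighted average of deltas."""
    total = sum(n for n, _ in client_updates)
    avg_delta = sum(n * d for n, d in client_updates) / total
    return current_model + avg_delta

# Three devices report updates to the shared (scalar, for illustration) model
global_model = 1.0
updates = [(100, 0.2), (50, -0.1), (50, 0.1)]
global_model = server_average(global_model, updates)  # -> 1.1
```

Weighting by example count means a device that trained on more data pulls the shared model further, which matches the intuition behind the quoted description.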
Towards Federated Learning at Scale: System Design
In synchronized data-parallel distributed deep learning, the major computation steps are: (1) a forward pass on each node's local mini-batch, (2) a backward pass computing local gradients, (3) an all-reduce that averages the gradients across nodes, and (4) an identical parameter update on every node.
Clustering algorithms are one of the main methods of data mining large data sets for further analysis. This is usually done on a centralized server where all the data necessary for the data mining has been collected in advance. Distributed networks become relevant for data mining when the data is too large for a centralized server to handle, either because sending the server the necessary data would take too long to be useful, or because the server cannot handle all the data on its own in a reasonable time frame.
In this scenario, each decentralized server that has data relevant for the central server does the relevant data mining separately on its own set of data, sending the result to the central server for further data mining. There are more issues here, like dealing with duplicate data, but they are all user related and have thus been omitted. This solution also applies if the issue is that data on the decentralized servers may not be accessed from the central one due to security concerns.
In this scenario, the central server takes the main data, divides it into pieces, and sends them to a series of decentralized servers to handle. From there it is like scenario 1, where each decentralized server mines the data and sends the results back to the central server for further analysis.
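Scenario 1 above can be sketched as follows: each decentralized server clusters its own data and sends only a compact summary (centroid, count) to the central server, which merges summaries instead of raw data. One-dimensional data and a single cluster per server keep the example short; a real system would run e.g. k-means locally:

```python
# Sketch of decentralized clustering: raw data stays local, only small
# (centroid, size) summaries cross the network.

def local_summary(points):
    """Cluster locally; here one cluster, summarized as (centroid, size)."""
    return (sum(points) / len(points), len(points))

def central_merge(summaries):
    """Combine per-server summaries into one global centroid."""
    total = sum(n for _, n in summaries)
    return sum(c * n for c, n in summaries) / total

servers = [[1.0, 2.0, 3.0], [10.0, 12.0], [2.0, 4.0]]
summaries = [local_summary(s) for s in servers]  # this is all that is sent
global_centroid = central_merge(summaries)       # equals the mean of all points
```

The count in each summary is what lets the merge be exact: a weighted average of centroids reproduces the centroid of the pooled data.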
https://academic.csuohio.edu/fuy/Pub/tcdp01.pdf https://www.researchgate.net/publication/318536108_Data_mining_in_distributed_environment_a_survey
The video game Borderlands 3 contains a minigame that is part of a large distributed network helping to analyse DNA sequences of microorganisms. The minigame is used to automate a tedious task in microbiome research: sorting the microbes into similar groupings and classifying them. Human involvement was deemed necessary, as humans, unlike computers, are good at recognizing "close enough" pairings of the microbes. This task has been made fun by obfuscating the more tedious parts of it and giving the players rewards in the main game for completing minigames, especially if they perform well.
Input: Continuous stream of data based on user input through a minigame, creating possible pairings of microbes.
Output: An ever-evolving list of microbes, where each microbe eventually gets paired with one or more other microbes.
Source: https://www.digitaltrends.com/cool-tech/borderlands-minigame-citizen-science/
A use for distributed systems in the field of neuroscience is working with and researching uses of brain-computer interfaces (BCI), with a focus on signal acquisition, feature extraction and classification.
Standard used for distributed network: “Common Object Request Broker Architecture (CORBA) is a standard defined by the Object Management Group (OMG), designed to facilitate the communication of systems that are deployed on diverse platforms. CORBA enables collaboration between systems on different operating systems” - Wiki
Input: Neural signals acquired from a device, referred to as raw EEG signals. Can be stored as standard ASCII or binary files.
Output: An action or behaviour, which can be interpreted as an executable command.
Consideration: For the distributed system to be useful outside of just research, it must be a continuous system that is capable of reading from a device that is connected to the BCI system.
Source: https://www.researchgate.net/publication/43649583_Application_of_Distributed_System_in_Neuroscience_A_Case_Study_of_BCI_Framework http://www.ois.com/Products/what-is-corba.html https://en.wikipedia.org/wiki/Common_Object_Request_Broker_Architecture
• Pentesting, security
• Particle physics
• Weather, climate and geography
• Economics and finance
• Bioinformatics
Source: https://www.researchgate.net/publication/43649583_Application_of_Distributed_System_in_Neuroscience_A_Case_Study_of_BCI_Framework -- which cites --> https://vowi.fsinf.at/images/b/bc/TU_Wien-Verteilte_Systeme_VO_%28G%C3%B6schka%29_-_Tannenbaum-distributed_systems_principles_and_paradigms_2nd_edition.pdf
Content delivery/distribution network (CDN) is a web service that is inherently distributed. There is potential, but I fear that it has little to no relevance to non-commercial users.
CDN: Content Distribution Network 2004
A survey of peer-to-peer content distribution technologies 2004
Globally distributed content delivery 2002
It seems that much hosting is inherently distributed. An application runs on multiple nodes in a cluster, and a load balancer distributes requests. My initial assumption is that distributing across privately owned commercial/commodity hardware is not practical in most instances.
One problem with doing this may be tail latency. Tail latency, also known as high-percentile latency, refers to high latencies that clients see fairly infrequently.
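A small illustration of why the mean hides tail latency (the numbers are made up):

```python
# 1% of requests are very slow; the mean looks fine, the 99th
# percentile reveals what unlucky clients actually experience.
import statistics

latencies_ms = [10] * 990 + [500] * 10

mean = statistics.mean(latencies_ms)                        # 14.9 ms
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]   # 500 ms
```

On heterogeneous, privately owned hardware the slow 1% would likely be even slower and more common, which is why tail latency is a concern for this class.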
Cost-Effective Geo-Distributed Storage for Low-Latency Web Services
This paper seems to attempt to handle exactly this. Perhaps a similar approach could be devised for privately owned commodity hardware.
However, there is the question of whether or not doing so has any relevance.
For the computations of agent-based simulations, each agent contains local state variables along with a set of actions that the agent can perform. The simulation is split into phases. In the first phase, all agents take a decision; after that, they all act on their decision. In the next phase, each agent updates its local state based on the actions taken by the other agents in the system. In the last phase, the scenario model is updated. Hereafter, the phases are repeated a fixed number of times, which makes up the execution of the simulation.
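The phase loop described above can be sketched roughly like this, with trivial toy agents; a real model would plug in its own decision and update logic:

```python
# Sketch of the decide -> act -> update-state -> update-model phase loop.

class Agent:
    def __init__(self):
        self.state = 0

    def decide(self, scenario):
        return 1                        # toy decision: always "act by 1"

    def update(self, all_actions):
        self.state += sum(all_actions)  # react to everyone's actions

def run_simulation(agents, steps):
    scenario = {"tick": 0}
    for _ in range(steps):
        decisions = [a.decide(scenario) for a in agents]  # phase 1: decide
        actions = decisions                               # phase 2: act
        for a in agents:                                  # phase 3: update state
            a.update(actions)
        scenario["tick"] += 1                             # phase 4: update model
    return scenario, [a.state for a in agents]

scenario, states = run_simulation([Agent() for _ in range(3)], steps=5)
```

The natural distribution point is phase 3: each node can own a subset of agents, but it must see all actions from the step, which forces a synchronization barrier between phases.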
Depending on the framework used to create the agent-based simulation, the computations needed to run the simulation can be distributed over multiple nodes. The framework used to create the distributed simulation in [1] is the PanSim + Sim-2APL framework developed at the University of Virginia. This framework uses a network of computation nodes to simulate a scenario. [1]
Input: a fixed number of agents, a model which describes the agents' behaviors, and a simulation scenario.
Output: statistics about the simulation of the given scenario.
Link 1: https://parantapa.net/mypapers/bhattacharya-emas21.pdf https://link.springer.com/chapter/10.1007/978-3-030-77964-1_32 http://cecosm.yolasite.com/resources/Accepted_Scientometrics_ABM_Website.pdf
Co-simulation is a technique which makes use of sub-systems to simulate the behavior of a real-world component, such as a braking system or the motor of a car. The types of these sub-systems can range from embedded systems to automotive industry systems.
During simulation, all the included sub-systems run concurrently and communicate with each other in order to simulate a unified whole, e.g. a car.
The paper “ProMECoS: A Process Model for Efficient Standard-Driven Distributed Co-Simulation”[1] describes how they have designed a distributed co-simulation tool. This tool uses a master-slave architecture where the master initializes all slave systems at the start of a simulation. Hereafter, the master only provides external features, such as a global clock for the system and any other external actors that are not simulated by any of the sub-systems.
The slaves are either the real sub-systems or they simulate the behavior of the sub-systems. A slave can communicate with other slaves or with the master using the “Distributed Co-Simulation Protocol” (DCP); this protocol supports the communication protocols used by the sub-systems. [1]
The result of a simulation depends on the data that was logged while the simulation was running.
Slave inputs: The master sends a DCPX file, at startup, to all slaves. Each slave has an XML file which describes the structure of the slave, such as elements and attributes, supplementary assertions, and constraints.
Slaves output: A slave can directly update its own variables during the simulation. Through communication the slave can inform the other slaves of changes to the overall system state.
Master output: Sends updates to the slaves, via DCPX files.
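A heavily simplified sketch of this master-slave structure — not the actual DCP protocol; the class names, the config dict standing in for the DCPX file, and the message format are all invented for illustration:

```python
# Toy master-slave co-simulation loop: the master initializes the slaves,
# then only advances a global clock; slaves step each tick and exchange
# state changes, which are delivered on the following tick.

class Slave:
    def __init__(self, name):
        self.name = name
        self.variables = {}

    def initialize(self, config):
        self.variables.update(config)   # stands in for the DCPX start-up file

    def step(self, t, messages):
        # Update own variables directly; report changes for the other slaves.
        self.variables["t"] = t
        return {self.name: dict(self.variables)}

class Master:
    def __init__(self, slaves):
        self.slaves = slaves

    def run(self, ticks, config):
        for s in self.slaves:
            s.initialize(config)              # start-up phase
        messages = {}
        for t in range(ticks):                # master provides the global clock
            outgoing = {}
            for s in self.slaves:
                outgoing.update(s.step(t, messages))
            messages = outgoing               # delivered next tick
        return messages

final = Master([Slave("brakes"), Slave("motor")]).run(ticks=3, config={"dt": 0.1})
```

The one-tick delay in message delivery is a common co-simulation simplification; the real DCP defines much finer-grained state machines and transports.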
Primary source: [1] https://www.mdpi.com/2079-9292/10/5/633/htm Link: https://github.com/modelica/DCPTester
Other Links https://projects.au.dk/into-cps/academia/co-simulation/ https://www.mscsoftware.com/product/co-simulation https://abaqus-docs.mit.edu/2017/English/SIMACAEANLRefMap/simaanl-c-cosimulationover.htm
How is the problem defined? AI training, simulation, BIG DATA, etc.
What is provided in order to solve the problem (data, code)?
What does the output look like?
How is the result computed? Is it an executable, is it code that has to be compiled?
How dependent is the computation on external factors, such as other processes, intermediate results, or shared memory?
Which classes should we focus on? Do some of them have something in common that makes it obvious for us to focus on them rather than others? Are some classes too dependent on external factors for us to work with them right now?
Is parallelization the solution?