VILLASframework / node

Connecting real-time power grid simulation equipment
https://fein-aachen.org/projects/villas-node/
Apache License 2.0

Add shared memory node-type #52

Closed: stv0g closed this issue 1 year ago

stv0g commented 7 years ago

This new node-type should use a POSIX shared memory segment (see shm_overview(7)) and pthreads(7) to exchange samples between external processes and VILLASnode.

(Pthreads synchronisation primitives (mutexes & condition variables) are actually usable between processes, as long as they are initialised correctly.)
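For reference, this is how process-shared pthread primitives are typically initialised inside a shared memory segment (a minimal sketch using only standard POSIX calls, not VILLASnode code):

#include <pthread.h>

/* This struct would live inside the shared memory segment,
 * visible to both processes. */
struct shared_sync {
  pthread_mutex_t mtx;
  pthread_cond_t cv;
};

int shared_sync_init(struct shared_sync *s)
{
  pthread_mutexattr_t ma;
  pthread_condattr_t ca;
  int ret;

  /* PTHREAD_PROCESS_SHARED is what makes the primitives usable
   * across process boundaries. */
  pthread_mutexattr_init(&ma);
  pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);

  pthread_condattr_init(&ca);
  pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);

  ret = pthread_mutex_init(&s->mtx, &ma) || pthread_cond_init(&s->cv, &ca);

  pthread_mutexattr_destroy(&ma);
  pthread_condattr_destroy(&ca);

  return ret;
}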

The first use case for this new node-type might be DPsim.

Why not use a simpler IPC method?

There are plenty of other IPC methods on Linux that are easier to use, e.g. pipes, UNIX domain sockets, or message queues.

The main problem with them is that they require work from the kernel to function: every IPC operation causes a system call and a switch into kernel mode.

In contrast, when using shared memory we stay completely in user space and avoid the latency caused by kernel/user-space context switches.

stv0g commented 7 years ago

This is a write-up of some of the internals of VILLASnode which are important for implementing a shared memory node-type.

VILLASnode uses three main data structures to store and forward samples:

  1. struct sample stores a sample of simulation data, describing the interface at one specific point in time by a set of values. It therefore contains a couple of timestamps plus an array of floating point or integer values. It uses reference counting to keep track of which node, path or hook is still using the sample before it is freed.
  2. struct pool is used as a memory pool for fixed-size allocations. Every sample is stored in such a fixed-size allocation. When the reference count of a sample reaches 0, it is returned to the pool it belongs to.
  3. struct queue implements a multiple-producer / multiple-consumer queue. It is primarily used to hold pointers to struct sample. Receiving nodes push new samples to the end of the queue; sending nodes dequeue samples from the beginning. (In reality there are more actors which manipulate the queue(s).)
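For illustration, a much simplified sketch of how these three structures could relate (field and type names are abbreviations, not the actual VILLASnode definitions):

#include <stdatomic.h>
#include <time.h>

struct pool;   // memory pool handing out fixed-size blocks, one per sample
struct queue;  // multiple-producer / multiple-consumer queue of sample pointers

struct sample {
  atomic_int refcnt;    // returned to its pool when this drops to 0
  struct pool *pool;    // the fixed-size pool this sample was allocated from
  struct timespec ts;   // one of several timestamps
  int length;           // number of values below
  double values[];      // array of data values at this point in time
};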

For a shared memory node, we would ideally expose those data structures to the external process (e.g. DPsim), so that this process can manipulate them directly, without the need for additional processing / copying of data. This is only possible because both struct pool and struct queue are thread-safe.

The main challenge in making this possible is the proper allocation of memory for those data structures. Currently, we don't use malloc() & free() for that, but specially mapped huge pages (see the memory_*() functions). In order to expose the queues and memory pools to the external process, we would need to guarantee that they are allocated in one big contiguous region of memory, to avoid having to create multiple shared memory regions.
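To make that concrete, a sketch of what such a hugepage-backed allocation boils down to (assuming the memory_*() functions are roughly a wrapper around a call like this; Linux-specific):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map one big contiguous, hugepage-backed region; all pools and
 * queues of a node would then be carved out of this single region. */
void *alloc_hugepage_region(size_t len)
{
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

  return (p == MAP_FAILED) ? NULL : p;
}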

stv0g commented 7 years ago

In GitLab by @georg.reinke on Mar 27, 2017, 11:15

Okay, I looked into this a bit. The easy part would be the general structure:

  1. For each shmem node, allocate a struct pool and two struct queues (one per direction), ensuring that everything they allocate is in the same huge page.

  2. Expose this huge page via the shmem interface. node_read and node_write would just read or write samples from the respective queue, and the external program can just do the same on the other respective queue.

As you said, the tricky part would be to ensure that all allocated memory of the shared struct is in the same huge page. I would propose to create a new memtype that only allocates from a given huge page to do this. If we pass this memtype to pool_init and queue_init, there's no code in pool or queue that needs to be changed. The downside is that we would have to implement some kind of allocator for this memtype (even though a very simple allocator would suffice, since as far as I can tell, both structs only do a single alloc at initialisation). Also, this memtype obviously needs some kind of state, so we need to pass a pointer to it to the alloc / free functions and remove the const qualifier from most uses of memtype.
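A minimal sketch of what such a stateful memtype with a trivial bump allocator could look like (all names here are made up for illustration; the actual struct memtype in the codebase may differ):

#include <stddef.h>
#include <stdint.h>

struct memtype;

typedef void * (*mem_alloc_t)(struct memtype *m, size_t len);
typedef int    (*mem_free_t)(struct memtype *m, void *ptr, size_t len);

struct memtype {
  const char *name;
  mem_alloc_t alloc;
  mem_free_t free;
  char *cur;   // allocator state: next free byte inside the huge page
  char *end;   // end of the huge page, for bounds checking
};

/* Bump allocator: hand out consecutive, aligned chunks from one fixed
 * region. A single alloc per pool/queue at init time is all we need. */
static void * managed_alloc(struct memtype *m, size_t len)
{
  uintptr_t cur = ((uintptr_t) m->cur + 15) & ~(uintptr_t) 15; // 16-byte alignment

  if (cur + len > (uintptr_t) m->end)
    return NULL; // region exhausted

  m->cur = (char *) (cur + len);
  return (void *) cur;
}

static int managed_free(struct memtype *m, void *ptr, size_t len)
{
  (void) m; (void) ptr; (void) len;
  return 0; // individual frees are no-ops for a bump allocator
}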

stv0g commented 7 years ago

Hi Georg,

For each shmem node, allocate a struct pool and two struct queues (one per direction), ensuring that everything they allocate is in the same huge page.

Do you mean the struct pool and struct queue which are members of struct path? Or do you want to add new queues and pools?

I would propose to create a new memtype that only allocates from a given huge page to do this.

I agree. This memtype could be a wrapper around the existing memtype_hugepage.

Also, this memtype obviously needs some kind of state [...]

Currently, we declare the memory types as global declarations of type struct memtype at the end of lib/memory.c.

So instead of using one of the constant global declarations, you could store the struct memtype (including the allocator state) inside the new node-type.

Btw. the memtype_dma was planned to be used for DMA-capable memory allocations. DMA memory must reside in a fixed memory area and therefore has the same design goal as the memory type which we need for the shared memory allocations.

For both we have a fixed region from which we can allocate memory. And both need a simple allocator.

I propose that you add a new memtype + allocator which does that.


Btw. I am almost done pushing and merging my latest changes to the develop branch. It would be great if you could wait until I have completed this. :-)

Are you working in the EONERC building? I am planning to visit Aachen this or next week. Maybe we can sit together and I explain some of the VILLASnode code / features.

stv0g commented 7 years ago

In GitLab by @georg.reinke on Mar 27, 2017, 19:13

Or do you want to add new queues and pools?

I guess new queues and pools would be the simplest solution; directly exposing the ones of struct path would probably be messy, since then the node needs a reference to the path it belongs to. Hopefully, the additional copy won't be a performance issue.

[...] I propose that you add a new memtype + allocator which does that.

That's exactly what I had in mind.

Btw. I am almost done, pushing and merging my latest changes to the develop branch. It would be great if you could wait until I completed this. :-)

Um, I already started to work on it based on the eric-lab branch (since develop didn't seem to compile). I hope this won't create issues when merging. b2ed08c2 is where I got to this morning.

Are you working in the EONERC building? I am planning to visit Aachen this or next week. Maybe we can sit together and I explain some of the VILLASnode code / features.

Yeah, I'm usually in the GGE CIP pool. My schedule is pretty flexible, so just tell me when it suits you best.

stv0g commented 7 years ago

I guess new queues and pools would be the simplest solution

I agree. Originally, I planned it the other way. But I think we should prefer simplicity this time.

Um, I already started to work on it based on the eric-lab branch (since develop didn't seem to compile).

Okay, the eric-lab branch is fine. It should merge without conflicts into mine.

stv0g commented 7 years ago

In GitLab by @georg.reinke on Mar 29, 2017, 17:50

It just occurred to me while reading up on the POSIX shared memory API that the data structures stored in the shared memory area can't use normal pointers, since the memory area is generally mapped to different virtual addresses in the different processes. So we'd have to modify the pointers inside struct pool and struct queue to be stored as offsets if we want them to be shared across processes. Would you be ok with that?

stv0g commented 7 years ago

Oh, good point. Yes, I think we have to use offsets.

Or alternatively, the mmap() call which maps the shared segment allows you to propose a fixed mapping address. If you use the same mapping address in both processes, it should work.

In this case, we would need to pass the shared memory identifier and the mapping address to the DPsim process.
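For reference, the attach side would then boil down to the standard POSIX calls below (a sketch; the object name, size and proposed address would come from VILLASnode):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

void *shm_attach(const char *name, size_t len, void *proposed_addr)
{
  int fd = shm_open(name, O_RDWR, 0);
  if (fd < 0)
    return NULL;

  /* Without MAP_FIXED the address is only a hint; the kernel may pick
   * a different one, so the caller must check the returned pointer. */
  void *p = mmap(proposed_addr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  close(fd); // the mapping stays valid after the descriptor is closed

  return (p == MAP_FAILED) ? NULL : p;
}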

stv0g commented 7 years ago

Here is a proposal for the configuration for the new shmem node:

nodes = {
    shmem_test_node = {
        type = "shmem",

        id = "shared_memory_id_for_dpsim",

        exec = "dpsim-executable",
        args = [ "-c", "/etc/dpsim/config" ],
        queuelen = 512 /* should be optional */
    }
}

I think VILLASnode could start DPsim by using the exec and args settings.

We could either pass the shared memory id and the mapping address via additional command line arguments to DPsim, or use something configurable in the config.

stv0g commented 7 years ago

In GitLab by @georg.reinke on Mar 29, 2017, 18:48

The configuration sample looks good to me.

Or alternatively, the mmap() call which maps the shared segment allows you to propose a fixed mapping address.

That's what I thought of at first, too, but we'd have to somehow determine a virtual memory region that is free for both processes. I don't know how we could do that easily since we can't really influence where other shared libraries might be mapped.

stv0g commented 7 years ago

According to this article, there is something called the memory mapping segment in the virtual memory layout of a process, from which all `mmap()` regions are allocated.

However, due to ASLR this segment can be located anywhere :-S

I think we should go for offsets. But these offsets should be relative to the start of the shared memory mapping, which is always page-aligned.

stv0g commented 7 years ago

How do we determine the size of the shared memory region?

stv0g commented 7 years ago

In GitLab by @georg.reinke on Mar 29, 2017, 19:40

But these offsets should be relative to the shared memory mapping which is always aligned to a page size.

What would these offsets be then for structs not in the shared region? I was thinking of offsets relative to the struct itself, like this:

struct queue {
  size_t buf_off; // offset of the buffer relative to the struct itself, instead of void *buf
  // ...
};

void queue_somefunc(struct queue *q /* ... */) {
  void *buf = (char *) q + q->buf_off;
  // use buf like q->buf before
}

How do we determine the size of the shared memory region?

I guess we should calculate the total size of everything that needs to be shared based on the configured queue size.
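A sketch of that calculation (generic; the actual header and per-sample sizes depend on the final struct layout):

#include <stddef.h>
#include <unistd.h>

/* Total bytes for the shared region: a header (queues, pool metadata),
 * two queues of sample pointers and the pool backing the samples,
 * rounded up to whole pages for mmap(). */
size_t shmem_total_size(size_t hdr_sz, size_t queuelen, size_t sample_sz)
{
  size_t sz = hdr_sz
            + 2 * queuelen * sizeof(void *)  // two queues of pointers
            + queuelen * sample_sz;          // the samples themselves

  size_t pgsz = (size_t) sysconf(_SC_PAGESIZE);

  return (sz + pgsz - 1) & ~(pgsz - 1);
}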

stv0g commented 7 years ago

In GitLab by @georg.reinke on Apr 6, 2017, 12:46

As you saw, I already pushed my current work to the node-shm branch. Basic data exchange using the simple client and configuration file is already working (both with and without condition variables, though the CPU usage is obviously high without them).

Todo:

stv0g commented 7 years ago

Hi Georg,

Possibly use huge pages for the shared memory region (based on a configuration option?)

I would make this the default. We don't have to save memory.

Maybe split the CV-extended queue into a wrapper structure and use this everywhere we're currently using queues with CVs.

:+1:

[...] just linking against libvillas.so is probably the easiest solution.

Unfortunately, this is not an option, because libvillas.so links against several LGPLv2 libraries. DPsim is likely to be licensed under a more permissive license like MIT or Apache. As MIT / Apache is not compatible with LGPLv2, we cannot link libvillas.so with DPsim. My solution to this problem would be a second library which only includes the pool, sample and queue data structures. Those are developed by ourselves, so we can apply two licenses (MIT + LGPLv2) to them.

I know this is annoying. And it wasn't my idea to care so much about the licensing :-(

How to pass configuration options.

I would avoid using the configuration file. We've already parsed and processed the configuration options in VILLASnode. We don't have to do that again. I would just pass the struct shm via shared memory.

Another problem is that libconfig, which we are using to parse the config, is LGPLv2-licensed. I don't think we can link it against MIT code? I am not 100% sure here.

How to start the external program in a synchronized way

:+1: I agree. That will be important at some later point in time.

stv0g commented 7 years ago

In GitLab by @georg.reinke on Apr 7, 2017, 11:58

Apparently, huge pages only work for anonymous mappings, so we can't use them for the shared memory region.

stv0g commented 7 years ago

In GitLab by @georg.reinke on Apr 7, 2017, 12:44

Are you sure about the licensing stuff? AFAIK, even proprietary programs may link against (unmodified) LGPL libs.

stv0g commented 7 years ago

In GitLab by @georg.reinke on Apr 12, 2017, 13:29

I thought about the shutdown problem some more. If we want to do this properly, I don't see a way around extending struct queue with a new method like queue_close that causes any calls to queue_read in other threads to return immediately. I propose the following semantics for this:

  1. Reading on closed queues returns any items still in the queue; reading on closed, empty queues fails immediately. Any blocked read calls (by definition only possible if the queue is empty) also fail if the queue is closed.

  2. Writing to a closed queue fails immediately. (A single write is atomic regarding closing, writing multiple items may fail in the middle if the queue is closed)

  3. Closing an already closed queue fails immediately.

This should be relatively easy to implement with an additional atomic flag. We furthermore have to add a similar method to queue_signalled that additionally signals the CV.
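A sketch of these semantics with a C11 atomic flag (simplified; the actual struct queue's ring buffer and indices are omitted here):

#include <stdatomic.h>
#include <stdbool.h>

struct queue {
  atomic_bool closed;
  /* ... ring buffer, head/tail indices ... */
};

int queue_close(struct queue *q)
{
  bool was_closed = atomic_exchange(&q->closed, true);
  return was_closed ? -1 : 0; // closing twice fails (point 3)
}

int queue_push(struct queue *q, void *ptr)
{
  if (atomic_load(&q->closed))
    return -1; // writing to a closed queue fails (point 2)

  /* ... normal enqueue ... */
  return 1;
}

/* queue_pull() would analogously keep returning remaining items and
 * only fail once the queue is both closed and empty (point 1). */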

stv0g commented 7 years ago

In GitLab by @georg.reinke on Apr 12, 2017, 13:50

Okay, for some reason I was under the impression that queue_read busy-waits when I wrote the above. This solves the problem for normal queues, but for queue_signalled we still have to signal the CV on close. Then the closed flag can just be put in shmem_shared, and we don't need to modify the queue code.
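That close path could then look roughly like this (a sketch with hypothetical field names; the flag lives in shmem_shared, the CV belongs to the signalled queue):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct queue_signalled {
  pthread_mutex_t mtx;
  pthread_cond_t cv;
  /* ... the underlying queue ... */
};

struct shmem_shared {
  atomic_bool closed;   // set on shutdown, checked by readers after waking up
  struct queue_signalled in, out;
};

void shmem_close(struct shmem_shared *s)
{
  atomic_store(&s->closed, true);

  /* Wake up any reader blocked in pthread_cond_wait() so it can
   * observe the closed flag and return. */
  pthread_mutex_lock(&s->in.mtx);
  pthread_cond_broadcast(&s->in.cv);
  pthread_mutex_unlock(&s->in.mtx);
}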

stv0g commented 7 years ago

We already have an enum state in the queue. Can we use STATE_STARTED and STATE_STOPPED for this? Together with queue_(signalled_)?{start,stop,init,destroy}()?

Then it's more consistent with the remaining objects and data structures.

stv0g commented 7 years ago

In GitLab by @georg.reinke on Apr 13, 2017, 16:10

I already implemented proper shutdown in 73b44117 without any changes to queue. We don't actually need a new queue_close function for this.

stv0g commented 7 years ago

Hi @georg.reinke,

I am just wondering, do you plan to create a merge request for that? Or are you already done with the feature?

stv0g commented 7 years ago

In GitLab by @georg.reinke on Apr 15, 2017, 15:40

I forgot to push the latest changes. I'm basically done implementation-wise, but I was going to look over everything again before sending the merge request (for example, making sure that libext doesn't link against any other libs).

stv0g commented 7 years ago

:+1: I am looking forward to merge it :-)

stv0g commented 7 years ago

Hi @georg.reinke,

FYI, I pushed some commits to shm-node.

I am also trying to reuse code between lib/nodes/shmem.c and lib/shmem.c, so there will be some more commits today / tomorrow.

stv0g commented 7 years ago

Closed via merge request !16.