Introduce ports to handle tasks like device scan in a generic way

b0661 commented 2 years ago

Currently there are special functions to scan for devices on serial and can ports. In my case the devices are connected by SPI. So I would have to add another special function and do some code duplication.

Instead introduce ports that have general functions to scan for devices that are connected to the ports and also for other functions. The ports may be modeled (similar to devices) with a generic structure that can be filled with the special functions for a specific type of port. The structure should ideally be constant to not consume precious RAM in constrained devices.

The concept of port may be not only applied to downstream links but also to upstream links.
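
A minimal sketch of what such a constant port table could look like in C. All names here (`struct port_api`, `spi_scan`, `can_scan`, `scan_all_ports`) are hypothetical and not taken from the firmware, and the scan functions only return canned results to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port-type operations; names are illustrative only. */
struct port_api {
    const char *name;
    /* Scan the port and write up to max_ids device addresses to ids.
     * Returns the number of devices found. */
    size_t (*scan)(uint8_t *ids, size_t max_ids);
};

/* Stand-in SPI scan: pretends one device answers on chip select 0. */
static size_t spi_scan(uint8_t *ids, size_t max_ids)
{
    if (max_ids < 1) {
        return 0;
    }
    ids[0] = 0;
    return 1;
}

/* Stand-in CAN scan: pretends two nodes answered. */
static size_t can_scan(uint8_t *ids, size_t max_ids)
{
    static const uint8_t found[] = { 0x0A, 0x14 };
    size_t n = 0;
    while (n < sizeof(found) && n < max_ids) {
        ids[n] = found[n];
        n++;
    }
    return n;
}

/* Constant table: typically placed in flash/ROM by embedded toolchains. */
static const struct port_api ports[] = {
    { "spi", spi_scan },
    { "can", can_scan },
};

/* Generic scan over all ports, without any port-specific code. */
size_t scan_all_ports(uint8_t *ids, size_t max_ids)
{
    size_t total = 0;
    for (size_t i = 0; i < sizeof(ports) / sizeof(ports[0]); i++) {
        total += ports[i].scan(ids + total, max_ids - total);
    }
    return total;
}
```

Adding a new link type then means adding one entry to `ports` instead of another special scan function in the calling code.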

martinjaeger commented 2 years ago

That sounds like a very good idea. I could imagine the following different ports:

Downstream (device side)

UART
CAN
SPI
I2C
LoRaWAN (mainly pub/sub, very high latency request/response possible)

Upstream (host/cloud side)

HTTP/CoAP (request/response only)
MQTT (mainly pub/sub, but also request/response possible)

As a long-term goal I would also like to port this firmware to Zephyr, as the ESP32 support in Zephyr has become much better recently (which was not the case when we started writing this firmware). I'm wondering if Zephyr would provide some better features than ESP-IDF in order to generalize these ports (e.g. as modules).

Using an approach with Zephyr modules would potentially allow implementing an upstream port directly in a device if it has IoT connectivity on board, reusing the same code.

Gretel5X commented 2 years ago

I like the idea, let's clarify some aspects for me so I could try to implement something. The "special function" will be special and has to be implemented for every downstream, there is not much room for generic stuff (UART is not a bus, we don't know how many devices there are on CAN etc.). What you want is much like the TSDevice struct with e.g. the function pointer char *(*send)(void *req, uint32_t query_size, uint8_t CAN_Address, uint32_t *block_len) a struct to hold generic functions like scanning for a port aka downstream connection, right?

Should it be possible to add a port afterwards, without reflashing? For now we decided that changing pins etc is not possible via the webconfig, this would require some more work to change.

The structure should ideally be constant to not consume precious RAM in constrained devices.

Just out of curiosity, what other device do you have in mind, this is very much tailored for the esp32?

b0661 commented 2 years ago

The concept behind evolved a little bit since I wrote that. It is going in the direction of a ThingSet device mesh or tree.

The idea is to replace the active scan operation with passive monitoring. Every ThingSet device shall therefore send a periodic heartbeat statement. This way you do not have to issue a scan but can detect the devices by their heartbeat statements. This also works for devices that are attached to a CAN / RS485 / RS232 / ... bus.

To make devices identifiable, each device has its own unique device id. The object path definition is extended by the device id.

In a mesh or tree topology you have to route messages from one port to another. This should be done without any extra copying of data. Zephyr provides network buffers for that. In the concept, struct ts_mesh_buf buffers are in fact Zephyr network buffers. Ports shall work on these buffers.

The port structure definition in the current concept:

/**
 * @brief A ThingSet communication port.
 *
 * Runtime port structure (in ROM) per port instance.
 */
struct ts_port {
    struct ts_port_info_elem *info;
    size_t info_count;

    int (*open)(const struct ts_port *port);

    int (*close)(const struct ts_port *port);

    /**
     * @brief Get transmission throughput.
     *
     * @return Throughput in bit/s.
     */
    uint32_t (*send_throughput_bit_s)(void);

    /**
     * @brief Receive message on port.
     *
     * @param[in] port Port to receive at.
     * @param[in] msg Pointer to message buffer to be used to receive.
     * @param[in] callback_on_received If callback is NULL receive returns on
     *                the next message. If the callback is set receive
     *                immediately returns and the callback is called on the
     *                reception of the message. Beware even in this case the
     *                callback may be called before the receive function
     *                returns.
     * @param[in] timeout_ms Maximum time to wait in milliseconds.
     */
    int (*receive)(const struct ts_port *port, struct ts_mesh_buf *msg,
                   int (*callback_on_received)(const struct ts_port *port,
                                           const ts_device_id_t *hop_device_id,
                                           struct ts_mesh_buf *msg),
                   uint32_t timeout_ms);

    /**
     * @brief Transmit message on port.
     *
     * @param[in] port Port to send at.
     * @param[in] hop_device_id Device ID of next hop to send the message to.
     * @param[in] msg Pointer to message buffer to be sent.
     * @param[in] callback_on_sent If callback is NULL send returns after the
     *                         transmission of the message. If the callback is
     *                         set send immediately returns and the callback is
     *                         called after the transmission of the message.
     *                         Beware even in this case the callback may be
     *                         called before the send function returns.
     * @param[in] timeout_ms Maximum time to wait in milliseconds.
     */
    int (*transmit)(const struct ts_port *port,
                const ts_device_id_t *hop_device_id,
                struct ts_mesh_buf *msg,
                int (*callback_on_sent)(const struct ts_port *port,
                                        const ts_device_id_t *hop_device_id,
                                        struct ts_mesh_buf *msg),
                uint32_t timeout_ms);
};
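
A minimal sketch of how one constant port instance could be defined, using a reduced copy of the structure above. The stand-in types (`ts_device_id_t` as a `uint64_t`, a fixed-size `struct ts_mesh_buf`), the `ts_port_cb` typedef and the loopback port itself are assumptions made only to keep the example self-contained; they are not part of the concept code.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for types the concept takes from elsewhere (assumptions). */
typedef uint64_t ts_device_id_t;
struct ts_mesh_buf {
    uint8_t data[32];
    size_t len;
};

struct ts_port;
typedef int (*ts_port_cb)(const struct ts_port *port,
                          const ts_device_id_t *hop_device_id,
                          struct ts_mesh_buf *msg);

/* Reduced version of the port structure: open, close and transmit only. */
struct ts_port {
    int (*open)(const struct ts_port *port);
    int (*close)(const struct ts_port *port);
    int (*transmit)(const struct ts_port *port,
                    const ts_device_id_t *hop_device_id,
                    struct ts_mesh_buf *msg,
                    ts_port_cb callback_on_sent,
                    uint32_t timeout_ms);
};

static int loop_open(const struct ts_port *port) { (void)port; return 0; }
static int loop_close(const struct ts_port *port) { (void)port; return 0; }

/* The "transmission" completes instantly, so the callback runs before
 * transmit() returns. Callers have to tolerate that ordering. */
static int loop_transmit(const struct ts_port *port,
                         const ts_device_id_t *hop_device_id,
                         struct ts_mesh_buf *msg,
                         ts_port_cb callback_on_sent,
                         uint32_t timeout_ms)
{
    (void)timeout_ms;
    if (callback_on_sent != NULL) {
        return callback_on_sent(port, hop_device_id, msg);
    }
    return 0;
}

/* const instance: the port description itself needs no RAM. */
const struct ts_port loopback_port = {
    .open = loop_open,
    .close = loop_close,
    .transmit = loop_transmit,
};

/* Example callback that just counts completed transmissions. */
int sent_count;
int count_sent(const struct ts_port *port,
               const ts_device_id_t *hop_device_id,
               struct ts_mesh_buf *msg)
{
    (void)port;
    (void)hop_device_id;
    (void)msg;
    sent_count++;
    return 0;
}
```

Calling `loopback_port.transmit(&loopback_port, &hop, &msg, count_sent, 100)` increments `sent_count` before the call returns, which is the case the "Beware" note in the comments warns about.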

Should it be possible to add a port afterwards, without reflashing? For now we decided that changing pins etc is not possible via the webconfig, this would require some more work to change.

No. I can imagine that a physical uC port (eg. serial) may become several predefined ports - e.g. RS 485, RS 232, I2C. These ports may be activated (open()) / deactivated (close()).

Just out of curiosity, what other device do you have in mind, this is very much tailored for the esp32?

It is a general concept. See https://github.com/LibreSolar/thingset-device-library/issues/13

Gretel5X commented 2 years ago

The idea is to replace the active scan operation with passive monitoring. Every ThingSet device shall therefore send a periodic heartbeat statement.

I can totally see how that makes sense if the ESP is used simply as a gateway, but if we want to display and configure connected devices we need to "know" what devices are connected. Otherwise you constantly have to match ALL incoming heartbeats against the list of known devices, or do you know a better way? How often is this heartbeat sent? Maybe the overhead is rather low...?

martinjaeger commented 2 years ago

Probably a heartbeat once every second would be enough. The updated CAN interface does already have a similar method to detect a new device. It listens to received publication messages and if it receives one from a device it doesn't have in its list it adds it to the devices array. The next time the web interface requests the device list, the ESP will ask the device with that CAN address for further information (like Device ID etc.).

A similar thing could be done on the serial interface. And if no message is received for e.g. 5 seconds, the ESP could assume that the device is disconnected and remove it from the list.

b0661 commented 2 years ago

How often is this heartbeat sent, maybe the overhead is rather low...? Probably a heartbeat once every second would be enough.

The lowest guessed throughput is that of LoRaWAN with ~12 bytes/second. My rule of thumb would be 1% of throughput for heartbeats, i.e. 0.12 bytes/second on such a link. In the current concept a full-fledged heartbeat message is about 20 bytes, which gives one heartbeat roughly every three minutes (20 / 0.12 ≈ 167 s).
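
The rule of thumb can be written down as a small helper. This is only a sketch of the arithmetic; the function name `heartbeat_period_ms` and its parameters are made up for illustration.

```c
#include <stdint.h>

/* Shortest heartbeat period in milliseconds that keeps heartbeat traffic
 * within budget_percent of the link throughput.
 * Returns 0 for invalid input (zero throughput or zero budget). */
uint32_t heartbeat_period_ms(uint32_t throughput_bytes_s,
                             uint32_t heartbeat_bytes,
                             uint32_t budget_percent)
{
    if (throughput_bytes_s == 0 || budget_percent == 0) {
        return 0;
    }
    /* period [s] = size / (throughput * budget / 100), scaled to ms */
    uint64_t num = (uint64_t)heartbeat_bytes * 100u * 1000u;
    uint64_t den = (uint64_t)throughput_bytes_s * budget_percent;
    uint64_t period = num / den;
    return (period > UINT32_MAX) ? UINT32_MAX : (uint32_t)period;
}
```

With the numbers from above, `heartbeat_period_ms(12, 20, 1)` yields 166666 ms, a little under three minutes. On a link with 2000 bytes/second the same 1% budget allows one heartbeat per second.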

Anyway the period is configurable and there may be intelligent methods to shrink the size of a single heartbeat statement, especially on low throughput links. The period information is part of the heartbeat message. The receiving device can adjust the timeout for loss of device to this given period.

In the concept the heartbeat is only sent to direct neighbours. If a neighbour has several ports, it converts the heartbeat statement to a neighbour announce statement and passes it on to the other ports. I have to rethink that; maybe it is better to have a port-throughput-specific neighbour announce period, to automatically adjust the neighbour announce period to low throughput ports and at the same time allow for higher heartbeat rates at high throughput ports. Thank you for the question.

you constantly have to match ALL incoming heartbeats against the list of known devices or do you know a better way?

This is the way it works. There is a local device table that holds information about the known devices.
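
A minimal sketch of such a device table, assuming a small fixed-size array and a loss timeout of three announced heartbeat periods. The table size, the factor of three and all names (`on_heartbeat`, `expire_devices`, `device_count`) are assumptions for illustration, not part of the concept.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVICES     8  /* assumed table size */
#define TIMEOUT_PERIODS 3  /* assumed: lost after 3 missed heartbeats */

struct device_entry {
    uint64_t id;            /* unique device id */
    uint32_t last_seen_ms;  /* time of the last heartbeat */
    uint32_t period_ms;     /* period announced in the heartbeat */
    bool used;
};

static struct device_entry table[MAX_DEVICES];

/* Record a heartbeat.
 * Returns 1 for a new device, 0 for a known one, -1 if the table is full. */
int on_heartbeat(uint64_t id, uint32_t period_ms, uint32_t now_ms)
{
    struct device_entry *free_slot = NULL;
    for (size_t i = 0; i < MAX_DEVICES; i++) {
        if (table[i].used && table[i].id == id) {
            table[i].last_seen_ms = now_ms;
            table[i].period_ms = period_ms;
            return 0;
        }
        if (!table[i].used && free_slot == NULL) {
            free_slot = &table[i];
        }
    }
    if (free_slot == NULL) {
        return -1;
    }
    *free_slot = (struct device_entry){ id, now_ms, period_ms, true };
    return 1;
}

/* Remove devices whose heartbeat is overdue. Returns the number removed. */
size_t expire_devices(uint32_t now_ms)
{
    size_t removed = 0;
    for (size_t i = 0; i < MAX_DEVICES; i++) {
        uint32_t age_ms = now_ms - table[i].last_seen_ms;
        if (table[i].used && age_ms > table[i].period_ms * TIMEOUT_PERIODS) {
            table[i].used = false;
            removed++;
        }
    }
    return removed;
}

/* Number of devices currently considered present. */
size_t device_count(void)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_DEVICES; i++) {
        n += table[i].used ? 1 : 0;
    }
    return n;
}
```

Because the timeout is derived from the period carried in each heartbeat, a device on a fast link is dropped after a few seconds of silence while a device announcing a three-minute period is kept for several minutes. The matching itself is a linear search, which seems acceptable for the small device counts discussed here.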

b0661 commented 2 years ago

@martinjaeger do you know a good way to discuss the ThingSet Mesh concept? I have a w.i.p. concept description and some very rudimentary code.

EDIT: Please see the actual ThingSet Mesh concept.

martinjaeger commented 2 years ago

Ok, lots of stuff to understand... didn't know the B.A.T.M.A.N. network before and I don't yet fully understand what it does from a quick look at the docs.

Anyway, some general questions/comments from my side already:

Would mesh networking not be part of ISO/OSI layers 2-3? So far I was considering ThingSet more an application layer protocol with recommendation for some lower layer protocols to integrate seamlessly (e.g. CAN ID layout or LoRaWAN ports). However, if we now put device addressing into the application layer, things may get quite complicated.
There seem to be very large differences in suitable heart beat message send periods... For CAN I was even thinking of something in the range of 100 ms in case the protocol is used for control of parallel power converters for example. For LoRaWAN we are talking about minutes to hours instead of milliseconds. Do we really need a heart beat message for LoRaWAN, given the really low bandwidth? The gateway will realize anyway if a node is still alive once it receives normal messages. Maybe heartbeats should only be sent if no payload is sent (for whatever reason).
The ID 0x00 is currently used as the root node in the firmware, which I think makes sense for device discovery. So it should probably be included in the list of fixed IDs and the heartbeat should get a different one (if required).
Small remark regarding CBOR: Only IDs up to 0x17 (23) and not up to 0x1F (31) are stored in a single byte.
LoRaWAN is specified as a star topology, as far as I know. Does it make sense to apply mesh networking on top of it? Most devices will not be listening anyway during normal operation, so they will also not receive any broadcast messages. Or did I get something wrong here and LoRaWAN was just not the best example for your more generic mesh networking ideas?

martinjaeger commented 2 years ago

Regarding good place to discuss the mesh network: Generally GH issues are probably OK. Alternatively we could use a wiki page on GitHub? Or open a dedicated repo to dump some markdown files with ideas?

b0661 commented 2 years ago

didn't know the B.A.T.M.A.N. network before and I don't yet fully understand what it does from a quick look at the docs.

It has the concept of throughput based routing. This is what I used as a starting point. There are a lot of other nice features but they are not really applicable to the low level mesh the concept is about. The concept is also for links that cannot or do not run Ethernet.

Would mesh networking not be part of ISO/OSI layers 2-3?

Sorry, I did not care about ISO/OSI layers. The primary focus is to have some man/machine issuing statements that are routed to the intended sink, which is not on the same machine and may be several hops away. The connections between these hops may be of different kinds. In my case it is SPI and some proprietary bus. The number of devices in the mesh is assumed to be low (<= 100).

So far I was considering ThingSet more an application layer protocol with recommendation for some lower layer protocols to integrate seamlessly (e.g. CAN ID layout or LoRaWAN ports). However, if we now put device addressing into the application layer, things may get quite complicated.

The translation of a device id to a CAN ID is part of the CAN type mesh port and hidden behind the port API. If you want to make messages routable in a generic way you need a universal addressing scheme, which is the device id in this case. There are surely other addressing schemes, but this one looks like it can easily be translated to the physical buses that are used and to IoT cloud protocols.
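
One way such a translation could be hidden inside a CAN port is a small neighbour table that is filled from received heartbeats. This is only a sketch under assumptions: the table size, the 8-bit bus address and the names `can_port_learn` and `can_port_resolve` are hypothetical and not taken from the concept code.

```c
#include <stddef.h>
#include <stdint.h>

#define CAN_MAX_NODES 4  /* assumed number of neighbours on the bus */

struct can_neighbour {
    uint64_t device_id;  /* universal address used by the mesh */
    uint8_t can_addr;    /* bus-specific address, private to the port */
};

static struct can_neighbour neighbours[CAN_MAX_NODES];
static size_t neighbour_count;

/* Called by the port when a heartbeat from can_addr announces device_id.
 * Returns 0 on success, -1 if the table is full. */
int can_port_learn(uint64_t device_id, uint8_t can_addr)
{
    for (size_t i = 0; i < neighbour_count; i++) {
        if (neighbours[i].device_id == device_id) {
            neighbours[i].can_addr = can_addr;
            return 0;
        }
    }
    if (neighbour_count == CAN_MAX_NODES) {
        return -1;
    }
    neighbours[neighbour_count].device_id = device_id;
    neighbours[neighbour_count].can_addr = can_addr;
    neighbour_count++;
    return 0;
}

/* Resolve the next-hop device id to a bus address.
 * Returns the address (0..255) or -1 if the device is unknown. */
int can_port_resolve(uint64_t device_id)
{
    for (size_t i = 0; i < neighbour_count; i++) {
        if (neighbours[i].device_id == device_id) {
            return neighbours[i].can_addr;
        }
    }
    return -1;
}
```

Code above the port only ever passes device ids; a port for another link type would keep its own table with whatever address format that link needs.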

There seem to be very large differences in suitable heart beat message send periods... For CAN I was even thinking of something in the range of 100 ms in case the protocol is used for control of parallel power converters for example.

I think this is a dual use case. Your application is using the heartbeat statement for some safety reaction. The primary focus in the mesh is to keep topology information up to date. You may well use the same statement for different purposes. The heartbeat statement period is configurable. If your devices are running on the same CAN bus this should not be a problem. If your devices are some hops away there is currently a restriction in the concept (the rate of neighbour announcements is limited to 1% of throughput to prevent congestion). So in this case you have to create your own high frequency safety heartbeat or the concept has to be altered.

For LoRaWAN we are talking about minutes to hours instead of milliseconds. Do we really need a heart beat message for LoRaWAN, given the really low bandwidth?

LoRaWAN was just an example of a low throughput link. I personally do not use LoRaWAN. You may well attach LoRaWAN by a virtual port that acts like a gateway if this is the appropriate solution.

The gateway will realize anyway if a node is still alive once it receives normal messages. Maybe heartbeats should only be sent if no payload is sent (for whatever reason).

Heartbeat statements provide - besides heartbeat - throughput and update period to steer the routing in a mesh topology. If you have a static configuration this is only needed once. If your device jumps from one hop to another you may want to steer the messages to the correct hop it is currently attached to. This is mostly related to wireless connections.

You can and should configure the update rate according to the topology (change) needs.

The ID 0x00 is currently used as the root node in the firmware, which I think makes sense for device discovery. So it should probably be included in the list of fixed IDs and the heartbeat should get a different one (if required).

Sure, do you propose one?

Small remark regarding CBOR: Only IDs up to 0x17 (23) and not up to 0x1F (31) are stored in a single byte.

Thank you, I have to adapt the sequence count roll over.
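
The CBOR size rule can be sketched as follows: unsigned integers from 0 to 23 fit into the initial byte, larger values need 1, 2, 4 or 8 additional bytes. The helper `next_seq` is a hypothetical sequence counter that rolls over at 23 so that it always stays within the single-byte encoding; the actual counter in the concept may look different.

```c
#include <stddef.h>
#include <stdint.h>

/* Encoded size in bytes of an unsigned integer in CBOR (major type 0). */
size_t cbor_uint_size(uint64_t value)
{
    if (value <= 23) {
        return 1;  /* value is carried in the initial byte */
    }
    if (value <= UINT8_MAX) {
        return 2;  /* initial byte + uint8 */
    }
    if (value <= UINT16_MAX) {
        return 3;  /* initial byte + uint16 */
    }
    if (value <= UINT32_MAX) {
        return 5;  /* initial byte + uint32 */
    }
    return 9;      /* initial byte + uint64 */
}

/* Hypothetical sequence counter that wraps from 23 back to 0. */
uint8_t next_seq(uint8_t seq)
{
    return (seq >= 23) ? 0 : (uint8_t)(seq + 1);
}
```

So an ID or counter value of 24 to 31 already costs two bytes on the wire, which is why the roll over has to happen at 23 rather than 31.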

LoRaWAN is specified as a star topology, as far as I know. Does it make sense to apply mesh networking on top of it?

It was taken as a low throughput example. Most probably all LoRaWAN devices will be mesh end nodes without routing capability.

Most devices will not be listening anyway during normal operation, so they will also not receive any broadcast messages. Or did I get something wrong here and LoRaWAN was just not the best example for your more generic mesh networking ideas?

You are right. LoRaWAN devices are a bad example for mesh routing (see above). They were just taken as the low end devices of link bandwidth.

Regarding good place to discuss the mesh network: Generally GH issues are probably OK. Alternatively we could use a wiki page on GitHub? Or open a dedicated repo to dump some markdown files with ideas?

Would you mind creating a 'mesh' branch on the thingset-device-library? This way also source code could be added and finally be tested with different applications. GH discussion issues could be linked to PRs.

b0661 commented 2 years ago

First ideas are now in https://github.com/b0661/thingset-device-library/tree/pr_mesh/src/mesh

martinjaeger commented 2 years ago

Sorry for the late reply, I was busy with lots of other stuff.

As you may have seen, I moved the device library repo to the ThingSet account on GitHub to make it more independent of Libre Solar. I've also created a mesh branch as you suggested. (GitHub will forward any requests to the old repo URL, so you will not get any failures for existing firmware with submodules / west configurations)

In addition to that, I updated the website with the specification. It's now available under https://thingset.io.

Now regarding the mesh part of the protocol: I'm wondering if this should be part of the existing library or if it should be kept as a separate extension:

I'm afraid to make the protocol and the library too difficult to understand. ThingSet is really meant to be simple to understand and use.
Storing the routing tables of all other devices can be difficult for devices with little RAM. Or would this only be required for mesh gateways?
I'm still not fully convinced that message routing should be part of an application layer protocol. I feel that it violates the network protocol layer structure.
Related to that: Do the device IDs have to be globally unique for the mesh protocol? If we want to access a device via multiple hops independent of where it is connected, I guess this would be the case. So should we use UUIDs instead of a string with 8 bytes as it is currently done? This could also be difficult for very low-bandwidth networks. And would we not be re-inventing IPv6?

I'm wondering if we could not use an MQTT broker. Every device communicates with the broker (independent of lower layer transport), so it's possible to exchange messages between all different devices. However, it's not decentralized and not local (w/o internet access) anymore.

In general: In my understanding a mesh is something where devices in a network can directly communicate with each other, potentially via multiple paths. Is that really what you are envisioning? Should simple devices like sensors also interact directly with other sensors? Or are you thinking more of a star-of-stars topology like LoRaWAN? In that case only gateways would need to store the routing table, which would make more sense for IoT applications in my opinion.

BTW: Do you know DDS? It seems to go in a similar direction.

martinjaeger commented 2 years ago

Just discovered this quite interesting project from Eclipse Foundation: https://zenoh.io/ https://www.youtube.com/watch?v=_wAdFHrESY0&ab_channel=EclipseFoundation

Maybe the zenoh.net layer could be leveraged for the routing of ThingSet messages... but I didn't fully understand how their line protocol actually works. I can only find documentation of higher-level APIs for different programming languages.

There is also a Zephyr library already: zenoh-pico.

b0661 commented 2 years ago

A lot of questions - some ideas:

Now regarding the mesh part of the protocol: I'm wondering if this should be part of the existing library or if it should be kept as a separate extension:

It should be part of the existing library, but be activated by a configuration switch (Kconfig in the case of Zephyr). This way the protocol and the mesh protocol extension can stay in sync more easily.

I'm afraid to make the protocol and the library too difficult to understand. ThingSet is really meant to be simple to understand and use.

See above. ThingSet Mesh is an extension that has to be activated. The ThingSet library shall be usable without it. A simple request/ response on a single link can always be done without mesh functionality.

I'm still not fully convinced that message routing should be part of an application layer protocol. I feel that it violates the network protocol layer structure.

ThingSet Mesh sits between the application (which knows nothing about the mesh topology) and the data link layer abstracted by the mesh ports. IMHO it is not an application layer protocol. The application has to provide source/destination information, but is this really the criterion for an application layer protocol?

Do the device IDs have to be globally unique for the mesh protocol?

That is the idea. I already switched to uint64_t device ids in the concept. Maybe this is over-engineered, but a switch to uint32_t or even shorter should be easy.

If we want to access a device via multiple hops independent of where it is connected, I guess this would be the case. So should we use UUIDs instead of a string with 8 bytes as it is currently done?

Yes

This could also be difficult for very low-bandwidth networks.

As you mentioned in one of the other comments, very low-bandwidth networks may be better connected by dedicated gateway ports instead of being a direct node in the mesh. Such a gateway can provide address translation.

And would we not be re-inventing IPv6?

IPv6, 6LoWPAN, ... work on Ethernet frames. ThingSet Mesh directly works on the specific data link protocol like CAN, RS232, ... as used by ThingSet. So one could state it works on ThingSet frames as defined for the specific data link. I would call it re-using concepts already available.

I'm wondering if we could not use an MQTT broker. Every device communicates with the broker (independent of lower layer transport), so it's possible to exchange messages between all different devices. However, it's not decentralized and not local (w/o internet access) anymore.

In my use case I want to route messages from/ to devices that are within multi hop distance without an internet connection. There may be several originators of requests at the same time. This can be done by a local MQTT broker on one of the devices which creates a star topology for messaging. This may reduce the routing table size for devices but may also create a lot more hops for messages to travel. It introduces the complexity of the local MQTT broker and some problems when the connection to the local MQTT broker is broken and you have to bring up a new broker for the newly created subnet.

In my understanding a mesh is something where devices in a network can directly communicate to each other, potentially via multiple paths. Is that really what you are envisioning?

Yes and no. I'm expecting most of the ThingSet Mesh topologies in use to be simple, with only one path and a very limited number of hops.

Should also simple devices like sensors interact directly with other sensors?

This question is more about application than about mesh functionality. From the mesh side sensor devices usually have only one mesh port. These one port devices do not need to implement the full mesh routing capabilities. The routing table may even be reduced to a single default router entry - I am currently looking for a possibility to detect this automatically. Such a one port device could use a stripped down library to save memory.

Or do you more think of a star-of-stars topology like LoRaWAN? In that case only gateways would need to store the routing table, which would make more sense for IoT applications in my opinion.

I'm thinking of three categories of nodes:

One port simple node; e.g. sensors
Multi port router node; e.g. sensors with additional routing capabilities, pure routers
One port gateway node; e.g. gateway to other worlds like LoRaWAN or MQTT or ...

Router nodes hold a routing table. One port nodes may have a simplified routing table or none at all if they only issue statements (which are broadcasts anyway).
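
A minimal sketch of how one lookup function could serve both simple and router nodes, assuming a routing table that consists of explicit entries plus an optional default route. The structures, field names and example tables are hypothetical and only illustrate the idea of reducing a one port node to a single default router entry.

```c
#include <stddef.h>
#include <stdint.h>

struct route {
    uint64_t dest_id;      /* destination device id */
    uint64_t next_hop_id;  /* neighbour to forward to */
    uint8_t port;          /* index of the local port to use */
};

struct routing_table {
    const struct route *routes;         /* explicit entries, may be NULL */
    size_t count;
    const struct route *default_route;  /* fallback, may be NULL */
};

/* Find the route for dest_id; fall back to the default route if set.
 * Returns NULL if the destination is unreachable. */
const struct route *route_lookup(const struct routing_table *rt,
                                 uint64_t dest_id)
{
    for (size_t i = 0; i < rt->count; i++) {
        if (rt->routes[i].dest_id == dest_id) {
            return &rt->routes[i];
        }
    }
    return rt->default_route;
}

/* One port simple node: everything goes to the single neighbour 0x10. */
static const struct route uplink = { 0, 0x10, 0 };
const struct routing_table sensor_table = { NULL, 0, &uplink };

/* Multi port router node: explicit entries, no default route. */
static const struct route router_routes[] = {
    { 0x01, 0x01, 0 },  /* direct neighbour on port 0 */
    { 0x02, 0x20, 1 },  /* reached via neighbour 0x20 on port 1 */
};
const struct routing_table router_table = { router_routes, 2, NULL };
```

The sensor table is constant and holds a single entry, so a one port node pays almost nothing for routing, while a router node uses the same lookup with a longer table.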

BTW: Do you know DDS? It seems to go in a similar direction.

I do not know DDS in detail. But as far as I can see it is about data distribution, not about how to link nodes. So it may be used on top of a mesh infrastructure but does not provide it.

Maybe the zenoh.net layer could be leveraged for the routing of ThingSet messages ...

The net layer still seems to expect that there is a network available. I could not find the source of the zenoh-router. It seems it can work on IP layer 2 as for example Batman Adv. So an Ethernet network is necessary.

In contrast, ThingSet Mesh builds the network using CAN, RS232, ... data links without Ethernet. The zenoh.net router principles could be interesting, but without source it looks like one of those semi-open industrial projects. I have been burned by that kind of project and do not want to use more than concepts from them.

martinjaeger commented 2 years ago

Thanks for the explanations. I think I understand much better what you'd like to do now. And I agree it makes sense to have a simpler layer for CAN, RS232 instead of Ethernet.

For the library (maybe we should move discussion over there) I think we need those three modules (each can be switched on and off as you suggested):

Device side to receive requests and serve data (that's what's currently implemented in ThingSet/thingset-device-library)
Client side to send out requests (currently implemented in this repository for the specific ESP32 application)
Router (connecting above two modules for packet forwarding, etc.)

The port abstractions should ideally not only include the lower level protocols, but also the "other worlds" like MQTT or at least provide interfaces for them.

Regarding zenoh.net: Eclipse foundation sounded quite open to me, but it's somewhat strange that the protocol itself, which is the most important part from my perspective, is not documented, but you have to look it up in the code of the provided tools/libraries. Maybe I just didn't find it, so I raised an issue and asked about further documentation.

LibreSolar / esp32-edge-firmware

Introduce ports to handle tasks like device scan in a generic way #29
