Open m8pple opened 3 years ago
Looking back at my ToDo list for the Composer, this is on there but I missed it. Good spot.
My notes say that I was going to put it around all of the message types and the global properties. After the earlier discussions, I will add it around all of the device/pin properties and state as well.
Fixing this, I have run into an instruction space explosion when packing everything to the point that the canonical heated plate overflows the available instruction space when compiled with default options.
Size of Text for different packing added (O2): Existing packing (sporadic, only in the poets_pht.h) 6980 Packing Message structs in MessageFormats.h 7580 Packing Global Properties 7052 Packing device properties/state Overflow by 612 bytes All Overflow by 1208 bytes
Size of Text for different packing added (Os): Existing packing (sporadic, only in the poets_pht.h) 6584 Packing Message structs in MessageFormats.h 7036 Packing Global Properties 6656 Packing device properties/state Overflow by 96 bytes All Overflow by 616 bytes
Making without the Trivial log handler, I can get it to compile with Os for a total text size of 8264. O2 still overflows by 36 bytes.
Stripping the trivial log handler for Os takes us down to a text size of 5592 with no changes to packing.
Blowing 2672 bytes (about a third of the instruction space) for unaligned reads, especially when the example uses nothing but 32-bit wide variables, seems a bit off. I will have to dig further.
After chatting with @DrongoTheDog and @mvousden yesterday, I have been digging into the instruction space explosion a little more today.
I have a fairly simple struct of 4-byte wide values:
typedef struct cell_state_t
{
// Line 82
float t = 0.0;// Our temperature
uint32_t updated = 0;
uint32_t fin = 0;// Finished Flag
uint32_t finSent = 0;// Finished Sent Flag
uint32_t finIdx = 0;// Finished Index
float tFixed = 0.0;
float tData[4][2] = {{0.0,0.0},{0.0,0.0},{0.0,0.0},{0.0,0.0}};
uint32_t stepRX[2] = {0,0};// Bitmask for the inputs we have received each step
uint32_t connected = 0;// Bitmask of connected pins, populated in Init
uint32_t activeCount = 4;// Convenience value for number of connected devices to save recalcs
uint32_t iter = 0;// Our iteration count
uint32_t idleCnt = 0;
} devtyp_cell_state_t;
and a simple copier:
void Update(devtyp_cell_state_t* a, devtyp_cell_state_t* b)
{
a->t = b->t;
a->updated = b->updated;
a->fin = b->fin;
a->finSent = b->finSent;
a->finIdx = b->finIdx;
a->tFixed = b->tFixed;
a->connected = b->connected;
a->activeCount = b->activeCount;
a->iter = b->iter;
a->idleCnt = b->idleCnt;
a->stepRX[0] = b->stepRX[0];
a->stepRX[1] = b->stepRX[1];
for(int i = 0; i<4; i++)
{
a->tData[i][0] = b->tData[i][0];
a->tData[i][1] = b->tData[i][1];
}
}
unpacked.cpp uses the default packing and packed.cpp uses `#pragma pack(push,1) ... #pragma pack(pop)" around the struct.
When built with no optimisations with RISCV GCC 8.2.0 (the version we use by default), the resulting unpacked binary ends up with a text section of 1132 bytes and the packed binary ends up with a text section of 3332 bytes. Assembly dumps from GCC: unpacked_8.2.0.s and packed_8.2.0.s. Compiling with RISCV GCC 10.2.0 ends up with the same pattern, but 28 bytes smaller text size.
Compiling with O2 (our default Softswitch optimisation) drops these sizes to 212 bytes and 1072 bytes respectively: unpacked_8.2.0_O2.s and packed_8.2.0_O2.s
Well, I don't think this is actually the problem you're after, but in this particular case you could replace the whole Update thing with
inline void Update(devtyp_cell_state_t a, devtyp_cell_state_t b) { a=b; }
which will save a whole ton of unpacking, copying and re-packing. (The compiler will generate the copier, based probably on a memcopy of some sort.)
ADB
On 18/06/2021 21:54, Graeme Bragg wrote:
CAUTION: This e-mail originated outside the University of Southampton.
After chatting with @DrongoTheDog https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FDrongoTheDog&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441125142%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=BKHDbrjAM4lv5cnF1uHQuSIXzvt7AXLx0roBitiEoeY%3D&reserved=0 and @mvousden https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmvousden&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441125142%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=WOjkc4%2B5iqrxzjSPC2LCl2gLXZAZQyhCiVX2dXV7x4o%3D&reserved=0 yesterday, I have been digging into the instruction space explosion a little more today.
I have a fairly simple struct of 4-byte wide values:
|typedef struct cell_state_t { // Line 82 float t = 0.0;// Our temperature uint32_t updated = 0; uint32_t fin = 0;// Finished Flag uint32_t finSent = 0;// Finished Sent Flag uint32_t finIdx = 0;// Finished Index float tFixed = 0.0; float tData[4][2] = {{0.0,0.0},{0.0,0.0},{0.0,0.0},{0.0,0.0}}; uint32_t stepRX[2] = {0,0};// Bitmask for the inputs we have received each step uint32_t connected = 0;// Bitmask of connected pins, populated in Init uint32_t activeCount = 4;// Convenience value for number of connected devices to save recalcs uint32_t iter = 0;// Our iteration count uint32_t idleCnt = 0; } devtyp_cell_state_t; |
and a simple copier:
|void Update(devtyp_cell_state_t a, devtyp_cell_state_t b) { a->t = b->t; a->updated = b->updated; a->fin = b->fin; a->finSent = b->finSent; a->finIdx = b->finIdx; a->tFixed = b->tFixed; a->connected = b->connected; a->activeCount = b->activeCount; a->iter = b->iter; a->idleCnt = b->idleCnt; a->stepRX[0] = b->stepRX[0]; a->stepRX[1] = b->stepRX[1]; for(int i = 0; i<4; i++) { a->tData[i][0] = b->tData[i][0]; a->tData[i][1] = b->tData[i][1]; } } |
unpacked.cpp https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPOETSII%2FOrchestrator%2Ffiles%2F6679142%2Funpacked.cpp.txt&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441135141%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=nedwwaWvL18RMxTqkXpwp%2BhYsNBUBNg3pAZCB5LRUeA%3D&reserved=0 uses the default packing and packed.cpp https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPOETSII%2FOrchestrator%2Ffiles%2F6679139%2Fpacked.cpp.txt&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441135141%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=PUnLnz8nur39I3kqGLWOqFMzcSw2P6GbbP%2BB2xwTbcg%3D&reserved=0 uses `#pragma pack(push,1) ... #pragma pack(pop)" around the struct.
When built with no optimisations with RISCV GCC 8.2.0 (the version we use by default), the resulting unpacked binary ends up with a text section of 1132 bytes and the packed binary ends up with a text section of 3332 bytes. Assembly dumps from GCC: unpacked_8.2.0.s https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPOETSII%2FOrchestrator%2Ffiles%2F6679264%2Funpacked_8.2.0.s.txt&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441145134%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=64P%2F3YQOxP%2BFrfZbghM994dOhFO15jQxtAzJMaSyyk8%3D&reserved=0 and packed_8.2.0.s https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPOETSII%2FOrchestrator%2Ffiles%2F6679267%2Fpacked_8.2.0.s.txt&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441145134%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=OfvteqR0lgyCyBPGHtsQHVUsuuhbZ5W0XIUxIro%2BNfQ%3D&reserved=0. Compiling with RISCV GCC 10.2.0 ends up with the same pattern, but 28 bytes smaller text size.
Compiling with O2 (our default Softswitch optimisation) drops these sizes to 212 bytes and 1072 bytes respectively: unpacked_8.2.0_O2.s https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPOETSII%2FOrchestrator%2Ffiles%2F6679313%2Funpacked_8.2.0_O2.s.txt&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441155122%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=5n4fhFW5RyoUEa%2FH3D1clb3tnQLYSYZfMfZCyNEJmUE%3D&reserved=0 packed_8.2.0_O2.s https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPOETSII%2FOrchestrator%2Ffiles%2F6679314%2Fpacked_8.2.0_O2.s.txt&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441155122%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=6Gjf5YGAxQEzNzEeIfE0tCLFfBPxkNIJqBgEUzADG%2BQ%3D&reserved=0
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FPOETSII%2FOrchestrator%2Fissues%2F240%23issuecomment-864268638&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441155122%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=TFhjZUe7F%2FokVZ9J0zA94vK8ubzoZ5ockjCZKd%2FYzf0%3D&reserved=0, or unsubscribe https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE7EYI6SER3YDTUYSPJWWM3TTOW6TANCNFSM46VENHYQ&data=04%7C01%7Cadb%40soton.ac.uk%7C954c279d37e4484f6fd808d9329b3477%7C4a5378f929f44d3ebe89669d03ada9d8%7C0%7C0%7C637596464441165115%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=fyxyC4QboQFQzNPhMbAaKB0hdK0wyZknK%2BFkBnsp%2Bsg%3D&reserved=0.
Andrew Brown Professor of Electronics University of Southampton Hampshire SO17 1BJ UK
T: 02380 593374 M: 07464 981171
which will save a whole ton of unpacking, copying and re-packing. (The compiler will generate the copier, based probably on a memcopy of some sort.)
while that may help with this toy example (which is just trying to show the explosion), it won't really help in the general case for device code.
Devices generally don't copy large swathes of state around - individual members are accessed piecemeal for a variety of different handlers.
Way forward - pack 1 the message formats. Do not pack 1 the rest of it.
The types generated by the orchestrator do not have any explicit packing specification.
This is required by the v4 (and earlier, I think) specs, as otherwise it is impossible to do cross-platform movement of data-types, or to reliably map between raw bytes and structs:
https://github.com/POETSII/poets_improvement_proposals/blob/2a355282f3aa3912bdacf4ec50bf25199bdf310d/proposed/PIP-0020/virtual-graph-schema-v4.rnc#L768
I know that the orchestrator relies on gcc to handle this, but other implementations emit structures directly as binary (as it is much faster), and so need to have exact mappings. (To be fair, they also control the risc-v, GPU, or HLS tool, so they can work around it, but it still makes it easier).
Given the need to move structs between architectures within a system, and between systems with different architectures, there is no expectation that the C compilers will choose to lay structures out with the same alignment -
Implementation of externals (in the IO sense) is also impossible without known packing, as otherwise you have no idea how the device compiler has aligned things.
You also can't reliably move structs through bytes, as you have no idea how the compiler has decided to layout the structs. Your only choice is to move member by member, or copy each member into bytes, which is much slower.
History
At the v3 and v4 discussions I think this came up, and the solutions were:
pack(1)
and get users to pack.So we are left with option 1.
Problem
Compile the following code on different systems: