bradleyharden opened this issue 5 years ago
Looks interesting! The best way forward would be to take small value adding steps and see where this goes.
For the first step I would suggest that you create type_t as a record containing an id, and then a new_type_t function. Once you have that you can replace the queue element type enums with

constant name_of_type : type_t := new_type_t("name_of_type");

To start with there is really no need for anything but a name property. We should consider making the ID a string to avoid some run-time encoding. At this point there is no difference between type_t and msg_type_t, and maybe they should be the same thing in the end. A message type is just another type.
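A minimal sketch of what that first step could look like (the p_id field and the registry idea are my assumptions, mirroring how msg_type_t works today; this is not an existing VUnit API):

```vhdl
-- Sketch only: a type mark is just a handle into an internal registry.
type type_t is record
  p_id : integer; -- index into a registry holding the type name
end record;

-- Registers the name and returns a fresh type_t.
impure function new_type_t(name : string) return type_t;

-- Usage, replacing a queue element type enum literal:
-- constant name_of_type : type_t := new_type_t("name_of_type");
```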
The second step would be to define an element_t. As you noted, the unconstrained string would be problematic with VHDL-93, but does it really help to have a record with several encoded fields? Why not make it one long string?
In fact, that brings up a separate point. Why is queue_t implemented with a single, variable-length string? Why not implement it as a variable-length integer_vector and treat each integer as a string_ptr to an element? Then you never have to use push_variable_string and you also know how many elements are in the queue at any moment. To me, that seems like a much better way to implement it. Is there some reason the current implementation was chosen instead?
I'm not sure but @kraigher probably knows. It may have to do with memory performance/utilization. One large string vs many small fragmented strings.
Anyway, element_t is great for pushing things around internally, using unconstrained function inputs and return values, but it isn't easy for users at the top level.
If element_t is just used internally, there's not much value in doing the refactoring. We would just be moving things around while also creating a whole new set of to/from_element subprograms for every type.
There, I think it makes more sense to give them
type element_ptr_t is record
  p_type  : type_t;
  p_range : string_ptr;
  p_code  : string_ptr;
end record;
I'm not sure whether to use separate string pointers for the range and code here, or to combine them into a single string_ptr. Either way, this record is of fixed size and can be easily manipulated at the top level.
With the range being part of the code today it's probably easier to keep them together. If so, why not keep all together?
With element_t and element_ptr_t you could perform type conversions of the underlying coded data. To facilitate that, I would like to change the encoding of all base types to use the standard binary representation. For example, instead of using 'T' and 'F' for a boolean, use character'val(0) and character'val(1). That way, you could easily convert between boolean and a uint8 type. Or you could convert a real type into an array of 4 uint16 types. To accomplish this with real, we could use the to_real function from float_pkg, rather than the custom implementation we have right now.
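As a sketch of that change for boolean (assuming the existing encode overload is simply rewritten; this is not the current implementation):

```vhdl
-- Sketch: standard binary encoding of boolean in one byte, so the coded
-- string can be reinterpreted as e.g. a uint8 without translation.
function encode(data : boolean) return string is
begin
  if data then
    return (1 => character'val(1));
  else
    return (1 => character'val(0));
  end if;
end function;
```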
The true/false values come from a time when we also supported human-readable codecs for debugging. That could be changed. The real implementation is a workaround for some simulator, if I remember correctly. Do we really want to encourage type casting?
Next, you can define array_t as an array of element_t, where array_t stores each element internally as a pointer to an encoded string. In effect, element_t acts as the container that @kraigher mentioned one time on Gitter when discussing arrays of arbitrary type.
If the element is in itself an encoded string this would be easy.
Finally, you could define

type dynamic_t is record
  p_type  : type_t;
  p_array : array_t;
end record;

Using this, users could dynamically create a new type_t and append elements to the dynamic_t to define their own containers for custom records.
Also, dict_t could be modified to accept element_t and dynamic_t as keys and values. You could also create a list_t based on array_t with a focus on list operations rather than a general array type, like integer_array_t. I think there are many different directions you could take this.
Yes, there should be many possibilities. All standard data structures should have a generic variant.
Separately, I also have an idea for a bit_array_t. Its purpose would be to store elements of arbitrary bit width. Internally, the data would be stored as one large string. Each bit_array would store a range_t that indicates the size of each element. The get and set functions would pull out a substring and decode it to set/get the desired bits. You could even change the range_t dynamically to change your view of the bit_array. This is similar to read_word and write_word in memory_t, but not restricted to 8-bit boundaries. @umarcor might be interested in this as the basis for co-simulation. I think it would be the most flexible option.
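For concreteness, a bit_array_t along those lines could be sketched as follows (every name here is hypothetical; nothing like this exists in VUnit yet):

```vhdl
-- Hypothetical sketch of bit_array_t: packed data plus a per-element
-- bit range that defines the current "view" of the data.
type bit_array_t is record
  p_data  : string_ptr; -- all elements packed into one large string
  p_range : range_t;    -- bit range of a single element
end record;

-- Pull element i out of the packed string and decode the desired bits.
impure function get(arr : bit_array_t; i : natural) return std_ulogic_vector;
```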
I'll have to dig more into what @umarcor has done to understand the implications. I'll let him respond to this.
The second step would be to define an element_t. As you noted, the unconstrained string would be problematic with VHDL-93, but does it really help to have a record with several encoded fields? Why not make it one long string?
My original thought was that there would be a problem defining ownership. If element_t uses a string_ptr for the code, then who owns it when elements are created internally to VUnit, especially when pushing and popping? However, I think this might be solvable by carefully structuring functions and procedures to guarantee there is only one owner.
With the range being part of the code today it's probably easier to keep them together. If so, why not keep all together?
Part of my motivation here is to facilitate bit_array_t. By separating the range from the code, you can generate different views of the data. And more generally, the range and the code are separate entities to me, even if they are closely related. It might make sense to store them together, but we should still have functions get_range and get_code to pull them out separately, if required.
I've actually already completed and tested a refactoring of codec_pkg and friends in this regard. I created new encode procedures in codec_builder_pkg, and I migrated all of the encoding algorithms there. I also changed all the decode procedures in codec_builder_pkg to only decode the actual data, not the range_t. Decoding the range_t is now handled by the encode and decode functions in codec_pkg.

One of my main motivations here was to bring symmetry to the structure of the two packages and to keep the algorithms for encoding and decoding a data type adjacent to each other in the same package. It was a real pain when you had to flip back and forth between packages to see how something was encoded and decoded. I also took the opportunity to clean up some of the algorithms. For example, I greatly simplified the algorithms for std_ulogic_array.
Do we really want to encourage type casting?
I don't want to encourage it, but I think all tools are valuable if used appropriately. I'm sure there are some valid use cases. However, it would be reasonable to issue strong warnings about it in the documentation.
I'll put together a useful addition and make a pull request.
Also, you didn't answer one of my Gitter questions. Some of my changes to codec_pkg and codec_builder_pkg broke the code generation tools for the old com library. Is that ok? Can we remove those now?
I'm not sure but @kraigher probably knows. It may have to do with memory performance/utilization. One large string vs many small fragmented strings.
This is coherent with some comments in the sources of string_ptr_pkg and integer_vector_pkg, where the size of the allocated buffers is increased 'speculatively' to avoid too many allocations.
This is similar to read_word and write_word in memory_t, but not restricted to 8-bit boundaries. @umarcor might be interested in this as the basis for co-simulation. I think it would be the most flexible option.
I'll have to dig more into what @umarcor has done to understand the implications. I'll let him respond to this.
I must admit I'm a bit overwhelmed by this issue. I have not dug into the queue and codec sources yet, so I'm blind. Anyway, I'll do my best:
Currently, string_ptr and integer_vector_ptr are implemented as a pointer to an array of pointers to the type (character or integer). A variable is used to track the number of allocated elements. My proposal (#507) is to add a level of indexing: use a record that contains three pointers to arrays (of pointers), and three variables to track the number of allocated elements:
type ptr_storage is record
id : integer;
ptr : integer;
eptr : integer;
ids : storage_vector_access_t;
ptrs : vava_t;
eptrs : evava_t;
end record;
- ids contains an element for each allocated vector, no matter the storage mode.
- ptrs contains the pointers to internal arrays (just as in master).
- eptrs contains pointers to arrays that are cast/mapped to external variables (in C).

This is all built on top of a new type, type storage_mode_t is (internal, extfnc, extacc);, which is also a new option/parameter added to new_string_ptr and new_integer_vector_ptr.
When a new vector is requested, a new element is added to ids and id is incremented. If it is of type internal or extacc, the other vectors/variables are updated accordingly.
Then, for each get, set, length, resize, etc. a 'case when' is used to do up to three different actions (depending on the type).
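Assuming a mode is stored per id, that per-operation dispatch could look roughly like this (a sketch of the idea, not the code in #507; the mode lookup is my assumption):

```vhdl
-- Sketch: dispatch each access on the storage mode of the vector.
impure function get(ref : natural; index : natural) return character is
begin
  case mode(ref) is -- assumed per-id mode lookup
    when internal => return storage.ptrs(ref)(index);  -- VHDL-allocated
    when extacc   => return storage.eptrs(ref)(index); -- C-allocated, mapped
    when extfnc   => return read_char(ref, index);     -- via foreign function
  end case;
end function;
```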
NOTE: ptrs and eptrs are fundamentally the same. The difference is who 'calls malloc', VHDL/GHDL or C/Python. Unfortunately, explicit separate types are required. See ghdl/ghdl#797.
NOTE: allocations of type extfnc are only identified by an id. No array type is managed at all. Instead, resources are accessed through read_char, write_char, read_integer and write_integer (which are functions expected to be defined in C/Ada, but that can be implemented in VHDL). This is meant to allow the user to implement any custom function, which might even connect to some remote board/server/device.
This is implemented in #507 for string_ptr (and byte_vector_ptr as an alias) and in #476 for integer_vector_ptr. However, @bradleyharden, I think that none of these issues conflict with the changes you propose. These are lower in the abstraction stack. Since the mode defaults to internal, you should find absolutely no functional difference. Nevertheless, I think it is important for you to know this context.
Then comes #470, which tries to extend the mechanism to memory_t in verification_components. As a matter of fact, this is the feature that motivated the series of PRs (see #462). This is where the fun begins!
NOTE: the code proposal in #470 is not valid yet.
Please see this comment by @kraigher, where he explains some very basic details of the memory model which I didn't know at the moment. Then, from my last comment there:
- Data and metadata in VHDL.
- Data in an external byte array and metadata in VHDL.
- Data in VHDL and metadata in an external integer array.
- Data and metadata in an external integer array.
Hence, I believe that these same possibilities might apply to any other type that is built on top of either string_ptr or integer_vector_ptr. Each time you allocate one of them, you can decide to do it internally, as an external access, or through functions.

However, I don't know when queue_t is used, and I think I'm mixing it up with this comment. Is it possible that when @kraigher said that verification components interact with the memory model, he was referring to memory-mapped verification components? I.e., stream verification components use queue only and do not interact with memory at all. If so, I would say that queue 'must' support external models to be coherent with memory, so that verification components of any kind can be loaded externally.

Precisely, I use AXI Stream VCs quite a lot. When data is allocated externally, I have a simple for loop in the testbench that pushes it to an AXI Stream Master, and another loop that pulls results from an AXI Stream Slave. This is not a problem due to the streaming nature of the flow. I.e., I can write/read a value per clock cycle, because I know that they are sequential. With a (RAM) memory, I need to copy everything before the actual testbench starts.

Nonetheless, after this discussion, I believe that copying data from an external buffer to an AXI Stream verification component is not required. It should be possible to tell the verification component that data is already allocated and how many data elements are available.
Regarding the 'codified dynamic type system', I'm getting lost with the 'codified' term. For me, such a feature should allow the user to define a record type in VHDL and an equivalent struct in C, then allocate an array of number_of_elements*bytes_per_struct bytes, and have C access it through casting while GHDL uses it as the 'memory' of a queue. I feel that you have tried to answer and explain this explicitly in the first comment, but I am not able to understand how you 'cast' a string(1 to 9) to a custom record type in VHDL.
Last, but not least, the external models for string_ptr and integer_vector_ptr are almost ready, but there are some flaws to solve yet. The most important one is that the length is provided when the vector is created and cannot be modified for external models; resize, reallocate and deallocate are not implemented for external models. With models of type extacc, it is technically easy for either VHDL or C to do the reallocation, resizing, etc. However, with extfnc, it can be impossible.

As a side but related note, I think that currently vectors are given an index when they are allocated and the index is never reused, even if the vector is deallocated. As a result, the user is 'responsible' for allocating a sensible number of pointers and reusing the identifiers for an optimal solution. I believe it could be an interesting enhancement to keep track of the deallocated indexes without adding another three vectors to ptr_storage.
EDIT
Or you could convert a real type into an array of 4 uint16 types.
My current proposal is to only implement string_ptr (1 byte) and integer_vector_ptr (4 bytes), because that's already quite a lot of code duplication (4 files are almost identical). However, if you find it useful, it can be implemented for a type of 8 bytes too: https://ghdl.readthedocs.io/en/latest/using/Foreign.html#restrictions-on-foreign-declarations
@bradleyharden
Part of my motivation here is to facilitate bit_array_t. By separating the range from the code, you can generate different views of the data. And more generally, the range and the code are separate entities to me, even if they are closely related. It might make sense to store them together, but we should still have functions get_range and get_code to pull them out separately, if required.
Let's see how it turns out. In this case there are no VHDL-93 issues.
I'll put together a useful addition and make a pull request.
Great
Also, you didn't answer one of my Gitter questions. Some of my changes to codec_pkg and codec_builder_pkg broke the code generation tools for the old com library. Is that ok? Can we remove those now?
Yes, we've supported it long enough. I think we should make a commit where all deprecated stuff in com, logging, check, and run is removed.
@LarsAsplund, I remembered the issue that motivated my distinction between element_t and element_ptr_t.
I think it would be elegant and unifying to use some version of element_t as the fundamental unit that queue_pkg operates on. However, queue_pkg always pushes the basic types by value. But if element_t is defined in terms of a string_ptr, pushing by value becomes awkward and will probably have poor performance. Separating element_t and element_ptr_t solves this issue and provides an elegant implementation of queue_pkg.

Suppose a user wants to push an integer and that queue_pkg only operates on instances of element_t. In that case, push_integer would have to create a new, dynamically allocated element_t. The element would be passed to an internal procedure of queue_pkg, where the string would be copied into the queue. Then push_integer would deallocate the element. pop_integer would have to do the same. Thus, for a single push and pop, you have two allocations and deallocations.
There are a few VHDL-93-compatible alternatives here, but I don't think either is good.

- Pass element_t by reference. Then push_integer would allocate an element and pop_integer would deallocate it. That saves you one allocation cycle, but it's still not great. (**See note below on this possibility.)
- Keep queue_pkg as it is currently defined, where each overloaded function handles encoding and decoding itself. This will work, but it repeats a lot of code and misses an opportunity to unify the type system.

With VHDL-2008, this is a non-issue. element_t would be based on string rather than string_ptr. All of the push functions could create new elements without a new allocation, and the problem is solved.
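Under VHDL-2008 the by-value variant is straightforward; a sketch (the field names are mine):

```vhdl
-- VHDL-2008 sketch: an unconstrained string field makes element_t a
-- plain value; push/pop can pass it around with no heap allocation.
type element_t is record
  p_type : type_t;
  p_code : string; -- encoded range and data (unconstrained: 2008 only)
end record;
```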
** @umarcor, I'm not totally sure that queue_t was implemented as a single string to avoid too many allocations. The comment in queue_pkg refers to copying. If every new allocation requires that you copy all the existing data, then you don't want to allocate too often. However, the problem would be much different if queue_t were implemented as an array of integers representing string_ptr references. Each element would require an allocation, so the total number of allocations would probably go up, but the allocation of each element doesn't require copying. More allocations is a disadvantage, but there would also be a benefit: the queue itself would have to be resized less often, because there is only one integer per element, rather than many characters per element.

If we were to change the implementation of queue_t, then the pass-by-reference alternative above would actually fit well. You could define element_t as a single string_ptr with functions to extract the type_t, range_t, etc. When pushing an element to a queue, you would just set the next queue string_ptr equal to the element string_ptr. As far as I can see, this is the only elegant way to do it in VHDL-93. @LarsAsplund, do you have any thoughts on this approach? I realize it's pretty drastic. But at the same time, it wouldn't change any interfaces.
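A sketch of that pass-by-reference formulation (the extraction functions are placeholders for whatever decoding the codecs provide; none of this exists yet):

```vhdl
-- VHDL-93 sketch: element_t is a single reference; the type, range and
-- data are all decoded out of the underlying coded string.
type element_t is record
  p_ptr : string_ptr; -- encoded type, range and data
end record;

impure function get_type(element : element_t) return type_t;
impure function get_range(element : element_t) return range_t;
```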
I just took a deeper look at queue_pkg and I think there is room for improvement in the implementation. Right now, the string buffer is linear. When a push would overflow the buffer, a new, larger buffer is allocated and old data is dropped.

Changing the implementation to use a circular buffer would cut down on the number of times you need to resize. There would be a little more bookkeeping internally, but I think that would be minor relative to the benefits of a circular buffer. It would probably also make sense to stick to power-of-two buffer sizes.
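The extra bookkeeping is small. With monotonically increasing head and tail counters and a power-of-two capacity, the wrap-around is a single mod (which reduces to a bit mask), e.g. this sketch:

```vhdl
-- Sketch: map a monotonically increasing counter onto a physical slot.
-- With capacity a power of two, "mod" reduces to a cheap bit mask.
function slot(count : natural; capacity : natural) return natural is
begin
  return count mod capacity;
end function;

-- occupancy := head - tail; the buffer is full when occupancy = capacity
```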
I might try to re-implement queue_t myself, with the alternative mentioned above, and see what kind of performance I get.
I had another realization about the current implementation of queue_t. When a queue is initialized, the string_ptr length is 0. Here is the line where the queue is resized:

resize(queue.data, 2 * length(queue) + 1, drop => head);
Suppose a 24-bit std_ulogic_vector is the first element pushed to the queue. The coded form has four fields: a 1-byte type, a 4-byte integer for the string length, a 9-byte range, and 12 bytes of data. That's 26 total bytes.

To push that string to the queue, the current implementation has to resize itself five times, from 0 to 1, 3, 7, 15, and finally 31 characters. It gets better once the string is of a reasonable length, but the queue has to reallocate and copy itself several times to bootstrap to a reasonable length. Just initializing the string to some reasonable length could improve performance.
I'd like to get @kraigher's thoughts on this when he returns.
Sorry for the spam, but I want to jot these ideas down while I have them.
We can extend the reformulation of queue_t further. Define element_t as a string_ptr, then define array_t as a dynamic array of element_t. You can then base both queue_t and dynamic_t on array_t. The former is used when you want to consume and deallocate the elements as you go. The latter is used when you want to read the values multiple times. Maybe dynamic_t should be named container_t? Either way, you could have a function to cast a queue_t as a container_t and vice versa. Then you could change msg_t to accept and provide either format. That way you could get a message, convert its data to container_t and iterate over the data multiple times without having to store it yourself. It would also allow you to see how many elements are in a message before you use the data.

This change would also require that we modify the payload element of message_t, but I don't think it would be a big deal.
@LarsAsplund, I now have a working implementation of queue_t that uses an array of string_ptr_t rather than a single string. It also uses a circular buffer, rather than a linear buffer. Furthermore, I updated the implementation of string_ptr_pkg to keep track of deallocated reference indices and reuse them. I wanted to minimize the number of times that ptrs, the string_access_vector, is resized, since there will be a sharp increase in the number of string_ptr_t allocations with this new strategy. I plan to do some performance tests comparing my implementation with the current one. I'll post the results.
@bradleyharden, did you do those changes on top of #507 or independently of them?
No, I did not consider it. It wouldn't be hard to merge them, though. All I did was add an integer_vector to use as a stack for old pointer reference indices.
The changes are just meant to get me to the point where I can test the performance difference between implementations of queue_t. We'll see what happens after that.
@LarsAsplund and @kraigher, I have some preliminary performance tests of my new implementation of queue_t.

My version of queue_t uses an integer_vector_ptr for the data, where each integer is a string_ptr to a pushed element. I also used a circular buffer, rather than a linear buffer like the current implementation. I think my implementation has a number of advantages:

- the length of a queue is now equal to the number of elements contained within it
- each element is stored as its own string_ptr

So far, it looks like my implementation has better performance overall.
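In case it helps the discussion, the shape of the new implementation is roughly this (the field names are mine, not necessarily what the PR will use):

```vhdl
-- Sketch of the reworked queue: a circular buffer of string_ptr
-- references instead of one flat coded string.
type queue_t is record
  p_meta : integer_vector_ptr; -- head, tail and capacity counters
  p_data : integer_vector_ptr; -- one string_ptr reference per element
end record;
```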
I've run two types of tests below. Here is an example of the "Push then pop" test
for i in 0 to 2**16 - 1 loop
push(queue, 0);
end loop;
for i in 0 to 2**16 - 1 loop
int := pop(queue);
end loop;
And here is an example of the "Push/pop" test
for i in 0 to 2**16 - 1 loop
push(queue, 0);
int := pop(queue);
end loop;
Here are the results for the master branch:
==== Summary ======================================================================
pass vunit_lib.tb_queue_performance.Push then pop 2^16 integers (0.5 seconds)
pass vunit_lib.tb_queue_performance.Push then pop 2^18 integers (1.3 seconds)
pass vunit_lib.tb_queue_performance.Push then pop 2^20 integers (4.7 seconds)
pass vunit_lib.tb_queue_performance.Push then pop 2^21 integers (9.2 seconds)
pass vunit_lib.tb_queue_performance.Push then pop 2^22 integers (18.2 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^16 integers (0.5 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^18 integers (1.4 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^20 integers (5.2 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^21 integers (10.0 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^22 integers (20.1 seconds)
===================================================================================
pass 10 of 10
===================================================================================
Total time was 71.2 seconds
Elapsed time was 71.2 seconds
===================================================================================
All passed!
And here are the results for my implementation
==== Summary ======================================================================
pass vunit_lib.tb_queue_performance.Push then pop 2^16 integers (0.4 seconds)
pass vunit_lib.tb_queue_performance.Push then pop 2^18 integers (1.0 seconds)
pass vunit_lib.tb_queue_performance.Push then pop 2^20 integers (3.6 seconds)
pass vunit_lib.tb_queue_performance.Push then pop 2^21 integers (7.2 seconds)
pass vunit_lib.tb_queue_performance.Push then pop 2^22 integers (14.8 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^16 integers (0.4 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^18 integers (1.0 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^20 integers (3.5 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^21 integers (6.9 seconds)
pass vunit_lib.tb_queue_performance.Push/pop 2^22 integers (13.5 seconds)
===================================================================================
pass 10 of 10
===================================================================================
Total time was 52.4 seconds
Elapsed time was 52.4 seconds
===================================================================================
All passed!
So far, my implementation is better in every test. The advantage of a circular buffer is especially evident in the "Push/pop" tests. My implementation never has to resize the buffer, but the current implementation has to resize it regularly.
Are there any other performance tests you would like me to try?
With your permission, I would like to clean up my implementation and push it. The new queue_t implementation will also facilitate my implementation of element_t in a manner that is compatible with VHDL-93 (see my rambling posts above). When it all comes together, I think the new typing system will be really elegant.
@bradleyharden, that sounds really good. I'm looking forward to reading the PR!
Great ideas, but it seems like nothing has happened for a few years?
No, this has been sleeping for quite some time but I think it will be revisited as we continue with our Python integration in #986.
I mentioned this on Gitter, but I think it might be best to create an issue to consolidate the discussion and make sure comments don't get lost.
I would like to develop a generic dynamic typing system. I'm drawing inspiration from the queue_element_type_t and trying to generalize and expose that as a codified interface. My goal is to provide a generic type mark that contains everything you need to know to encode and decode a value of that type. What I'm aiming for is something like the following. For scalar types, size represents the number of bytes to encode a value. For array types, size represents the number of bytes to encode a single element of the array. Variable types have undetermined size. We can provide a function to determine the encoded length of any type.
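A sketch of such a type mark, based only on the fields described here (the record name and field names are assumptions; the original code block did not survive):

```vhdl
-- Sketch of a generic type mark: everything needed to encode/decode a
-- value of the type. Field names are assumptions based on the text.
type type_info_t is record
  p_name  : string_ptr;   -- type name
  p_class : type_class_t; -- scalar, array or variable-size
  p_size  : real;         -- bytes per value (scalar) or per element (array)
end record;
```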
However, I will probably end up using a numerator and a denominator instead of a real, since the size is always a rational number. That way we can use integer division to calculate the size.
My initial thought was that you could declare all types as constants of this record type. However, to use it as a type mark would require that you always encode the entire record with the encoded data. Instead, it would be easier to implement it as a record containing only a type ID. Then, we can use the type ID as an index into an array of string_ptr for getting the type name, an array of type_class_t for getting the type class, and an array of real for getting the size. This is the same strategy as msg_type_t.

Next, we can define an element_t. I'm not sure whether to keep the range as a range_t, i.e. a bit_vector, or to store it as an encoded range. It probably makes more sense to store it as string(1 to 9). The simulator's internal representation of range_t will need to be at least 9 bytes anyway, and then the size doesn't vary based on the range.

You could convert between element_t and all other types with the functions to_element and from_element, which would essentially be light wrappers around encode and decode. The aliases for from_element could also be to_std_logic_vector, etc., to mimic a real type conversion.

With this record, we can reformulate queue_pkg to push and pop elements. The string representation inside the queue wouldn't change, but every push and pop would be forced through push_element and pop_element.

In fact, that brings up a separate point. Why is queue_t implemented with a single, variable-length string? Why not implement it as a variable-length integer_vector and treat each integer as a string_ptr to an element? Then you never have to use push_variable_string and you also know how many elements are in the queue at any moment. To me, that seems like a much better way to implement it. Is there some reason the current implementation was chosen instead?

Anyway, element_t is great for pushing things around internally, using unconstrained function inputs and return values, but it isn't easy for users at the top level. There, I think it makes more sense to give them a fixed-size element_ptr_t. I'm not sure whether to use separate string pointers for the range and code here, or to combine them into a single string_ptr. Either way, this record is of fixed size and can be easily manipulated at the top level.

With element_t and element_ptr_t you could perform type conversions of the underlying coded data. To facilitate that, I would like to change the encoding of all base types to use the standard binary representation. For example, instead of using 'T' and 'F' for a boolean, use character'val(0) and character'val(1). That way, you could easily convert between boolean and a uint8 type. Or you could convert a real type into an array of 4 uint16 types. To accomplish this with real, we could use the to_real function from float_pkg, rather than the custom implementation we have right now.

Next, you can define array_t as an array of element_t, where array_t stores each element internally as a pointer to an encoded string. In effect, element_t acts as the container that @kraigher mentioned one time on Gitter when discussing arrays of arbitrary type.

Finally, you could define dynamic_t. Using this, users could dynamically create a new type_t and append elements to the dynamic_t to define their own containers for custom records.

Also, dict_t could be modified to accept element_t and dynamic_t as keys and values. You could also create a list_t based on array_t with a focus on list operations rather than a general array type, like integer_array_t. I think there are many different directions you could take this.

Separately, I also have an idea for a bit_array_t. Its purpose would be to store elements of arbitrary bit width. Internally, the data would be stored as one large string. Each bit_array would store a range_t that indicates the size of each element. The get and set functions would pull out a substring and decode it to set/get the desired bits. You could even change the range_t dynamically to change your view of the bit_array. This is similar to read_word and write_word in memory_t, but not restricted to 8-bit boundaries. @umarcor might be interested in this as the basis for co-simulation. I think it would be the most flexible option.