ddasilva commented 1 year ago

This ticket is to support dynamic packet definitions that change their definitions per-packet based on the header. This came out of conversations with @tloubrieu-jpl and @nischayn99 about a particular packet they have on Europa Clipper which changes its definition based on a header field (which isn't the secondary header field, even).

I came up with two ways we can support this. I'm writing them here so others can discuss and weigh in.

Packet Factory

We can have a PacketFactory, with its own load() method, which allows you provide a function reference that returns a packet definition based on the primary header. For each packet, the primary header would be parsed, and then the provided function would be called, and the packet would be re-parsed with the definition provided by the function. An example of what this would look like is as follows.

def packet_factory_fun(headers):
       if headers[xxx] == yyy:
           return VariableLength([
                  ....
            ])
        elif  headers[xxx] == zzz:
           return VariableLength([
                  ....                           # different than before
            ])
        else:
            return VariableLength([
                  ....                           # even more different 
           ])

pkt_fac = PacketFactory(packet_factory_fun)
result = pkt_fac.load('telemetry.bin')

If you wanted to parse more than the primary header and use that to decide on the rest of the packet, you could pass in an optional argument to provide that "decision" definition.

def packet_factory_fun(initial_results):
    # make decision based on result of parsing with initial_defs
    ...

initial_defs= VariableLength([
    # initial fields to parse
    ...
])

pkt_fac = PacketFactory(packet_factory_fun, initial_defs=initial_defs)
result = pkt_fac.load('telemetry.bin')

Define fields conditionally based on Expression

I don't think this is the best option, but I think it's a bit complex but I will mention it here. We could add a a keyword argument condition= that you could set to some very simply expression that would be evaluated for each packet to determine if the field is included. For example, if the field is only included if the secondary header field is none, you could add condition="CCSDS_SECONDARY_FLAG==1".

I don't think this is the best because adding the ability to parse these expressions is going to add a lot of complexity. We could also just replace the string expression with a callable (probably a lamba) and do something like condition=(lambda headers: headers['CCSDS_SECONDARY_FLAG'] == 1), but I also think that doing that multiple times might be messier than using a factory.

pkt = ccsdspy.FixedLength([
     PacketField(name='SHCOARSE', data_type='uint', bit_length=32, condition="CCSDS_SECONDARY_FLAG==1"),
     PacketField(name='SHFINE',   data_type='uint', bit_length=20, condition="CCSDS_SECONDARY_FLAG==1"),
     PacketField(name='OPMODE',   data_type='uint', bit_length=3),
     PacketField(name='SPACER',   data_type='fill', bit_length=1),
     PacketField(name='VOLTAGE',  data_type='int',  bit_length=8),
     PacketArray(
         name='SENSOR_GRID',
         data_type='uint',
         bit_length=16,
         array_shape=(32, 32),
         array_order='C'
     ),
])

ddasilva commented 1 year ago

@ehsteve @jmbhughes would you mind weighing on to let me know what you think?

jmbhughes commented 1 year ago

I think the PacketFactory strategy will work. A couple quick thoughts:

I prefer the first solution over the second. The first allows you to define a completely different packet structure depending on some condition. The second solution seems like the logic might get hard to follow. You could have a lot of different conditions and then it's difficult to see the full packet definition at a glance. (It could get really complicated if it's defined by conditional functions... the function could have a bunch of nested conditions or something.) Which raises one of my first questions, can there be a convenience function (in either scenario) to print out all the possible packet definitions based on their branches? And a way to interrogate which packet definition was used for which packet? It probably wouldn't be used in practice but would be a great debugging tool.
Is there a way to save rewriting redundant parts of packet definitions that don't depend on the header? I imagine it might be the case that only a few portions change and the rest is the same. Maybe this is already possible with clever writing of the VariableLength object, but a nice example in the documentation would be good.
For the last bit of your description about using initial_defs, is it possible to avoid reading the packet twice? You read a chunk of the packet to get the fields you need to determine what the rest of the packet is formatted like and then read again to figure out the rest? Or can you cleverly not have to reread the rest? It's an implementation detail that might impact speed, don't know.
If you do go for your second example, I would use functions passed in instead of expressions. As you pointed out, supporting expression parsing could get complex.

I'm happy to clarify any of these points and discuss them more. I think the key point is getting input from people who would use this feature. I do think it would solve one of the problems I raised at one point where the packet definition can change over time. Now, you would just define a PacketFactory with different definitions based on the primary header date information.

tloubrieu-jpl commented 1 year ago

@ddasilva I like the packet factory solution. As @jmbhughes mentioned that could be interesting not to define the packet twice but it is not critical.

I was thinking we might be able to make the case where the packet structure depends on a field value beyond the header a generalization of the case where the packet structure depends on the header: if the packet factory has a definition of a base packet with the generic fields the beginning of the packet, we could use any of the fields defined (in the default header or by the developer in the factory) to decide on the packet structure returned by the function used as argument of the packet factory constructor.

That would prevent us from having a specific sub case for handling decision from fields outside the header.

tloubrieu-jpl commented 1 year ago

Iteration on previous proposition:

After thinking a bit more, although I like the proposal, I don't think it actually acts as a factory, which, in my mind, and in object design patterns, is an object which has a get method returning a built object, here for example a variableLength packet structure.

In our legacy code we have a protean packet structure object which we name this way because the shape of the packet structure changes depending on the packets.

I would make the protean structure object a specialization of the variableLength object (or its parent object).

But otherwise it is very much alike what @ddasilva proposed above for when one needs to parse more that the header to find out what the full packet structure is.

On an example it is how I am thinking of that:

def get_other_fields(static_field_values: dict):
    if  static_field_values[HEADER_FIELD_NAME]== “this value”:
        return pseudo_apid1, [PacketField(…), …]
    elif static_field_values[OTHER_FIELD_NAME] == “another value”:
        return pseudo_apid2, [PacketField(…), …]
    …

packet_struct  = ProteanPacket([static_fields], dynamic_fields_fun=get_other_fields)
parsed_packets: dict = packet_struct .load(stream)

Note that since the structures of the packet differ that would be preferable to store them into distinct outputs that I proposed to group in a dictionary with a pseudo_apid as a key.

Different idea

Another totally different option which came up to my mind is to simply have a generalization of the split by apid util script to split streams on a non APID fields,. The developers would then parse each stream with its specific base packet definition. But then I don’t see a solution for splitting depending on a non header field.

Sorry if I am making things more confused. Your ideas with the experience you have on other types of packet might help.

ddasilva commented 1 year ago

I like the formulation that @tloubrieu-jpl presents a bit more. While it provides the same capability, the aesthetics are much better. Though, I might choose a difference name than Protean because while this is a real word I suspect most people wouldn't know what it means (and especially so for most non-native English speakers). I could be totally wrong here, though.

Right now, packet definitions aren't instantiated with an APID. What was your reasoning with having get_other_fields() return a pseudo apid?

ehsteve commented 1 year ago

I was worried at first that this may not be compatible with the CCSDS standard but, looking through the guidebook, nowhere does it say that an APID needs to define a singular non-changing packet. My worry was that it would be weird to add functionality to a library called CCSDSPy that is not compatible with the CCSDS standard!

Though i like the ideas presented above, i feel like they rely too much on defining packets through code rather than more readable definitions such as a csv and I think it is important to maintain that capability. It seems like the problem that is presented here is that the APID does not uniquely define a packet but a combination of the APID and some other packet field does. Could we not expand our packet definition to include an ID item that does uniquely define the packet format. Something like the following

pkt = ccsdspy.FixedLength([
     PackedID(value=256, bit_offset=[5, 32], bit_length=[11, 3]),
     PacketField(name='SHCOARSE', data_type='uint', bit_length=32),
     PacketField(name='SHFINE',   data_type='uint', bit_length=20),
     PacketField(name='OPMODE',   data_type='uint', bit_length=3),
     PacketField(name='SPACER',   data_type='fill', bit_length=1),
     PacketField(name='VOLTAGE',  data_type='int',  bit_length=8),
     PacketArray(
         name='SENSOR_GRID',
         data_type='uint',
         bit_length=16,
         array_shape=(32, 32),
         array_order='C'
     ),
])

In the above example, it would add the bits of the APID and the 3 bits at position 32 and if they equal 256, it make use of this definition to parse the packet. We could optionally allow the user to provide their own function to calculate the unique identifier. Hopefully adding this functionality to the parsing code would be easy as the parsing code itself does not have to change, we just have to add a gate at the front of the code to identify the packet. If there were many similar packets, it would mean repeating parts of the packet but I am not so worried about that since you can still write this in code and remove the repetitions if you wanted to with a loop of some kind.

By the way, this is similar to how COSMOS does it (see https://openc3.com/docs/v5/telemetry#append_id_item) though more advanced.

ddasilva commented 1 year ago

@ehsteve Your solution is interesting, and it would be nice to keep the ability to specify packets in csv. So when you specified a packet id like above, would the load(file_name) method only return the subset of packets in file_name that match the packet id specified, dropping the rest?

If you have multiple kinds of packets and you want to iterate through them in the order in which they appear, what would be the approach to interweave them back together?

This method also has the advantage that each field is there every time, so we don't have to deal with a scenario in the packet factory or protean packet solutions where some fields will have to become NaN/None/Masked.

ehsteve commented 1 year ago

@ddasilva could we add or expand the function split_by_apid to split the file into different packet types?

My longterm preferred approach would be to somehow allow a user to define multiple packets and let the load function figure it all out. split_by_apid always felt to me like a bit of a hack.

tloubrieu-jpl commented 1 year ago

Hi @ddasilva , @ehsteve ,

Sorry I missed your comments earlier.

The pseudo_apid that I was proposing meant to be able to differentiate the packet formats when a single APID is inconsistent with its format. I believe that is the same idea that you discussed after, and I am very open to other solutions to to uniquely identify packet formats beyond the APID so to be able to export the packets following a given format in a single CSV. The proposal of @ddasilva yesterday looks good to me.

On that topic, what we noticed when we started to use these non consistently formatted APIDs is that when we use the parsed output, we eventually want to regroup the different format tables belonging to a single APID together, by iterating on the lines in parallel for each format. That related to @ddasilva question yesterday. For doing that we maintain a dictionary dict[api, format_id]. That can be a property of the packet definition.

I am sorry that this makes the CCSDS standard more complex but these cases happen in real life, I think when the instrument team realizes they need additional format after the APID have been frozen at mission level.

At last for the name of the class I agree Protean is a bit pedantic, but again I would not go for Factory which has a specific meaning in design patterns and can be misleading. I asked chatGPT for other names, it proposed the following prefixes:

morpho
amorphous
mutable

I like mutable, what do you think ?

ehsteve commented 1 year ago

@tloubrieu-jpl sorry! yes i think we are saying the same thing for the packet identifier. Why do we need a new class though? What I am suggesting is that we just add this functionality to the existing class. Do we want users to have try to figure out what class they need beyond Fixed or Variable to understand how to parse their packets?

ddasilva commented 1 year ago

My longterm preferred approach would be to somehow allow a user to define multiple packets and let the load function figure it all out. split_by_apid always felt to me like a bit of a hack.

Yeah, we have all the tooling developed, it's just a matter of making it convenient. This could be done a couple ways. Without adding a new class, we could simply do it like this:

from ccsdspy import utils

packet_defs = {
   1035: FixedLength(...),
   1099: VariableLength(...),
   #...
}

for apid, packet_dict in utils.iter_mixed_file("my_mixed_file.bin", packet_defs):
    print(f"{apid} - {packet_dict}")

@tloubrieu-jpl sorry! yes i think we are saying the same thing for the packet identifier. Why do we need a new class though? What I am suggesting is that we just add this functionality to the existing class. Do we want users to have try to figure out what class they need beyond Fixed or Variable to understand how to parse their packets?

I am also averse to adding a new class. We could just add the dynamic_fields_fun=fun keyword to the VariableLength constructor, and not add a new class. In the long term future, I think it would make sense to deprecate FixedLength and VariableLength in factor of PacketDefinition, and then have the library figure out which decoder to use automatically (maybe you can override it's decision with load(file, engine="fixed_length")). But I don't want to get to too off-topic for this thread (if you want to discuss it further, let's open a new issue).

I do agree that the ability to define these definitions in csv files is a big advantage, and being familiar to COSMOS users is a plus. I think to start with, the easiest thing is to do the PacketId() approach. In the future, we could talk about supporting the dynamic_fields_fun=fun, because it technically allows scenarios that PacketId() doesn't.

Would the PacketId() approach work for your immediate needs @tloubrieu-jpl? Technically this isn't as expressive as the callable solution we discussed, but it hits every use case I have heard about so far (including those outside this conversation).

tloubrieu-jpl commented 1 year ago

Hi @ddasilva , can you remind me what the PacketId() option is ? Thanks

ddasilva commented 1 year ago

Basically, you would define a field using the PacketId class, which has a similar interface to PacketField, but it also accepts a value. When a binary stream is parsed with a definition that includes a PacketId, the load() method will only return the subset of packets where the field corresponding to the PacketId has the value supplied. This is how I interpreted @ehsteve's suggestion. Can you confirm @ehsteve?

For example, this definition would only return packets where the first field APID_SUBSPECIFIER has a value of 5. And when it finds such a packet, it is decoded with the rest of the definition provided here. If you have a different packet layout when APID_SUBSPECIFIER is 7, then you would define another packet using value=7 using a different set of fields that follow it.

pkt = ccsdspy.FixedLength([
     PacketID(name="APID_SUBSPECIFIER", value=5, bit_length=3, data_type="uint"),
     PacketField(name='SHCOARSE', data_type='uint', bit_length=32),
     PacketField(name='SHFINE',   data_type='uint', bit_length=20),
     PacketField(name='OPMODE',   data_type='uint', bit_length=3),
     PacketField(name='SPACER',   data_type='fill', bit_length=1),
     PacketField(name='VOLTAGE',  data_type='int',  bit_length=8),
     PacketArray(
         name='SENSOR_GRID',
         data_type='uint',
         bit_length=16,
         array_shape=(32, 32),
         array_order='C'
     ),
])

ehsteve commented 1 year ago

@ddasilva yes that is what I suggested. I worry that your example for PacketId is not versatile enough though. What if you need to check two fields in the packet to uniquely identify it?

Also do we want to make PacketId required? That would make the packet definition explicit in all cases even when just using APID. It does not need to become required right away. We could issue a warning in the next release that this is coming and add it to the release after that.

ddasilva commented 1 year ago

I'm a bit concerned that if we allow logical combinations of PacketId values, we'll get ourselves into a mess where we'll have to implement a small logic engine, and then risk having it creep into a small interpreter. This was what I didn't like about the second proposed solution in the original text of this issue. I think that if you want to do logic, it's fair to make you define it in python.

So, what I would propose to address this issue is implement just PacketId's if they work for @tloubrieu-jpl's use cases, and if they don't, then we'll also implement the dynamic_field_fun=fun option.

Making PacketId required in the long term for APIDs sounds reasonable, but maybe we should talk about this in a video call in the future to make sure we're clear about how this would work (all would be invited to attend).

tloubrieu-jpl commented 1 year ago

Hi @ddasilva. Sorry for my late reply, I wanted to review our code more accuratelly but did not have time to do so, but I know the PacketId option will work in at least one of our case that Nischay is working on.

Thanks,

Thomas

ddasilva commented 1 year ago

Thanks, that's great @tloubrieu-jpl. Would it be possible for @nischayn99 to provide me with some test data, definitions, and example output (to the extent such is available)?

tloubrieu-jpl commented 1 year ago

Hi @ddasilva , @nischayn99 can prepare VariableLength packets using PacketId which should work with a test binary file that we'll send along. That should be done by the end of week.

tloubrieu-jpl commented 1 year ago

Hi @ddasilva , sorry for the long delay here.

After looking at our case, I am actually not sure if the PacketId will work for us.

In our packets we have a field called SCI0TYPE. When its value is 0x1, the packet structure is the metadata type. When it is different from 0x1 (0x2, 0x4, 0x8, 0x10, 0x20, 0x40) we have a different packet structure.

Do you think the PacketId option would work in this case ? I am asking because you said you don't want the value attribute to support a complex semantic, in this case that would be value != 0x1 or value in set(0x2, 0x4 ....).

Or maybe, we could consider a default packet structure (when there is no PacketId) and one which only applies where a PacketId is defined and has a given value ?

As you suggested, we can organize a meeting to discuss that...

Thanks,

Thomas

ddasilva commented 1 year ago

Thanks for your confirmation. @tloubrieu-jpl . I believe there is a way to support your use case without adding any other functionality. Can you do this?

Define a "decision packet" definition with this key field, followed by an expanding field
Define your two other full packet definitions, A and B
Loop through packet bytes using utils.iter_packet_bytes()
- parse each packet bytes with the decision packet definition
- check your condition on the result, and then reparse with A or B

In the interest of time and an ending summer internship, this might be the fastest solution to get you going before we come up with a more streamlined long-term implementation.

tloubrieu-jpl commented 1 year ago

Thanks @ddasilva , this is finally much more simple that anything we elaborated in our current parser and in this ticket before, I like that. Thank you again for the pragmatic solution, that would cover all our cases.

tloubrieu-jpl commented 1 year ago

By the way, we had a short prolongation of 60hours over 3 weeks of @nischayn99 internship.

CCSDSPy / ccsdspy

Support dynamic packet definitions based on header #72

Packet Factory

Define fields conditionally based on Expression

Iteration on previous proposition:

Different idea