Investigate default value of ``Label.to_dict(..., skip_attributes=bool)``

LevN0 commented 1 year ago

The Label class provides a method to convert the label from Label (which is effectively a subclass of ElementTree) to a dictionary, namely Label.to_dict(...).

The method includes a number of optional arguments.

One of the optional arguments is skip_attributes. The effect of this argument is as follows:

Original XML,

<Table_Character>
    <offset unit="byte">0</offset>
    <records>76</records>
    ...
</Table_Character>

Converted to dict via Label.to_dict(skip_attributes=True),

{
    "Table_Character": {
        "offset": "0",
        "records": "76"
    }
}

Converted to dict via Label.to_dict(skip_attributes=False),

{
    "Table_Character": {
        "offset": {
            "@unit": "byte",
            "_text": "0"
        },
       "records": "76"
    }
}

The help string for this argument states "If True, skips adding attributes from XML. Defaults to False." However, despite this text, the argument is actually set to True by default, i.e. attributes are skipped when converting to dict.

Need to investigate whether the argument should be changed to False by default as the help string says, or whether the help string should be adjusted.

LevN0 commented 1 year ago

@cmillion @m-stclair

I discovered this issue while looking at this text in pdr.

The docstring for the to_dict method. It also includes e.g. a cast_values argument.

m-stclair commented 1 year ago

Sorry, missed your tag.

We definitely missed the existence of skip_attributes and will probably rewrite in order to set it True.

My knee-jerk impulse is to say that to_dict should not by default discard XML attributes. There is certainly the argument that it's more user-friendly -- if a user writes:

coordinate = label['Product_Observational']['Observation_Area']['Discipline_Area']['cart:Cartography']['cart:Spatial_Domain']['cart:Bounding_Coordinates']['cart:west_bounding_coordinate']

they probably expect to get str or float, not an OrderedDict with the actual coordinate as the value of its '_text' key.

On the other hand, there is also the argument that -- at least I think this is true -- every PDS4 attribute that allows the use of XML attributes also mandates the use of XML attributes, so you can always know the expected type.

Fenreg commented 11 months ago

Dear all,

Regarding the issue at hand, the discrepancy between the set value for the skip_attributes option in the code and the reported default option documentation, I am of the opinion that the code value should match the reported documentation first and foremost.

Considering the larger question, i.e. how to deal with attributes, I am of a more radical opinion and although several PDS4 dictionaries do not include labels requiring attributes, I would, as a user, prefer to be confronted to a single datatype rather alternating system. Therefore, I would favor a solution in which the user would have an datatype listing value, unit for each and any label.

LevN0 commented 11 months ago

@Fenreg

To clarify, you would prefer,

<Table_Character>
    <offset unit="byte">0</offset>
    <records>76</records>
    ...
</Table_Character>

Translate to,

{
    "Table_Character": {
        "offset": {
            "@unit": "byte",
            "_text": "0"
        },
        "records": {
            "_text": "76"
        }
    }
}

Did I understand correctly?

m-stclair commented 11 months ago

I understand @fenreg as meaning that this snippet should translate to:

    "Table_Character": {
        "offset": {
            "@unit": "byte",
            "_text": "0"
        },
        "records": {
           "@unit": None,
            "_text": "76"
        }

in order to create complete consistency between expected keys of mappings associated with all PDS4 attributes. For complete consistency, however, a solution this should include all legal XML attributes associated with any PDS4 attribute -- for instance, both "offset" and "records" above -- and maybe even "Table_Character" -- should also have "@nil": None and "@nilReason": None items, in addition to many others.

I am concerned that a solution like this will make parsed labels difficult to read and require additional filtering code for users who wish to translate these Python objects back into XML.

Fenreg commented 11 months ago

I do apologize, @LevN0, for not answering sooner and I do thank you @m-stclair for highlighting the logic of my previous answer.

This question that you raise, @m-stclair, is actually the one I am dealing with presently, as I have to create ancillary data to science observations and need to copy general information relative to these observations back into the main PDS4 dictionaries of their lblx files. I have implemented the corrective to ensure the lecture of all attributes as discussed above, and in the end, either way works in my view, i.e. setting an exception for a specific label, or - as I did prior to finding out about skip_attributes - having a "to_dict" method to include a "unit" key for each and every label.

All the best,

LevN0 commented 11 months ago

The one thing I definitely will not do is add attributes not in the label into the result. This would require having any dictionary at hand, which the code does not and will not. The most data that can be included by PDS4 Tools here is what is present in the label itself.

Thus the options for label.to_dict(..., skip_attributes=False) for,

<Table_Character>
    <offset unit="byte">0</offset>
    <records>76</records>
    ...
</Table_Character>

are either,

{
    "Table_Character": {
        "offset": {
            "@unit": "byte",
            "_text": "0"
        },
        "records": {
            "_text": "76"
        }
    }
}

or

{
    "Table_Character": {
        "offset": {
            "@unit": "byte",
            "_text": "0"
        },
       "records": "76"
    }
}

m-stclair commented 11 months ago

On consideration, my vote is:

Make skip_attributes=True by default. For most applications, the XML attributes in PDS4 labels are not important.
When skip_attributes=False, always make the values of keys corresponding to PDS4 attributes dicts, whether or not they have XML attributes. This is basically because type consistency is convenient.

There are basically no options here that are not annoying compromises: XML elements simply do not cleanly translate to idiomatic Python Mappings. XML is not JSON. If a user needs perfectly reliable access to all XML affordances, they should use the Label object itself , or supplement pds4-tools with a high-level XML manipulation toolchain like bs4 + lxml.

m-stclair commented 11 months ago

I take it back: I vote don't make every value corresponding to a PDS4 attribute a dict when skip_attributes=False. This might have been a good data design decision at some point, but it will be a catastrophically breaking change for many pieces of downstream software, which is a high cost for a tweaky little marginal improvement.

Small-Bodies-Node / pds4_tools

Investigate default value of ``Label.to_dict(..., skip_attributes=bool)`` #86