Closed lblatchford closed 10 months ago
This is an interesting issue. The reason this happens is because this schema uses textNumberPadCharacter
to trim/pad leading zeros. The schema essentially does this:
<element name="foo" type="xs:int"
dfdl:lengthKind="explicit" dfdl:length="5"
dfdl:textPadKind="padChar" dfdl:textTrimKind="padChar"
dfdl:textNumberPadCharacter="0" dfdl:textNumberJustification="right"
dfdl:textNumberPattern="#0" />
This works fine for positive numbers, but negative numbers with three or fewer digits unparse to something like this: 0-123
because we add the padding ourselves after the number is turned into a string.
One might instead think to model it like this, using textNumberPattern
to have ICU do the padding instead of using textPadKind
and textTrimKind
:
<element name="foo" type="xs:int"
dfdl:lengthKind="explicit" dfdl:length="5"
dfdl:textNumberPattern="00000;-0000" />
But that doesn't work either. This is because the negative part of the pattern is only used to determine the negative sign, and not the number of digits. And with the positive pattern being 00000
, it will always unparse 5 digits even if there is a minus sign. So this unparses a three digit negative number to this: -00123
which is 6 characters long instead of 5.
I also tried this using the #
character (which does specify the length in characters instead of digits) and specifying a pad character using *0
, like this:
<element name="foo" type="xs:int"
dfdl:lengthKind="explicit" dfdl:length="5"
dfdl:textNumberPattern="*0####0" />
This does the right thing with positive numbers, but with negative numbers ICU pads to the left of the negative prefix, so -123 unparses to 0-123
, same issue as we currently have. There doesn't seem to be a way to tell ICU to pad to the right of the negative prefix using just the pattern unless there is also a positive prefix. I considered a pattern that has a "zero length" positive prefix followed by the pad char (which should say to pad after the negative prefix for negative numbers), but the only way I could think of a zero length prefix is ''
and that is just an escaped single quote, so isn't zero length.There is a setPadPosition
function that should do it, but there is no property in DFDL that would set that.
@mbeckerle, any thoughts on how we can accomplish this? We could just treat the field as a string, but that loses validation and removing padding zeros. Maybe we need a new DFDL property?
Not sure this will work, but the textNumberPattern also has its own padding mechanism independent of dfdl:textPadKind dfdl:textTrimKind, and dfdl:textPadCharacter.
It is introduced by the "" pattern character followed by the padding character to be used. So "0" in this case.
A pattern like "00000;-*0#0000" might work. I believe this means 5 wide for positive values, but 4 wide for negative but one of them will be the "-" sign, and the zero padding (if needed) to overall width 5 is AFTER the "-" prefix. I've not used this feature extensively though, so it's hard to say. But I think -123 would unparse as -0123.
But if not I would suggest this kind of technique, which is guaranteed to be able to work. The object needs to be modeled as a choice of two element types, sort of like this: (Untested)
<choice>
<element name="negPair">
<annotation><appinfo source="http://www.ogf.org/dfdl/">
<dfdl:discriminator testKind="pattern" testPattern="-[0-9]{4}-[0-9]{4}"/> <!-- has minus signs? -->
</appinfo></annotation>
<complexType>
<sequence>
<element name="x" type="xs:int" dfdl:textNumberPattern="#;-0000"/> <!-- negative X -->
<element name="Y" type="xs:int" dfdl:textNumberPattern="#;-0000"/> <!-- negative Y -->
</sequence>
</complexType>
</element>
<element name="posPair">
<complexType>
<sequence>
<element name="x" type="xs:int" dfdl:textNumberPattern="00000"/> <!-- positive X -->
<element name="Y" type="xs:int" dfdl:textNumberPattern="00000"/> <!-- positive Y -->
</sequence>
</complexType>
</element>
@stevedlawrence Could you provide the code following Mike's suggestion using choice that worked for you? I'm still getting the same unparse results when I try this.
I have not made Mikes suggested changes, I just think the concept should work.
Looking at it again, I think it might not work because the negative pattern in textNumberPattern isn't used for lengths, it's only used for the negative prefix/suffix (e.g. the minus sign). So the negPair textNumberPatterns want to look like this: dfdl:textNumberPattern="0000"
. So only four digits long and a negative sign is implied.
Though, we are discussing some possible changes for a future version of daffodil that would avoid the hassle of choices in the reference PR.
Ok your comment here indicated it worked so I was hopeful: https://github.com/apache/daffodil/pull/1138#issuecomment-1884877300
4 digits long with a negative sign is correct for negative numbers, and 5 digits is correct for positive numbers.
I did note that you're investigating options but unfortunately I need a solution that works now with Daffodil 3.6.0.
Ok your comment here indicated it worked so I was hopeful: apache/daffodil#1138 (comment)
4 digits long with a negative sign is correct for negative numbers, and 5 digits is correct for positive numbers.
I did note that you're investigating options but unfortunately I need a solution that works now with Daffodil 3.6.0.
Did you see my proposed workaround in this comment: https://github.com/DFDLSchemas/NITF/issues/19#issuecomment-1883864835 ?
It's not elegant, but it will allow total validation of data for a Cyberian application.
@mbeckerle Yes, that's what I tried, but the unparse is still producing a SLOC value of 0-1530-123
This is what I have in the updated NITF schema. Did I use your suggestion incorrectly?
<xs:element name="GraphicLocation">
<xs:complexType>
<xs:choice>
<xs:element name="negPair">
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:discriminator testKind="pattern" testPattern="-[0-9]{4}-[0-9]{4}"/>
</xs:appinfo>
</xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element name="Row" type="xs:long" dfdl:lengthKind="explicit" dfdl:length="5" dfdl:textNumberPattern="#;-0000"/>
<xs:element name="Column" type="xs:long" dfdl:lengthKind="explicit" dfdl:length="5" dfdl:textNumberPattern="#;0000"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="posPair">
<xs:complexType>
<xs:sequence>
<xs:element name="Row" type="xs:long" dfdl:lengthKind="explicit" dfdl:length="5" dfdl:textNumberPattern="00000"/>
<xs:element name="Column" type="xs:long" dfdl:lengthKind="explicit" dfdl:length="5" dfdl:textNumberPattern="00000"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
</xs:element>
Of course my untested stuff doesn't work.
The idea here is you pick off the minus signs and treat everything as positive numbers
In the negative pair, you add a "-" initiator to Row and Column, and make them length 4.
<xs:element dfdl:initiator="-" name="Row" type="xs:long" dfdl:lengthKind="explicit" dfdl:length="4" dfdl:textNumberPattern="0000"/>
<xs:element dfdl:initiator="-" name="Column" type="xs:long" dfdl:lengthKind="explicit" dfdl:length="4" dfdl:textNumberPattern="0000"/>
While this produces the correct result on unparse, the parse result loses the sign, so -0153-0123 is parsed to
<negPair>
<Row>153</Row>
<Column>123</Column>
</negPair>
While this produces the correct result on unparse, the parse result loses the sign, so -0153-0123 is parsed to
That's right. The sign has become "negPair" surrounding element.
I took a squint at the NITF spec. So I think ILOC SLOC etc can be not just pairs of negative and pairs of positive, but any mix of positive and negative values. In the RRRRRCCCCC notation the RRRRR is down (positive) or up (negative), and the CCCCC is right (positive) or left (negative). Combinations of up and right are possible, as are down and left.
So I would suggest creating elements named down or up for the RRRRR part, and right or left for the CCCCC part.
So I think this is better for ILOC, SLOC, etc.
<choice>
<element name="up" dfdl:initiator="-" type="uintNot0_4" />
<element name="down" type="uint_5" />
</choice>
<choice>
<element name="left" dfdl:initiator="-" type="uintNot0_4" />
<element name="right" type="uint_5" />
</choice>
<simpleType name="uintNot0_4" dfdl:lengthKind="explicit" dfdl:length="4" dfdl:textNumberPattern="0000">
<restriction base="BCS-N">
<minInclusive value="1"/>
<maxInclusive vaue="9999"/>
</restriction>
</simpleType>
<simpleType name="uint_5" dfdl:lengthKind="explicit" dfdl:length="5" dfdl:textNumberPattern="00000">
<restriction base="BCS-N">
<minInclusive value="0"/>
<maxInclusive vaue="99999"/>
</restriction>
</simpleType>
It's a hack, but up down left and right are used in the spec, so it's not mysterious.
After spending a week coming up with a fix to daffodil, a textNumberPattern just popped into my head that should work with current versions of Daffodil. It does means you'll need a textPattern specific for every length of negative number you need, but I think that's only needed for lengths 1 to 9, so not too bad . For example:
<!--
copy this for all lengths of signed numbers needed, changing
dfdl:length, textNumberPattern, and in min/maxInclusive
-->
<xs:simpleType name="BCS-N_5" dfdl:lengthKind="explicit" dfdl:length="5" dfdl:textNumberPattern="*0#0000">
<xs:restriction base="xs:int">
<xs:minInclusive value="-9999" />
<xs:maxInclusive value="99999" />
</xs:restriction>
</xs:simpleType>
<xs:element name="Row" type="BCS-N_5" />
The textNumberPattern
says that there must be a minimum of 4 digits (because of the 4 right zeros) and to pad to 5 characters (because of the additional single #
) with a 0
character (because of the leading *0
).
For positive numbers that are 5 digits, no padding will be added (it already has at least 4 digits and length is 5 characters).
For positive numbers that are 4 digits or less, ICU will add 0's to get it to 4 digits, and then pad a single 0 character.
For negative numbers (which must be 4 digits or less), ICU will add 0's get it to 4 digits, and it will also add the minus sign, which gets it to 5 characters so no padding is needed.
This works in current versions of Daffodil. Note that the only reason this works is because the pad character '0' is the same as the '0' digit added to get it to 4 digits. If these numbers were padded with spaces to the right of the negative sign instead of zeros, this trick wouldn't work and we would need the fix we've discussed in Daffodil. So all the discussions and fixes this week weren't for nothing, but aren't really needed to fix this particular issue.
For consistency and for validation, we should probably do something similar for positive numbers (I think the only lengths needed are 1 and 5?), but with a simpler textNumberpattern:
<xs:simpleType name="BCS-NP_5" dfdl:lengthKind="explicit" dfdl:length="5" dfdl:textNumberPattern="00000">
<xs:restriction base="xs:unsignedInt">
<xs:minInclusive value="0" />
<xs:maxInclusive value="99999" />
</xs:restriction>
</xs:simpleType>
...
<xs:element name="FileCopyNumber" type="nitfc:BCS-NP_5" />
Also note that textTrimKind an textPadKind are "none" for these new number types, since this doesn't rely on Daffodil to trim/pad anymore, ICU does it all now. I didn't specify the properties on these simple types because NITF already sets these properties to "none" by default.
I think I can do this all with a quick search and replace, I'll have a PR shortly.
Very clever, and this is the quintessential example of "hard won knowledge", having spent a week of your time, and a day of mine and much of lbatchford's etc.
I want to create a web page on the Daffodil site of DFDL idioms so that this knowledge gets indexed and is searchable somehow.
Or maybe we should just ask and re-answer this question on stack overflow (since the AI LLMs get trained from it)?
I've opened a PR that fixes this.
One minor addition I found is we need to provide dfdl:textStandardZeroRep="00000"
. It seems that if the string to parse is all zeros and the pad character is zero, then ICU sees the string as all padding, doesn't find any digits to parse, and fails to parse it. Seems like a bug in ICU, but fortunately we have textStandardZeroRep
to get around it.
I appreciate all the options - I've been testing with Mike's last suggestion and now I'll try Steve's suggestion.
Note that the same parsing issue affects 4 fields: SLOC, ILOC, SBND1 and SBND2
Note that the same parsing issue affects 4 fields: SLOC, ILOC, SBND1 and SBND2
I've merged the fix, I think this should fix it for those fields as well.
@stevedlawrence Looks good to me. I'm working with an older version of the schema and made the same changes to my copy and it worked great. Thanks to you and @mbeckerle for the quick assistance!
I suspect this issue will affect any field with type BCS-N that contains a negative value.
The SLOC field (GraphicLocation element in the DFDL NITF XML) is a BCS-N field with format RRRRRCCCCC. The row (RRRRR) and column (CCCCC) may either be both positive in the range 00000-99999 or both negative in the range -0001 to -9999.
It was observed that a field value of -0153-0123 was parsed correctly, the Row element contains -153, the Column element contains -123.
On unparse, the SLOC field is set to 0-1530-123 which is invalid and causes errors in applications reading the resultant NITF file.