Closed liar666 closed 7 years ago
It is by design that blanks, empty strings, and null values are all considered to be "no value". To get a different behavior, you will have to create your own tagger, or we can make it a feature request to be able to define what "no value" should mean (with a new flag).
Hi,
That would be a great feature to add. Also, distinguishing between null and empty as the default behaviour for this option would be best :)
Apart from the fact that I, as a computer scientist, find it quite unnatural to consider null and empty the same, this behaviour also makes things awkward. Indeed, with the current configuration, the only solution I found to merge my FNames and MNames correctly is to set an improbable String as defaultValue, do the merge, then use a ReplaceTagger/regex to remove my special defaultValue...
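For readers following along, the workaround described above can be sketched in plain Java. This is only an illustration of the three steps (placeholder default, merge, regex clean-up), not Importer code: the class name, the mergeNames method and the placeholder string are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SentinelMergeSketch {
    // Hypothetical placeholder, chosen so it is unlikely to collide with a real name.
    static final String SENTINEL = "__NO_MIDDLE_NAME__";

    // Merges first and middle names position by position.
    static List<String> mergeNames(List<String> firstNames, List<String> middleNames) {
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < firstNames.size(); i++) {
            String middle = middleNames.get(i);
            // Step 1: stand in for the defaultValue, so both lists stay aligned.
            if (middle == null || middle.trim().isEmpty()) {
                middle = SENTINEL;
            }
            // Step 2: the merge itself.
            String joined = firstNames.get(i) + " " + middle;
            // Step 3: stand in for the ReplaceTagger/regex that strips the placeholder.
            merged.add(joined.replaceAll("\\s*" + Pattern.quote(SENTINEL), ""));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("John", "Jane");
        List<String> middle = Arrays.asList("Quincy", null);
        System.out.println(mergeNames(first, middle)); // [John Quincy, Jane]
    }
}
```

The placeholder only exists to keep the two lists the same length during the merge, which is why it has to be removed again afterwards.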
Thanks again!
Re-opening, to keep track of it as a feature request.
The latest Importer snapshot now addresses this feature request. You can now specify a matchBlanks="true" flag on your <dom ... configuration to have the DOMTagger extract elements with blank values. Because the underlying parser (JSoup) normalizes white spaces in XML values, you will always get an empty string in cases where you have spaces only (but at least you'll have a match returned as opposed to none).
The handling of the defaultValue attribute has been modified to support empty strings and spaces (but null is still used when no default values are specified).
Have a try and please confirm.
The new features are now officially released. Re-open or create a new ticket if you have issues with them.
Hi,
I've built a minimal test which demonstrates that there is a confusion between a tag/attribute having no value (i.e. it is not present) and having an empty value (i.e. it is present in the document but with an empty String(*) as a value).
I believe this is a bug, in the same way that List a = null in Java is different from List a = new List().
Also, combined with the "new" defaultValue feature of DOMTagger, this mixes things up. Indeed, since a tag that is present but has an empty String as its value is considered not present, it is replaced by the defaultValue, thus losing its original (empty) value.
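To make the distinction concrete, here is a minimal sketch of the two possible rules for applying a default value. It is not DOMTagger's actual code; the class and method names are hypothetical, and the only assumption taken from this thread is that blank values are currently treated like missing ones.

```java
class AbsentVsEmptySketch {
    // Behaviour as described in this thread: null, empty and blank all count
    // as "no value", so all of them are replaced by the default.
    static String blankCountsAsAbsent(String value, String defaultValue) {
        return (value == null || value.trim().isEmpty()) ? defaultValue : value;
    }

    // Behaviour I would expect: only a missing (null) value takes the default;
    // an explicitly empty value is kept as it is.
    static String onlyNullIsAbsent(String value, String defaultValue) {
        return value == null ? defaultValue : value;
    }

    public static void main(String[] args) {
        String def = "NO_MIDDLE-NAME";
        System.out.println("[" + blankCountsAsAbsent("", def) + "]");  // [NO_MIDDLE-NAME]
        System.out.println("[" + onlyNullIsAbsent("", def) + "]");     // []
        System.out.println("[" + onlyNullIsAbsent(null, def) + "]");   // [NO_MIDDLE-NAME]
    }
}
```

With the second rule, an empty middle name survives as an empty String, and only a tag that is really absent gets the default.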
The provided example files demonstrate the problem: they extract separately FirstNames, MiddleNames and LastNames, then merge First&Middle Names into a new FirstName. To ensure we have the same number of values for all attributes of an author, we use the defaultValue option. If you run this example you get:
As can be seen, the empty Strings used for middlename22 and middlename23 are replaced by "NO_MIDDLE-NAME", just like the non-present middlename31 and middlename33, which is not what I would expect.
Also, this is not consistent with what JSoup returns:
As can be seen, JSoup makes a difference between when the tag is there and set to an empty string (empty lines are returned in place of middlename22 and middlename23) and when the tag is not set (no line is returned for middlename31 and middlename33).
By the way, I think this is a general problem in DOMTagger (and maybe other Norconex's products?), since this confusion also occurs in the treatment of the "defaultValue" itself.
Indeed, in the same example files, if you replace the defaultValue with an empty string (as in the first commented-out line), then the defaultValue replaces a non-existent value with another non-existent (thus dropped) value, instead of an empty String(*). As a consequence, the defaultValue principle has absolutely no effect and we end up with a different number of MiddleNames than First and LastNames... Which is not what I would expect...
(*) Finally, note that, since you seem to trim() all the entries in the XML files, the problem does not only occur for empty Strings, but also for any sequence of space characters. E.g., defining the defaultValue as any (possibly empty) sequence of spaces will result in this option having no effect (this is what the 2nd commented-out line demonstrates).
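The effect described above can be reproduced with a small stand-alone sketch. It assumes, as I understand the behaviour, that values are trimmed and that blank results are dropped; it is not the tagger's real implementation, and the class, method and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlankDefaultSketch {
    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Applies the default to missing/blank values, then drops anything still blank.
    static List<String> collect(List<String> extracted, String defaultValue) {
        List<String> kept = new ArrayList<>();
        for (String value : extracted) {
            String resolved = isBlank(value) ? defaultValue : value;
            if (!isBlank(resolved)) {
                kept.add(resolved);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // One middle name out of three is missing.
        List<String> middleNames = Arrays.asList("M1", null, "M3");
        System.out.println(collect(middleNames, "NO_MIDDLE-NAME").size()); // 3
        System.out.println(collect(middleNames, "").size());               // 2
        System.out.println(collect(middleNames, "   ").size());             // 2
    }
}
```

With a non-blank default the list keeps one entry per author; with an empty or spaces-only default the missing entry disappears, so the MiddleNames no longer line up with the First and LastNames.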
example2.xml.txt testDefVal2.xml.txt