Closed liar666 closed 8 years ago
This is a case where the same field is defined twice with different letter case. You define "DESCRIPTION" in your config, but "description" already exists in the document. Values end up stored under both names, and since metadata keys are case-insensitive by default on retrieval, you get the combined values of both.
Simply rename your field to lowercase "description" in your config or, even better, give it a more unique name to avoid collisions. Alternatively, you can use the CharacterCaseTagger to change the case of field names.
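To illustrate the effect, here is a minimal Java sketch of a multi-valued store whose keys ignore case. It is not the Importer's actual metadata class: `CaseInsensitiveMetaDemo`, `newCaseInsensitiveStore` and `add` are hypothetical names, and the sketch merges values at insertion time for simplicity, whereas the real store may keep both spellings and only combine them on read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CaseInsensitiveMetaDemo {

    // Hypothetical stand-in for a metadata store: keys compare without regard to case.
    static Map<String, List<String>> newCaseInsensitiveStore() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    // Appends a value to the field, creating the value list on first use.
    static void add(Map<String, List<String>> meta, String field, String value) {
        meta.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> meta = newCaseInsensitiveStore();
        add(meta, "description", "value from the page's own meta tag");
        add(meta, "DESCRIPTION", "value from the custom selector");
        add(meta, "DESCRIPTION-TEST3", "value from the custom selector");

        // Both spellings resolve to the same entry, so the values pile up.
        System.out.println("DESCRIPTION       -> " + meta.get("DESCRIPTION"));
        // A name with no case-insensitive twin keeps only its own value.
        System.out.println("DESCRIPTION-TEST3 -> " + meta.get("DESCRIPTION-TEST3"));
    }
}
```

This is why DESCRIPTION-TEST3 looks fine while DESCRIPTION does not: only the latter shares its name, ignoring case, with a field the document already provides.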
OK, thanks. Since I'm a Unix developer, I have a tendency to always consider that things are case sensitive. That's why I originally used ALL_CAPS for my tags; I thought they would not interfere with the already existing tags. I've solved the problem by following your suggestion to use more unique names to avoid collisions. However, it is quite difficult to guarantee the uniqueness of "my" names when I don't know what kind of meta tags the original author of the website might use...
Good point about it sometimes being tough to avoid collisions. One trick I personally like is to come up with a prefix for all my own tags that's relevant to the project I am working on, or to the client name. Like "acme-description", "acme-whatever", etc. If you find it too problematic, we may consider a feature request that would allow optional prefixing of all metadata extracted from document parsing. This could greatly help eliminate collisions too.
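The prefixing idea can be sketched in a few lines of Java. This is only an illustration of the naming convention, not an existing Importer feature: `FieldPrefixDemo` and `withPrefix` are hypothetical names, and lowercasing the field name is an extra assumption made here so the result does not depend on case handling.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class FieldPrefixDemo {

    // Returns a copy of the fields with every name lowercased and prefixed.
    static Map<String, String> withPrefix(String prefix, Map<String, String> fields) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            out.put(prefix + e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> mine = new LinkedHashMap<>();
        mine.put("DESCRIPTION", "text picked up by my own selector");
        // The resulting key is "acme-description", which no longer matches "description".
        System.out.println(withPrefix("acme-", mine));
    }
}
```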
AFAIC, I'm quite paranoid about users doing bad things with my code, so if it were me, I would have segregated user-generated stuff from mine :) Since the user can copy/rename application-level tags into his/her own namespace, that should not cause any problem. Anyway, thanks for the effort you put into explaining; it makes the software very easy to use :)
Hi,
I'm using norconex-collector-http-2.5.1 with lib/norconex-importer-2.6.0-SNAPSHOT.jar
In the crawler configuration file below, you'll find that DESCRIPTION and DESCRIPTION-TEST3 have exactly the same definition/selector. However, in the resulting "meta" file, I get different results for these tags:
Why do I get 8 times the values for DESCRIPTION?