Nordic special characters in the metadata

jnsjak commented 4 years ago

We’re having a problem with special characters in the importer.

We use a ScriptTransformer to extract certain patterns from the content and insert the extracted values as metafields on the document to index. This works very well, however whenever these values contains special characters (we’re in Scandinavia so there’s a lot of those), they get converted to Java Literals (\uXXXX) in the output. So “smörgåsbord” becomes “sm\u00F6rg\u00E5sbord” in the .meta file, it is not changed in the parsed .cntnt however.

In fact this happens if I just set a constant tagger like so:

<constant name="SpecialChars">smörgåsbord</constant></tagger>

Which results in: SpecialChars = sm\u00F6rg\u00E5sbord in the .meta file.

The content being fetched is UTF-8 and the response header is set to utf-8, sourceCharset is set to UTF-8, xml-configuration is UTF-8, -Dfile.encoding also to UTF8. How can we prevent the special chars from being represented in this java format? Please let me know if you need more info to recreate, but basically it happens to me if I download the minimum configuration, point the URL to www.example.com, add the tagger above and use the filecommitter.

The content is ingested into an IDOL instance and that is not able to handle such encoding and thus the searches break and we need to find a way around it.

Thanks and hope you are able to help :)

essiembre commented 4 years ago

Are you using the IDOL Committer? If so, these characters should not be making it to IDOL that way.

https://norconex.com/collectors/committer-idol/

See, what you are referring to (the .meta files) are files used internally by the crawler as its "working documents" if you want. While the FileSystemCommitter offers to write documents in that same internal format, it is not the most convenient for anything but troubleshooting. Typically, a more appropriate committer should be used instead. If you really have to though, loading that file in a java.util.Properties will automatically unescape them.

My recommendation would be to use the IDOL committer directly, or if you have a need to store the content as files first for whatever reason and use another process that commits to IDOL, use something like the JSONFileCommitter or the XMLFileCommitter, which should make your life easier.

jnsjak commented 4 years ago

Yes, we are using the IDOL-committer, switching to the filecommitter was only in order to see what was happening. I follow when you say these escaped unicodes are only internal and I suppose then that the problem lies somewhere later in the process, as it's those same characters that are missing in the indexed content. I've investigated further and when I catch and open the file that the committer or importer constructs, stores on disk (the XXXX.data idx file) and passes to IDOL with the DREADD command, it looks like it is in ANSI/ASCII format (that's what Notepad++ and IE reports) though the input content from the webpages is UTF-8 (the output I get from the Filecommitter is UTF-8). If I change the language types used in IDOL to ASCII from UTF-8, the characters are correctly ingested and searchable, I very much fear however that this will create problems for us, so I would much prefer to keep things in UTF-8.

Is there some platform setting I can change that controls the encoding output of the IDOL committer?

On a side note, this problem only happens when the committer is pointed directly at an index port of an IDOL instance, not if the content is send to a CFS in front of the same idol instance. From the cfs-logs there seems to be some encoding detection and correction going on inside the CFS before the content is passed on to IDOL.

essiembre commented 4 years ago

The IDOL Committer sends it as UTF-8. IDOL is likely expecting a non-UTF8 format or does not detect and assumes a non-UTF8 by default. Can you change your IDOL instance to accept UTF-8 by default?

Since you confirmed CFS is doing encoding detection, to make it work, I would assume it adds some language/encoding field (or converts the encoding). You may want to add that field yourself (you can use the Importer ConstantTagger to set that field to UTF-8. Here is a suggested approach:

First, add a constant to your document with ConstantTagger called MyLanguageType=englishUTF8.

Then, in your IDOL config, have something like this (exact syntax may vary):

LanguageType=true
PropertyFieldCSVs=MyLanguageType
...
[LanguageTypes]
DefaultLanguageType=englishUTF8
...
[english]
Encodings=UTF8:englishUTF8

If CFS actually converts it to a non-UTF8 charset (which I doubt), maybe find out what that expected charset is, and convert it yourself with the Importer CharsetTransformer. I would stay away from that last option if you can avoid it.

pbrisson75 commented 4 years ago

A simple test would be to write the output of the CFS and compare it with the output of the committer, I suspect the CFS is doing some autodect work.

To write the file out of the CFS, add the following to your CFS config:

[ImportTasks] Pre0=IdxWriter:/mypath/in.xml Post0=IdxWriter:/mypath/out.xml

in.xml will contain what the CFS receives out.xml will contain what it sends to IDOL

jnsjak commented 4 years ago

Thanks for your suggestions.

Regarding setting the language type, then yes that's basically what I'm doing. This use case here is our synonym system for searches, where the authors specify the language for the word sets at time of creation, so we know and want to control the languagetype setting for the content. Since the input is UTF8 we use a scripttagger and set the language to englishUTF8, swedishUTF8, polishUTF8 etc. depending the language metadata for the content. The default languagetype of the IDOL engine is set to englishUTF8. It's in this situation I get messed up special characters when I point the committer to the IDOL indexport but correct ones when I point the same setup to a cfs in front of the same IDOL instance.

As said if I set the languagetype encoding to ASCII (swedishASCII, polishASCII etc.) special characters are correctly represented when using non-cfs, except some select eastern European ones, which makes this approach problematic.

The ouput of the idxWriters are correct UTF-8. The logs of the cfs reports that it's detecting the incoming input content as UTF-8, so I don't think it's actually doing any conversion, only detection. I’ve also tried to do my own CharsetTransformation in the importer without success.

One thing I notice though is that the content being send to the CFS from the committer seems to be posted over a socket as an ingest action ("action=ingest&adds=<adds><add><document><reference>http://..."), whereas the content being send directly to the idol instance indexport is created as an idx file in the idol instance /index/status folder and added to IDOL with a DREADD index action.

The idx files created by these two processes have different encodings, the idx file produced by the cfs from the ingest action is UTF-8, whereas the idx file produced by the committer appears to be something like iso-8859-1 or windows1252 of the otherwise same content. I’ve included the two intercepted files for reference. 8944.data_IDX_Submitted_to_cfs_first.txt 8945.data_IDX_Submitted_to_idol_indexport.txt

It seems like I might have to place a cfs in front of idol, but I would much prefer not to. Is there some way I can control the encoding type on this idx file outputted by the committer?, does it require a change in the committer or do you think something completely different is happening?

essiembre commented 4 years ago

Thanks for your thorough troubleshooting. You are probably on to something. I will have a second look and try to provide a fix.

essiembre commented 4 years ago

I think we have a fix. Can you please try the latest snapshot version of the IDOL Committer and confirm? https://norconex.com/collectors/committer-idol/download

jnsjak commented 4 years ago

Yes I can confirm, now I get UTF8 all the way through the process and into idol. Thanks for the swift fix!

Kind regards Jens.

Norconex / committer-idol

Nordic special characters in the metadata #2