akhoury / nodebb-plugin-import

migrate your old crappy forum to NodeBB
MIT License

Some special characters not supported? Polish MyBB import #224

Closed: Neurovert closed this issue 5 years ago

Neurovert commented 5 years ago

I'm following the readme guide of the importer, but the special characters keep importing as:

U+FFFD - REPLACEMENT CHARACTER used to replace an unknown, unrecognized or unrepresentable character

Example post:

image

Exporter used: https://github.com/ASCIIcat/nodebb-plugin-import-mybb

I've tried converting the data to a different encoding before the import, but it didn't help. After seeing #108 I assumed the problem was on the MyBB exporter side, but I didn't find any code relating to charset/encoding there, so I'm back here. Maybe the MyBB exporter is incompatible with the current main plugin? Later, I'll set up a different forum engine, post some Polish characters there, and then try to import from it.

I'm hoping you can help me get this to work properly, @akhoury. @ASCIIcat, if you're still active, maybe you can help too.

If this is on the mybb exporter side, let me know and I'll try to write my own version. Thanks

Neurovert commented 5 years ago

I've just checked the SMF exporter on a fresh SMF installation. Polish characters unsupported - same problem.

Since the SMF exporter is the most recent one, I'm assuming the problem is on the importer plugin side.

akhoury commented 5 years ago

Are you using a parseBefore setting? If so, can I see the code block?

Neurovert commented 5 years ago

I'm not using it. My data is in UTF-8, so it shouldn't be needed. I've also verified the strings in the database and they are all valid, so the black question-mark diamonds aren't my fault :/

In case it matters: I haven't used any of the post-import tools yet, because they can't fix the black diamonds.

Neurovert commented 5 years ago

I've solved the problem, @akhoury. I'll share the fix and findings ASAP, but I'm letting you know now so you don't waste your time ;)

The problem is in the mysql module. My database server was set to latin2 while the database itself used utf8mb4, but the mysql module stupidly uses the server's encoding.
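To illustrate why such a mismatch can end in U+FFFD, here is a minimal sketch. It assumes the text arrives as latin2 (ISO-8859-2) bytes and is then decoded as UTF-8 on the client; that is one plausible mechanism, not something confirmed in this thread. The variable names are only for illustration.

```javascript
// "ł" (U+0142) is the single byte 0xB3 in ISO-8859-2 (latin2),
// but the two bytes 0xC5 0x82 in UTF-8.
const latin2Bytes = Buffer.from([0xb3]);
const utf8Bytes = Buffer.from([0xc5, 0x82]);

// Decoding latin2 bytes as UTF-8: 0xB3 is not a valid UTF-8 sequence
// on its own, so Node substitutes the replacement character.
const misdecoded = latin2Bytes.toString("utf8");

// Decoding real UTF-8 bytes as UTF-8 works as expected.
const decoded = utf8Bytes.toString("utf8");

console.log(misdecoded === "\uFFFD");
console.log(decoded === "\u0142");
```

Once a character has been replaced with U+FFFD the original byte is gone, which would be consistent with post-import tools not being able to repair it.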

I worked around this with a quick hack: a query that enforces my database encoding, run before any other query, right after the connection is established. In the case of the -mybbex exporter:

    var getGroups = function(callback) {
        // Force the connection charset to match the database before any other query
        var query = "SET NAMES 'utf8mb4'";
    ...

Here is how I found this: https://github.com/mysqljs/mysql/issues/804#issuecomment-431841416

I guess this could be built into the importer as a field in the template where the user enters the old database's charset. I don't have time right now to propose the code for it.
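A minimal sketch of how such a charset setting could be applied, not the importer's actual code. `applyCharset` is a hypothetical helper, and `connection` is assumed to be any object exposing `query(sql, callback)`, as connections created by the mysql module do. A stand-in connection is used here so the snippet runs without a database.

```javascript
// Hypothetical helper: issues SET NAMES with a user-supplied charset
// before any other query is run on the connection.
function applyCharset(connection, charset, callback) {
  // The value is interpolated into the SQL text, so accept only simple
  // identifiers such as "utf8mb4" or "latin2".
  if (!/^[A-Za-z0-9_]+$/.test(charset)) {
    return callback(new Error("Invalid charset name: " + charset));
  }
  connection.query("SET NAMES '" + charset + "'", function (err) {
    callback(err || null);
  });
}

// Stand-in connection for demonstration only: it records the queries
// it receives instead of talking to a database.
const fakeConnection = {
  queries: [],
  query(sql, cb) {
    this.queries.push(sql);
    cb(null);
  },
};

applyCharset(fakeConnection, "utf8mb4", function (err) {
  if (err) throw err;
  console.log(fakeConnection.queries);
});
```

The mysql module's connection options also include a `charset` field, so passing the charset when the connection is created may make the extra query unnecessary; that has not been tested against the setup described in this thread.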

akhoury commented 5 years ago

Good to know! Thanks for sharing!