Data Import BUG

ralphkretzschmar commented 6 months ago

Dear Team, when I try to import test data, the import identifies accounts as duplicates even though they are not really duplicates.

Example data (CSV content), organisation names: Müller, Muller, Möller, Moller. When I try to import these organisation names as accounts, the import treats "Müller" as "Muller", which produces false duplicates.

So I can't import such data because of the false duplicates.

Best regards Ralph

ralphkretzschmar commented 6 months ago

After digging deeper I realized that it was the configuration of the MySQL DB. Sorry for the false report.

Best regards Ralph

ralphkretzschmar commented 6 months ago

Unfortunately it's not only the DB. I thought I had to convert the DB to another collation like "utf8mb4_bin", but this has negative side effects. I guess the comparison of account names has to be done at code level?

urban-thinking commented 6 months ago

Hiya Ralph.

As we state in the installation guide (https://blog.crm-now.de/doc/berliCRM/installation/Installation_berlicrm.html), the DATABASE COLLATION must be utf8_unicode_ci.

The second part of your question is not clear enough to answer, can you try to reword it?

Regards Emilio

urban-thinking commented 6 months ago

Ahh ... I talked to a colleague who understood your question. Let me give an AI answer ;-)

The utf8_unicode_ci collation in MySQL is a case-insensitive collation that supports the UTF-8 character set. It treats accented characters as equivalent to their non-accented counterparts. This behavior is by design and is intended to facilitate searches and comparisons where differences in accents or case should be ignored.

In the case of "müller" and "muller", the utf8_unicode_ci collation treats them as equivalent because it ignores the difference in the accent on the letter 'u'. This can be beneficial in many situations, such as when searching for names or words where accents might be inconsistently used or omitted.

If you want accent sensitivity in your searches, you would need to use a different collation that supports that, such as utf8mb4_bin, which is case-sensitive and accent-sensitive. However, it's worth noting that using accent-insensitive collations like utf8mb3_unicode_ci or utf8mb4_unicode_ci is often preferred for applications where users might input data inconsistently.
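To make the effect concrete, here is a rough Python approximation of an accent- and case-insensitive comparison. It is only a sketch: MySQL's collations follow their own Unicode collation rules, and the helper name `accent_insensitive_key` is made up for this illustration. It groups the names from the report by a folded key to show which ones collide.

```python
import unicodedata
from collections import defaultdict


def accent_insensitive_key(name):
    """Approximate an accent- and case-insensitive comparison key.

    Assumption: this only imitates the observable effect of a *_ci
    collation; it is not MySQL's actual algorithm.
    """
    # Decompose characters, e.g. "ü" becomes "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks, then fold case.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


names = ["Müller", "Muller", "Möller", "Moller"]

groups = defaultdict(list)
for name in names:
    groups[accent_insensitive_key(name)].append(name)

# Keys shared by more than one name are the "false duplicates" from the report.
collisions = {key: members for key, members in groups.items() if len(members) > 1}
print(collisions)
```

Under this approximation "Müller" and "Muller" share one key and "Möller" and "Moller" share another, which matches the behaviour described in the report.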

Regards Emilio

ralphkretzschmar commented 6 months ago

Hi Emilio,

Thank you for your really fast update :) (makes sense to me). I think going with utf8mb4_unicode_ci is fine because of the search functions etc.

Do you know where in the code I should look to implement a more granular duplicate check in the data import function?

What I am trying to implement is a check of whether the account name I want to import already exists (a 100% identical match). That way I could import accounts like "Muller GmbH" and "Müller GmbH", since they would be treated as different accounts, and still keep the other benefits for inconsistent data input.

I could share my code afterwards; it could be interesting for German admin users.
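A minimal sketch of such an exact-match check, written in Python purely for illustration (berliCRM's import code is not shown in this thread, and `split_import` and `exact_key` are hypothetical names). One assumption beyond "100% same": names are normalised to NFC first, so that a composed and a decomposed "ü" still count as the same character.

```python
import unicodedata


def exact_key(name):
    # Assumption: NFC normalisation, so "ü" stored as one code point and
    # "u" + combining diaeresis compare equal. Accents and case are kept.
    return unicodedata.normalize("NFC", name)


def split_import(existing_names, incoming_names):
    """Split incoming names into new accounts and exact duplicates."""
    seen = {exact_key(name) for name in existing_names}
    to_import, duplicates = [], []
    for name in incoming_names:
        key = exact_key(name)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)  # also catches repeats inside the same file
            to_import.append(name)
    return to_import, duplicates


existing = ["Muller GmbH"]
incoming = ["Müller GmbH", "Muller GmbH", "Möller GmbH"]

to_import, duplicates = split_import(existing, incoming)
print(to_import)
print(duplicates)
```

Here only the second incoming row is flagged, because it matches the existing "Muller GmbH" character for character; "Müller GmbH" and "Möller GmbH" pass as new accounts.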

Best regards Ralph

Archibald111 commented 6 months ago

utf8mb4_unicode_ci is not OK; that is the source of your umlaut problem.

ralphkretzschmar commented 6 months ago

Hi Frank,

And which one should I use?

Best regards, Ralph

Archibald111 commented 6 months ago

As Emilio wrote: utf8_unicode_ci

AlexKay85 commented 6 months ago

Hi Ralph,

'utf8_unicode_ci' will not help you with this issue, it treats Umlauts the same as 'utf8mb4_unicode_ci'.

What you'd need is either a binary collation like 'utf8_bin' or a typecast to binary for every comparison. We do not support binary collations; they were not tested at all and probably wouldn't work very well. Unfortunately it's not easy to fix this on the code level either. There are too many places where it would have to be done, and it also opens another can of worms.
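The difference between the column's default collation and a binary comparison forced for a single query can be sketched with Python's built-in `sqlite3` module as a stand-in for MySQL. Everything here is an assumption made for the demonstration: the `accent_ci` collation is hand-written to imitate an accent-insensitive collation, and the `account` table and `accountname` column are invented names, not berliCRM's real schema.

```python
import sqlite3
import unicodedata


def fold(text):
    # Strip combining marks and fold case (imitation of a *_ci collation).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()


def accent_ci(a, b):
    # Collation callback: negative, zero or positive, like a comparator.
    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)


conn = sqlite3.connect(":memory:")
conn.create_collation("accent_ci", accent_ci)
conn.execute("CREATE TABLE account (accountname TEXT COLLATE accent_ci)")
conn.executemany(
    "INSERT INTO account VALUES (?)",
    [("Müller",), ("Muller",), ("Möller",)],
)

# Default comparison uses the column collation: accents are ignored.
loose = [row[0] for row in conn.execute(
    "SELECT accountname FROM account WHERE accountname = ? ORDER BY rowid",
    ("Muller",),
)]

# Forcing a binary comparison for this one query: exact match only.
strict = [row[0] for row in conn.execute(
    "SELECT accountname FROM account WHERE accountname = ? COLLATE BINARY",
    ("Muller",),
)]

print(loose)
print(strict)
```

The loose query returns both "Müller" and "Muller", while the strict one returns only "Muller". It also hints at the cost mentioned above: every query that should be exact needs its own override, while all other queries keep the insensitive behaviour.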

Best Regards, Alex

berliCRM / berlicrm

Data Import BUG #838