amateescu / search_api_solr

11 stars 14 forks source link

Avoid field names that are not valid java identifiers #62

Closed mkalkbrenner closed 9 years ago

mkalkbrenner commented 9 years ago

After the discussion in https://www.drupal.org/node/2354271 there was a decision to use a slash ('/') as IndexInterface::DATASOURCE_ID_SEPARATOR. But like the previous separator that leads into different trouble in combination with Solr. Some parts are described at https://issues.apache.org/jira/browse/SOLR-3996

These problems affect you in different ways when you use field names containing a slash within 'q' or 'fl' or ... , for example if you query the index directly like this: http://localhost:8983/solr/core1/select?q=_%3A_&fl=tm_entity%24my_entity%2Ffield_text&wt=json In this case you get the right amount of documents, but just empty documents.

The solution is to replace all characters which are not valid java identifiers.

amateescu commented 9 years ago

Thanks for the detective work and the fix!

drunken-monkey commented 9 years ago

You're right, the slash should of course be esacped in Solr, too. However, it's very unfortunate that we now have potentially two different incompatible characters. The swap "colon to dollar sign" was easy to make and kept identifiers unique. However, if we represent both colons and slashes as dollar signs, there could, in theory, be clashes where two different field identifiers are mapped to the same Solr field identifier – which would, of course, be bad. Thus, we might want to do something different – e.g., encode the slash as two dollar signs, or to encode a colon as $c and a slash as $s (or $c$/$s$, to look a bit nicer). We could of course also just stick with the dollar sign for both, since it's pretty improbable that such a clash would really occur. However, with tens of thousands of installations, the situation is almost bound to pop up at some point. People do crazy shit. ;)

In any case, I'd explicitly do a str_replace() with the two offending characters, not a blanket preg_replace() for all non-fitting ones. It should be guaranteed by the Search API that field identifiers will only contain those characters anyways – and if that ever changes, we'd better make an explicit decision about what to do with the additional characters. (My second proposal, of dollar signs initiating a two-character escape, was with that possibility in mind, since it would allow easily extending this for other special characters.)

mkalkbrenner commented 9 years ago

I understand the concerns addressed by @drunken-monkey . But introducing explicit mappings like $c$ will rapidly turn into an endless story if you think about all these new nice unicode characters out there! So if we want to ensure unique filed names, we need to come up with a generic solution. I suggest to replace all required chars by their bytecode wrapped in $ signs. For example '/' becomes '$2f$'. To guarantee a collision free solution we must also treat the '$' sign itself identically, it will become '$24$.

mkalkbrenner commented 9 years ago

It's even worse. '$' are also not allowed in field names. See https://cwiki.apache.org/confluence/display/solr/Defining+Fields. I addressed everything in pull request #67