amateescu / search_api_solr

11 stars 14 forks source link

Use only alphanumeric or underscore characters for field names. Other… #67

Closed mkalkbrenner closed 9 years ago

mkalkbrenner commented 9 years ago

… characters will be converted in their hexadecimal representation.

amateescu commented 9 years ago

I'm going to wait for @drunken-monkey's opinion on this one, since he had a few good points on the previous PR :)

drunken-monkey commented 9 years ago

As said in #62:

It should be guaranteed by the Search API that field identifiers will only contain those characters anyways – and if that ever changes, we'd better make an explicit decision about what to do with the additional characters.

We don't need support for "fancy Unicode characters" or anything: we are dealing with Search API field identifiers, which only ever consist of alphanumeric characters, underscores, colons and (optionally) a single slash. Therefore, we only need to escape those last two characters – no general, automatic translation mechanism is necessary.

Regarding the acceptability of dollar signs: this "requirement" was totally undocumented a year ago, and when I asked in the mailing list they explicitly said that all Java identifiers are OK. I also wouldn't know of any potential clashes with Solr request syntax (which is what breaks compatibility for other characters) and haven't had any problems with dollar signs in fields so far. Since they make our lives a lot easier in this case, I'd want to make sure we'll get into problems if we restrict ourselves to field identifiers without them. So, maybe try asking in the solr-users mailing list whether anyone knows of any potential problems (now or future), where this new policy comes from, etc.? Otherwise, I'd be in favor of keeping our usage of dollar signs.

So, I stay with my original proposal(s): just use the $$ or something similar for slashes and be done with it.

Also, side note: why is there still the clean_ids option in the code? That, like the site_hash option, was only a temporary solution to ease updating. When switching to Drupal 8, it can safely be dropped, I'd say (since indexes won't be compatible out-of-the-box anyways).

mkalkbrenner commented 9 years ago

A generic solution doesn't mean that it's not expicit: _3a_ is : _2f_ is /

Using the hex representation of a char is expicit, generic, readable and automatically decodable without knowledge of a specific proprietary mapping.

And as @drunken-monkey already pointed out in #62

People do crazy shit. ;)

Therfore I won't relay on that:

We don't need support for "fancy Unicode characters" or anything: we are dealing with Search API field identifiers, which only ever consist of alphanumeric characters, underscores, colons and (optionally) a single slash.

It's no problem to overwrite anything in search_api 8.x. Personally I consider to add my own unique separator to identify languages in field names.

drunken-monkey commented 9 years ago

A generic solution doesn't mean that it's not expicit: 3a is : 2f is /

Using the hex representation of a char is expicit, generic, readable and automatically decodable without knowledge of a specific proprietary mapping.

I wouldn't say it's "readable" in any real sense, and certainly more bloated than using a simple dollar sign. I also wouldn't say an openly documented transformation is "proprietary", at least not as I understand the word. Certainly, the hex transformation is already in use, whereas our transformation would be specific to our use case – but then, the use case is also specific.

What definitely would make your proposed identifiers uglier than the current (or my proposed) ones is that you'd also (have to) escape underscores, which frequently occur in field identifiers, which I'd oppose if at all possible. That's why I'd really like to know if there's a real reason not to use dollar signs – with them, the hex transformation would be more or less equivalent to my proposed one. I'd still prefer my escape sequences, as they are more concise, but I'd be open to be convinced, if the majority is on your side.

It's no problem to overwrite anything in search_api 8.x. Personally I consider to add my own unique separator to identify languages in field names.

Then you're almost bound to run into the same problem with Search API/Drupal which you are trying to avoid here on the Solr side: using unsupported characters in identifiers. If we currently don't define what field names have to look like, we should do that – and I'd really be for not allowing more "weird" characters than absolutely necessary (i.e., colon and slash apart from the usual "word" characters).

mkalkbrenner commented 9 years ago

@drunken-monkey if you're sure about the $-sign, that it's no problem to use it, I suggest to modify our patch to use it like this: : becomes $3a$ / becomes $2f$ $ becomes $24$ _ is not touched

In this case we have all the advantages:

drunken-monkey commented 9 years ago

I'm not completely sure, no. My last information was that dollar signs were officially OK, and I have no idea why this new description was added. But I think it should be OK, and don't know of any reason why it wouldn't.

I still don't think we have to support arbitrary characters here, but I concur that this only inflicts a small cost in this implementation, so I'd be fine with it. What do others think here?

cspitzlay commented 9 years ago

I agree with the point that having to encode the underscore would make many field names less readable because a Drupal field name with underscores is really a common case.

On the other hand the current Solr docs at https://cwiki.apache.org/confluence/display/solr/Defining+Fields recommend

alphanumeric or underscore characters only and not start with a digit

and the two points

other field names will not have first class support from all components

and

back compatibility is not guaranteed"

do not sound good to me.

I also found this:

https://issues.apache.org/jira/browse/SOLR-2606?focusedCommentId=13073128#comment-13073128 One of the commentators says:

I've long advocated sticking to java identifiers (minus the $) as best practice for a whole bunch of reasons.

Unfortunately he doesn't state those reasons again.

I really like the suggestion of using dollar signs with hex escaping of any strange characters but we should be sure of what using the dollar sign would entail.

cspitzlay commented 9 years ago

Java itself seems to have deprecated the dollar sign as character in an identifier: http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.8

The "Java letters" include uppercase and lowercase ASCII Latin letters A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical reasons, the ASCII underscore (_, or \u005f) and dollar sign ($, or \u0024). The $ character should be used only in mechanically generated source code or, rarely, to access pre-existing names on legacy systems.

mkalkbrenner commented 9 years ago

From the official Java Tutorial at https://docs.oracle.com/javase/tutorial/java/nutsandbolts/variables.html

A variable's name can be any legal identifier — an unlimited-length sequence of Unicode letters and digits, beginning with a letter, the dollar sign "$", or the underscore character "_". ... Additionally, the dollar sign character, by convention, is never used at all. You may find some situations where auto-generated names will contain the dollar sign, but your variable names should always avoid using it.

In combination with @cspitzlay's comment above I don't have a good feeling about the $-sign. Therefore I recommend that we should go for the pull request as it is now. I understand that the field names are a bit ugly. But it seems to be the right way to be compatible in the future. Even if solr currently supports the $-sign, it could be that it will be deprecated sometimes because of Java itself. And the normal user won't get in touch with these field names on the solr level.

drunken-monkey commented 9 years ago

Just for clarification: the "definition" that all valid Java identifiers are valid Solr field names does not have any technical reason whatsoever. Even Java completely dropping support for dollar signs in identifiers wouldn't influence Solr field names in the least.

Also:

encodable and decodable by other systems that have to access the index without the need to follow changes in search_api_solr

That's not really an argument, as that's exactly what the getFieldNames() method is there for (and why it is public).

Bearing that in mind, we could also go the way of the DB Search module and just save our mapping internally. Then we could just map all "forbidden" characters to underscores and deal with conflicts as they arise.

And the normal user won't get in touch with these field names on the solr level.

That's also not completely true – see the handbook instructions on how field names are created. There were frequent questions about this before I wrote it down (and afterwards, too, of course, since nobody reads the f*ing manual), as there are some easy and common customizations for which you need to know a field's Solr identifier. Of course, most users still won't care about any of this, but I'd nevertheless consider it a plus to be able to avoid explaining hex codes in that section as well.

Anyways, we're getting nowhere with "bad feelings". I've now posted a question on the solr-user mailing list myself. Let's see what they say.

mkalkbrenner commented 9 years ago

Anyways, we're getting nowhere with "bad feelings". I've now posted a question on the solr-user mailing list myself. Let's see what they say.

Thanks Thomas for posting the question to the mailing list! Unfortunately the "bad feelings" still remain :-( The important parts from https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201507.mbox/%3CCAN4YXvfy1OpMro_EnNy-7NxQA6UZqkr9Y1yw30KpqJxpw5L9Rw@mail.gmail.com%3E are

Solr isn't guaranteed to play nice with naming conventions other than those specified on the page you referenced, alphanumerics and underscores and not starting with numerics. ... So you can use dollar signs, but there won't be any attempts to support it if some component somewhere doesn't "do the right thing" with it. And no guarantee that there aren't current corner cases where that causes problems. And if it does cause problems, support won't be added.

As already assumed, the de facto specification is alphanumerics and underscores, nothing else.

see the handbook instructions on how field names are created ... but I'd nevertheless consider it a plus to be able to avoid explaining hex codes in that section as well.

I'll help with that, if you want.

drunken-monkey commented 9 years ago

Well, there wasn't really anything new in that answer, just a re-hashing of what the Wiki page says. I'm still pretty confident that we wouldn't run into any problems with dollar signs, but I can understand wanting to be on the safe side here, and find a future-proof solution. So, if there are no other objections, feel free to go ahead with this PR.

I've also thought, maybe the proper solution for the "What's the field's Solr name?" problem would be an additional tab on the server page (like "Files" – or both tabs as children under a new top-level "Solr info" tab) which just lists the complete mapping. Would be simple to implement, and make things a lot easier for everyone involved.

amateescu commented 9 years ago

@drunken-monkey, you have project permissions on this repository too, so feel free to merge it :)

I like the Solr info idea a lot, btw.

drunken-monkey commented 9 years ago

I haven't really followed the D8 branch at all, so I'm probably the wrong one to review the code, nor am I set up to test it. I'm just here for the more general discussions.

And yes, once we merge it we should either also work on adding that info tab, or update the handbook entry with a D8-specific subsection (or both, really – if you can see the information on the site, we should say so in the handbook).

mkalkbrenner commented 9 years ago

I moved the encoding functionality to the re-usable function Utility::encodeSolrDynamicFieldName() and added the corresponding function Utility::decodeSolrDynamicFieldName(). I already need them in apachesolrmultilingual. I also improved the encoding strategy to be concise. The first `that splits the dynamic field prefix from the rest of the name is now encoded as well: tm_entity:node/bodybecomestm_5f_entity_3a_node_2f_body That eases the decoding a lot. And thanks to the descision to use the underscore as quoting character it still perfectly fits the declaration of dynamic fields in our schema: tm_5f_entity_3a_node_2fbodystill matchestm*`.

amateescu commented 9 years ago

Moving them to utility functions is a very good idea. Happy to see we have an agreement on these field names :)