clintongormley / Elastic-Model

Use ElasticSearch as a NoSQL database in Perl
9 stars 8 forks source link

Clarify undef behaviour for Bool type fields #26

Closed timbunce closed 11 years ago

timbunce commented 11 years ago

The Elastic::Manual::Attributes docs say:

Note: By default, ElasticSearch treats undef as NEITHER true NOR false, but as null (ie missing). To work around this, we automatically set "null_value" to 0 to make boolean fields more Perlish. If you would like to revert to the default behaviour, set "null_value" to undef.

I don't understand the "make boolean fields more Perlish" comment, or the need to "work around this" for that matter. Perl has undef specifically to be able to distinguish null/missing from false. So does ElasticSearch. It seems you're fighting against what's natural for both.

Maybe I'm missing something, in which case the docs need expanding to include a more detailed explanation of the issues.

Also, the docs for null_value don't mention the use-case for Bool. In fact they argue against the case made in the Bool docs:

This option is included for completeness, but isn't very useful. Rather just leave the value as undef and use the exists and missing filters when you need to consider undef values.

clintongormley commented 11 years ago

Elasticsearch can't represent a null or undef value. Each field can store zero or more values. But if you try to index null|undef then it stores zero values.

A search for false against a boolean field with zero values will fail. For this reason we setup Bool fields to replace null with 0 in the index, to ensure that you can still use undef in your classes, and have it indexed correctly in Elasticsearch.

I'll try to update the docs to explain that better.

clintongormley commented 11 years ago

To add to that - while, yes, Perl uses undef to signify missing, in this context we're talking about a Bool value which has two values: true or false. But setting it to undef or setting it to 0 would result in different query results - not what the Perl user would expect.

timbunce commented 11 years ago

Elasticsearch can't represent a null or undef value. Each field can store zero or more values. But if you try to index null| | undef then it stores zero values.

Okay, I get that. (I'd drop the "But".) Well worth adding to the docs. Perhaps a whole section on undef that Bool etc could refer to.

A search for false against a boolean field with zero values will fail.

That behaviour matches SQL behaviour and is what I'd expect. (Perhaps this at the heart of the issue.)

For this reason we setup Bool fields to replace null with 0 in the index, to ensure that you can still use undef in your classes, and have it indexed correctly in Elasticsearch.

I still don't see the validity of the reason.

in this context we're talking about a Bool value which has two values: true or false.

What about 'Maybe[Bool]' or E:M:D->new({ hash with missing Bool field }) ?

ElasticSearch supports 'missing' values. Effectively all fields are 'nullable' from a traditional DB perspective.

But setting it to undef or setting it to 0 would result in different query results - not what the Perl user would expect.

I'd argue that someone familiar with SQL would, or at least should, expect different query results.

clintongormley commented 11 years ago

Indexing a doc with no value for a Bool field will still result in it being missing - it won't index the null_value instead. (see example below).

I would argue (and a quick survey in the #moose channel supports my view) that a Perl user who sets a Bool field to undef is expecting it to be queryable as false. Acceptable values for the Bool type constraint in Moose are: undef, "",0,1 while Maybe just adds undef to the list of allowed values (in this case, redundantly).

You can still not-store a value:

$obj = Foo->new();             # null
$obj->bool(undef);              # false
$obj->clear_bool;                # null

So I think that for Bool fields (or even Maybe[Bool]), accepting undef as false is more in line with Perl truthiness, and less likely to cause surprise, than the alternative.

I will, however, update the documentation to explain this more clearly.

I could also provide (and document) a custom type Bool_or_Null which does distinguish between undef and 0. It would do this by mapping the field as { type => 'boolean'} without the null_value setting.

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "bool" : {
               "null_value" : 0,
               "type" : "boolean"
            }
         }
      }
   }
}
'

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
   "bool" : null
}
'

curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '
{}
'

curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
{
   "bool" : 0
}
'

curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
{
   "bool" : 1
}
'

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "facets" : {
      "bool" : {
         "terms" : {
            "field" : "bool"
         }
      }
   },
   "size" : 0
}
'

# {
#    "hits" : {
#       "hits" : [],
#       "max_score" : 1,
#       "total" : 4
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "facets" : {
#       "bool" : {
#          "other" : 0,
#          "terms" : [
#             {
#                "count" : 2,
#                "term" : "F"
#             },
#             {
#                "count" : 1,
#                "term" : "T"
#             }
#          ],
#          "missing" : 1,
#          "_type" : "terms",
#          "total" : 3
#       }
#    },
#    "took" : 1
# }
timbunce commented 11 years ago

Indexing a doc with no value for a Bool field will still result in it being missing - it won't index the null_value instead. (see example below).

So null_value applies only to an explicit (present) undef not an implicit (missing) one. Worth clarifying in the null_value docs.

I would argue (and a quick survey in the #moose channel supports my view) that a Perl user who sets a Bool field to undef is expecting it to be queryable as false.

From a pure-Moose perspective I could agree with that for plain 'Bool' (except that I'd expect Moose to throw a type constraint error), but from a database perspective I can't.

More specifically, I'm concerned that in order to simplify querying data has been altered when stored. That data loss is a big problem. (See below for an example.)

Acceptable values for the Bool type constraint in Moose are: undef, "",0,1 while Maybe just adds undef to the list of allowed values (in this case, redundantly).

It's the 'redundantly' that we're arguing over :)

Here's a key question: given a Moose attribute which isa=>'Maybe[Type]', which other basic Moose types do not store an undef as undef?

I was rather surprised to find that the built-in Moose Bool type allows, stores, and returns undef even without a Maybe[]:

use Moose; use Data::Dumper; has b => (is => 'rw', isa => 'Bool'); warn Dumper(main->new(b => undef)->b);

I'd be tempted to call that a Moose mis-feature (I'd expect a type constraint error like I'd get for 'Int') but, either way, getting an undef back supports my position that in E:M 'Maybe[Bool]' shouldn't coerce.

You can still not-store a value: $obj = Foo->new(); # null $obj->clear_bool; # null

That's fine (and is what I'm doing as a workaround) but requires unnatural steps.

So I think that for Bool fields (or even Maybe[Bool]), accepting undef as false is more in line with Perl truthiness, and less likely to cause surprise, than the alternative.

Moose accepts undef as false but doesn't change it. My issue with E:M is that it chooses to change data on input as a strategy for simplifying a query use-case. I think that's the wrong place to address the problem.

Consider a database with boolean fields for things like has_feature_x and has_feature_y. True means yes, false means no, and NULL means we don't know. Here "we don't know" is very different from "false" and it's important not to loose that distinction. The developer writes some code to select the data from the database and store it in ES. The null "we don't know" values get silently corrupted into "false". That's just like taking a Red, Green, Blue enum and silently changing the Blues to Greens just because it's tricky to query Blues.

I could also provide (and document) a custom type Bool_or_Null which does distinguish between undef and 0. It would do this by mapping the field as { type => 'boolean'} without the null_value setting.

As should be clear from the above, I see the silent data changing behaviour of the existing Bool type as a serious bug. If it's kept it should have a very clear warning on the Bool docs. (Even then, I think it's the kind of thing that people will warn each other about as a significant "Gotcha" with E:M and spoil its reputation.)

What I'd like to see:

clintongormley commented 11 years ago

I agree that there is a conflict here between the two views: that of a Perl (and Moose) developer, and that of a DB person. Depending on which side you come from, one view is incorrect.

The Bool type in Moose has always accepted undef, presumably because Perl defines falsehood as undef, the empty string or 0 - http://perldoc.perl.org/perlsyn.html#Truth-and-Falsehood

Yesterday we discussed this in the #moose channel and the general feeling was that setting a Bool to undef should imply falsehood, rather than "not set", and that if you want a trinary type, you should be specific about it. As peregrin said "if I were using a Perl library I'd expect Perl truthiness."

Moose advocates using predicates and clearers ($obj->has_foo, $obj->clear_foo) for missing values, rather than setting them to undef.

There are a few uses of Maybe[Bool] on cpan - http://grep.cpan.me/?q=Maybe\s*\[\s*Bool - some of which are then used incorrectly.

So given that (a) Elastic::Model relies heavily on Moose and (b) we're trying to follow the princinple of least astonishment for Moose users,I still feel that, for the Bool type, we should regard undef as false.

That said, by specifically setting the type to Maybe[Bool] (even though in Moose this is redundant) it would make sense to regard undef as missing, as the specific use of Maybe indicates an intent.

What I'd like to see:

isa => 'Bool' should throw an exception on undef.

That is up to the Moose guys, but I think would break a huge amount of code, and they would be unlikely to accept that change.

isa => 'Maybe[Bool]' should store undef as null_value/missing.

Agreed - I'll implement that

document the need to take care when matching fields with null/missing values

Agreed