berinhard / model_mommy

No longer maintained, please migrate to model_bakery
http://model-bakery.readthedocs.org/
Other
903 stars 141 forks source link

CharFields are only filled with ASCII characters #270

Closed mbox closed 8 years ago

mbox commented 8 years ago

The default generator for CharFields and other strings only generates values containing ASCII characters.

This hides bugs when handling unicode values e.g. when failing to encode/decode UTF-8 correctly.

Since both Python and Django support unicode strings, it would be much better if the default was to generate strings that contain characters not in ASCII, so any errors handling unicode show up.

berinhard commented 8 years ago

@mbox in my opinion model_mommy's purpose is to define random values for a field type. In this case, it does so when using only ASCII characters without problems. The point is that if the developer needs to ensure encoding errors or validations, it is his/her responsability and not model_mommy's. Besides that, if we do change to use UTF-8 we should be able to comprehend other encondings too and it seems to me that this is outside model_mommy context.

@vandersonmota what do you think?

vandersonmota commented 8 years ago

What @berinhard said

mbox commented 8 years ago

If I understand what you're saying, you feel that model_mommy's job is to fill a field with a valid value and it doesn't matter what that value is? The existing code goes to some lengths to make sure it generates a variety of different values. Why does it do this if the idea isn't to find some bugs that would otherwise be hidden?

Django's CharField is explicitly designed to store and return unicode strings - hence the values used to fill a CharField should be unicode. While ASCII is a strict subset of unicode, only using ASCII values is the same as only ever filling an IntegerField with 1, or a string with A.

model_mommy doesn't do that - so why not fill CharFields with something that reflects the real range of values that a program can accept? Note this isn't finding specific bugs - I agree that coders should add explicit tests for things they know need testing. Nor does it require picking a specific encoding.

Here's a generator that does what I'm suggesting:

def gen_unicode(max_length):
    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        (0x0021, 0x0021),
        (0x0023, 0x0026),
        (0x0028, 0x007E),
        (0x00A1, 0x00AC),
        (0x00AE, 0x00FF),
        (0x0100, 0x017F),
        (0x0180, 0x024F),
        (0x2C60, 0x2C7F),
        (0x16A0, 0x16F0),
        (0x0370, 0x0377),
        (0x037A, 0x037E),
        (0x0384, 0x038A),
        (0x038C, 0x038C),
    ]

    alphabet = [
        get_char(code_point) for current_range in include_ranges
        for code_point in range(current_range[0], current_range[1] + 1)
        ]
    return ''.join(random.choice(alphabet) for i in range(max_length))

gen_unicode.required = ['max_length']

mommy.default_mapping[CharField] = gen_unicode

Why would using this be worse than using random.choice(string.ascii_letters)?