UTF-8 multibyte characters

platbr commented 6 years ago

Is there any way to make bindata work with UTF-8 multibyte characters (accents)?

obj = BinData::String.new(length: 5)
obj.read('ÃÃÃÃÃ')
puts obj.force_encoding('UTF-8')
ÃÃ�

I was expecting it to return "ÃÃÃÃÃ".

What do I need to do?

dmendel commented 6 years ago

https://github.com/dmendel/bindata/wiki/FAQ#how-do-i-use-string-encodings-with-bindata

Also, your test string 'ÃÃÃÃÃ' is 10 bytes in length, not 5.
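The difference between the two counts can be checked directly in Ruby. This is a minimal sketch, assuming the test string is made of the precomposed character U+00C3, which takes two bytes in UTF-8; the variable names are only for illustration.

```ruby
# Five copies of U+00C3 ("Ã"), written as an escape so the byte count is unambiguous.
s = "\u00C3" * 5

chars = s.length      # number of characters
bytes = s.bytesize    # number of bytes in the UTF-8 encoding

# Keeping only the first 5 bytes cuts the third character in half,
# which is why the output in the question ends in a replacement character.
truncated = s.byteslice(0, 5)
valid = truncated.valid_encoding?

puts "chars=#{chars} bytes=#{bytes} truncated_valid=#{valid}"
# prints: chars=5 bytes=10 truncated_valid=false
```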

platbr commented 6 years ago

@dmendel You are right, but I failed to explain the real problem. I'm importing a position-based (fixed-width) TXT file encoded in UTF-8. A field can contain multibyte characters, but each one counts as just ONE position.

BinData uses io.readbytes(len), so when it reaches a multibyte character it ends up at the wrong offset, because it reads bytes, not characters.

Ex:

string  :id, length: 2
string :a_utf_8_field, length: 5
string :other_field, length: 2

01ÃBBBB00
02ÃÃBBB00

Since a_utf_8_field can occupy a variable number of bytes, I can't define its length in bytes.
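The mismatch can be reproduced on the second sample record without BinData. This is a rough sketch in plain Ruby, assuming the accented character is U+00C3; it only compares byte offsets with character offsets and does not use any BinData API.

```ruby
# Second sample record from the example; \u00C3 is "Ã" (2 bytes in UTF-8).
line = "02\u00C3\u00C3BBB00"

# Byte offsets, as a byte-oriented reader sees them: 2 + 5 + 2 bytes.
by_bytes = [line.byteslice(0, 2), line.byteslice(2, 5), line.byteslice(7, 2)]

# Character offsets, as the file layout intends: 2 + 5 + 2 characters.
by_chars = [line[0, 2], line[2, 5], line[7, 2]]

p by_bytes  # prints ["02", "ÃÃB", "BB"]
p by_chars  # prints ["02", "ÃÃBBB", "00"]
```

With byte offsets, the two extra bytes of the accented characters push the remaining fields out of alignment, so other_field ends up holding "BB" instead of "00".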

I solved this problem by converting the TXT file to ISO-8859-1, which has no multibyte characters, and creating a type that reads ISO-8859-1 and outputs UTF-8:

require "bindata/base_primitive"

module BinData
  class ISOtoUTF8String < BinData::BasePrimitive
    arg_processor :string

    optional_parameters :read_length, :length, :trim_padding, :pad_front, :pad_left
    default_parameters  pad_byte: "\0"
    mutually_exclusive_parameters :read_length, :length
    mutually_exclusive_parameters :length, :value

    def initialize_shared_instance
      if (has_parameter?(:value) || has_parameter?(:asserted_value)) &&
          !has_parameter?(:read_length)
        extend WarnNoReadLengthPlugin
      end
      super
    end

    def assign(val)
      super(iso88591_string(val))
    end

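    def iso88591_string(str)
      # dup first so the caller's string is not re-tagged in place
      # (force_encoding mutates its receiver and fails on frozen strings)
      str.to_s.dup.force_encoding('iso-8859-1')
    end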

    def snapshot
      # override to trim padding
      snap = super
      snap = clamp_to_length(snap)

      if get_parameter(:trim_padding)
        trim_padding(snap).encode('utf-8')
      else
        snap.encode('utf-8')
      end
    end

    #---------------
    private

    def clamp_to_length(str)
      str = iso88591_string(str)

      len = eval_parameter(:length) || str.length
      if str.length == len
        str
      elsif str.length > len
        str.slice(0, len)
      else
        padding = (eval_parameter(:pad_byte) * (len - str.length))
        if get_parameter(:pad_front)
          padding + str
        else
          str + padding
        end
      end
    end

    def trim_padding(str)
      if get_parameter(:pad_front)
        str.sub(/\A#{eval_parameter(:pad_byte)}*/, "")
      else
        str.sub(/#{eval_parameter(:pad_byte)}*\z/, "")
      end
    end

    def value_to_binary_string(val)
      clamp_to_length(val)
    end

    def read_and_return_value(io)
      len = eval_parameter(:read_length) || eval_parameter(:length) || 0
      io.readbytes(len)
    end

    def sensible_default
      ""
    end
  end

  class StringArgProcessor < BaseArgProcessor
    def sanitize_parameters!(obj_class, params)
      params.warn_replacement_parameter(:initial_length, :read_length)
      params.must_be_integer(:read_length, :length)
      params.rename_parameter(:pad_left, :pad_front)
      params.sanitize(:pad_byte) { |byte| sanitized_pad_byte(byte) }
    end

    #-------------
    private

    def sanitized_pad_byte(byte)
      pad_byte = byte.is_a?(Integer) ? byte.chr : byte.to_s
      if pad_byte.bytesize > 1
        raise ArgumentError, ":pad_byte must not contain more than 1 byte"
      end
      pad_byte
    end
  end

  # Warns when reading if :value && no :read_length
  module WarnNoReadLengthPlugin
    def read_and_return_value(io)
      warn "#{debug_name} does not have a :read_length parameter - returning empty string"
      ""
    end
  end
end
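The idea behind the class above, reduced to plain Ruby, looks roughly like this. It is a sketch that assumes every character in the file exists in ISO-8859-1; the variable names are illustrative, and no BinData call is involved.

```ruby
# Sample record in UTF-8; \u00C3 is "Ã".
utf8_line = "02\u00C3\u00C3BBB00"

# Transcode so that every character occupies exactly one byte.
# String#encode raises Encoding::UndefinedConversionError for characters
# that have no ISO-8859-1 equivalent.
latin1_line = utf8_line.encode("ISO-8859-1")

# Byte offsets now coincide with character positions.
raw_field = latin1_line.byteslice(2, 5)

# Convert the extracted field back to UTF-8 for the rest of the application.
field = raw_field.encode("UTF-8")

puts latin1_line.bytesize  # prints 9
puts field                 # prints ÃÃBBB
```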

Maybe creating a UTF-8 string type that seeks based on characters would be more elegant.

Thanks for sharing your work.
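Character-based reading could look something like the sketch below. `read_chars` is a hypothetical helper, not part of BinData, and the example uses a StringIO in place of the real file; it relies on `readchar`, which returns one whole character according to the stream's encoding, whereas `read(len)` counts bytes.

```ruby
require "stringio"

# Hypothetical helper: read up to +count+ characters (not bytes) from +io+.
def read_chars(io, count)
  chars = []
  count.times do
    break if io.eof?          # stop early on a short record
    chars << io.readchar      # one full character, however many bytes it uses
  end
  chars.join
end

# First sample record; \u00C3 is "Ã".
io = StringIO.new("01\u00C3BBBB00")

id    = read_chars(io, 2)
field = read_chars(io, 5)
other = read_chars(io, 2)

p [id, field, other]  # prints ["01", "ÃBBBB", "00"]
```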

dmendel commented 6 years ago

Is your data source binary or text? If it doesn't have encoded numbers then it's a text format and you'd be better off using a text-based parser such as Racc.

If it's a binary format can you share the description? I haven't yet come across a binary format that references UTF-8 strings as number of characters rather than number of bytes.

UTF-8 multibyte characters #104