Closed platbr closed 6 years ago
https://github.com/dmendel/bindata/wiki/FAQ#how-do-i-use-string-encodings-with-bindata
Also, your test string 'ÃÃÃÃÃ' is 10 bytes in length, not 5.
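The difference between the two counts can be checked directly in Ruby. This is only an illustrative sketch; the variable names are made up for the example, and the string is written with `\u` escapes so that it is UTF-8 regardless of the source file's encoding.

```ruby
# Five copies of U+00C3 (Ã), which UTF-8 encodes as two bytes each.
text = "\u00C3" * 5

char_count = text.length    # number of characters
byte_count = text.bytesize  # number of bytes in the string's encoding

puts "#{char_count} characters, #{byte_count} bytes"
# prints "5 characters, 10 bytes"
```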
@dmendel You are right, but I failed to explain the real problem. I'm importing a position-based (fixed-width) TXT file that was encoded in UTF-8. A field can contain multibyte characters, but each one counts as just ONE position.
BinData uses io.readbytes(len), so when it reaches a multibyte character it seeks to the wrong position, because it reads bytes, not characters.
Example (field definitions, then two sample records):
string :id, length: 2
string :a_utf_8_field, length: 5
string :other_field, length: 2
01ÃBBBB00
02ÃÃBBB00
Because a_utf_8_field can occupy a variable number of bytes, I can't define its length in bytes.
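To make the misalignment concrete, here is a small sketch that splits the two sample records by byte counts (2, 5, 2), the way a byte-oriented reader would. It uses plain `StringIO#read` rather than BinData itself, and `split_by_bytes` is a hypothetical helper written only for this illustration.

```ruby
require "stringio"

# Hypothetical helper: cut a record into 2 + 5 + 2 BYTES, ignoring characters.
def split_by_bytes(record)
  io = StringIO.new(record.b)                 # view the record as raw bytes
  id    = io.read(2)
  field = io.read(5).force_encoding("UTF-8")  # 5 bytes, not 5 characters
  other = io.read(2)
  [id, field, other]
end

split_by_bytes("01\u00C3BBBB00")        # => ["01", "ÃBBB", "B0"]
split_by_bytes("02\u00C3\u00C3BBB00")   # => ["02", "ÃÃB", "BB"]
```

In both records the last field comes out wrong ("B0" and "BB" instead of "00"), because every two-byte Ã shortens the middle field by one character.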
I solved this problem by converting the TXT file to ISO8859-1, which has no multibyte characters, and creating a type that reads ISO8859-1 and outputs UTF-8.
require "bindata/base_primitive"

module BinData
  class ISOtoUTF8String < BinData::BasePrimitive
    arg_processor :string

    optional_parameters :read_length, :length, :trim_padding, :pad_front, :pad_left
    default_parameters pad_byte: "\0"
    mutually_exclusive_parameters :read_length, :length
    mutually_exclusive_parameters :length, :value

    def initialize_shared_instance
      if (has_parameter?(:value) || has_parameter?(:asserted_value)) &&
         !has_parameter?(:read_length)
        extend WarnNoReadLengthPlugin
      end
      super
    end

    def assign(val)
      super(iso88591_string(val))
    end

    def iso88591_string(str)
      str.force_encoding('iso-8859-1')
    end

    def snapshot
      # override to trim padding
      snap = super
      snap = clamp_to_length(snap)

      if get_parameter(:trim_padding)
        trim_padding(snap).encode('utf-8')
      else
        snap.encode('utf-8')
      end
    end

    #---------------
    private

    def clamp_to_length(str)
      str = iso88591_string(str)

      len = eval_parameter(:length) || str.length
      if str.length == len
        str
      elsif str.length > len
        str.slice(0, len)
      else
        padding = (eval_parameter(:pad_byte) * (len - str.length))
        if get_parameter(:pad_front)
          padding + str
        else
          str + padding
        end
      end
    end

    def trim_padding(str)
      if get_parameter(:pad_front)
        str.sub(/\A#{eval_parameter(:pad_byte)}*/, "")
      else
        str.sub(/#{eval_parameter(:pad_byte)}*\z/, "")
      end
    end

    def value_to_binary_string(val)
      clamp_to_length(val)
    end

    def read_and_return_value(io)
      len = eval_parameter(:read_length) || eval_parameter(:length) || 0
      io.readbytes(len)
    end

    def sensible_default
      ""
    end
  end

  class StringArgProcessor < BaseArgProcessor
    def sanitize_parameters!(obj_class, params)
      params.warn_replacement_parameter(:initial_length, :read_length)
      params.must_be_integer(:read_length, :length)
      params.rename_parameter(:pad_left, :pad_front)
      params.sanitize(:pad_byte) { |byte| sanitized_pad_byte(byte) }
    end

    #-------------
    private

    def sanitized_pad_byte(byte)
      pad_byte = byte.is_a?(Integer) ? byte.chr : byte.to_s
      if pad_byte.bytesize > 1
        raise ArgumentError, ":pad_byte must not contain more than 1 byte"
      end
      pad_byte
    end
  end

  # Warns when reading if :value && no :read_length
  module WarnNoReadLengthPlugin
    def read_and_return_value(io)
      warn "#{debug_name} does not have a :read_length parameter - returning empty string"
      ""
    end
  end
end
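The transcoding step behind this workaround can be sketched in plain Ruby with `String#encode`. The variable names are illustrative only, and the sketch assumes every character in the file exists in ISO-8859-1; for characters outside that set `encode` raises `Encoding::UndefinedConversionError`, so the workaround only fits data limited to Latin-1.

```ruby
utf8_line = "01\u00C3BBBB00"                   # sample record, UTF-8

# Transcode so that every character occupies exactly one byte.
latin1_line = utf8_line.encode("ISO-8859-1")

utf8_bytes   = utf8_line.bytesize              # 10: Ã takes two bytes
latin1_bytes = latin1_line.bytesize            # 9: one byte per character

# Converting back restores the original UTF-8 text.
round_trip = latin1_line.encode("UTF-8")
```

With one byte per character, byte offsets and character positions coincide, which is why a byte-oriented reader then lines up with the fixed-width layout.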
Maybe creating a UTF-8 string type that seeks based on characters would be more elegant.
Thanks for sharing your work.
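A character-based read along those lines could look roughly like the sketch below. It is not part of BinData: `utf8_sequence_length` and `read_utf8_chars` are hypothetical helpers, and the sketch reads from a plain `StringIO` with `read`. Inside a BinData type the same loop would presumably live in `read_and_return_value` and call `io.readbytes` instead, but that is an assumption, not something tested here. The lead-byte check is also minimal and does not fully validate UTF-8.

```ruby
require "stringio"

# Hypothetical helper: length of a UTF-8 sequence, judged by its lead byte.
def utf8_sequence_length(lead_byte)
  case lead_byte
  when 0x00..0x7F then 1
  when 0xC0..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF7 then 4
  else raise ArgumentError, format("invalid UTF-8 lead byte 0x%02X", lead_byte)
  end
end

# Hypothetical helper: read exactly char_count UTF-8 characters from a byte stream.
def read_utf8_chars(io, char_count)
  bytes = String.new                      # empty binary buffer
  char_count.times do
    lead = io.read(1)
    raise EOFError, "stream ended before #{char_count} characters" if lead.nil?

    extra = utf8_sequence_length(lead.getbyte(0)) - 1
    rest  = io.read(extra)                # read(0) returns "" for 1-byte characters
    raise EOFError, "truncated UTF-8 sequence" if rest.nil? || rest.bytesize < extra

    bytes << lead << rest
  end
  bytes.force_encoding("UTF-8")
end

io    = StringIO.new("02\u00C3\u00C3BBB00".b)
id    = read_utf8_chars(io, 2)   # "02"
field = read_utf8_chars(io, 5)   # "ÃÃBBB"
other = read_utf8_chars(io, 2)   # "00"
```

The trade-off is that the number of bytes consumed is only known after reading, so a field's byte length can no longer be computed up front.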
Is your data source binary or text? If it doesn't have encoded numbers then it's a text format and you'd be better off using a text-based parser such as Racc.
If it's a binary format can you share the description? I haven't yet come across a binary format that references UTF-8 strings as number of characters rather than number of bytes.
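For a fixed-width text file like the one described in this thread, a text-based approach can be as simple as slicing each line by character position. The sketch below does not use Racc; it relies on `String#[]`, which counts characters rather than bytes for UTF-8 strings. `FIELDS` and `parse_fixed_width` are hypothetical names, with widths taken from the example layout (2, 5, 2).

```ruby
# Field widths in CHARACTERS, in file order (taken from the example layout).
FIELDS = { id: 2, a_utf_8_field: 5, other_field: 2 }.freeze

# Hypothetical helper: slice one UTF-8 line into named fields by character offset.
def parse_fixed_width(line)
  offset = 0
  FIELDS.each_with_object({}) do |(name, width), row|
    row[name] = line[offset, width]   # character-based slice
    offset += width
  end
end

parse_fixed_width("02\u00C3\u00C3BBB00")
# => {:id=>"02", :a_utf_8_field=>"ÃÃBBB", :other_field=>"00"}
```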
Is there any way to make bindata work with UTF-8 multibyte characters (accents)?
I was expecting it to return "ÃÃÃÃÃ".
What do I need to do?