Provide String to ByteString conversion using ASCII encoding

Kotlin / kotlinx-io

Kotlin multiplatform I/O library

Apache License 2.0


Provide String to ByteString conversion using ASCII encoding #170

Open SPC-code opened 1 year ago

SPC-code commented 1 year ago

In protocol parsing and writing, we frequently need to work with one-byte-encoded strings that are expected to consist only of ASCII characters. Please add the ability to convert a string literal to a ByteString using a character-to-byte transformation that checks for non-ASCII characters.

fzhinkin commented 1 year ago

@SPC-code would something like these work for you?

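import kotlinx.io.bytestring.ByteString
import kotlinx.io.bytestring.encodeToByteString

// Encode as UTF-8, then verify that every character took exactly one byte.
fun String.encodeToAsciiByteString(): ByteString {
    val bstr = this.encodeToByteString()
    // UTF-8 uses one byte per character only for ASCII, so the sizes match only for pure-ASCII input.
    require(bstr.size == length) { "String is not an ASCII string: $this" }
    return bstr
}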

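import kotlinx.io.bytestring.ByteString
import kotlinx.io.bytestring.buildByteString

// Convert character by character and reject anything outside the ASCII range.
fun String.encodeToAsciiByteString(): ByteString {
    return buildByteString(length) {
        this@encodeToAsciiByteString.forEach {
            // Char.code is never negative, so only the upper bound (127) needs checking.
            if (it.code > Byte.MAX_VALUE) {
                throw IllegalArgumentException("Character could not be encoded using ASCII: $it")
            }
            append(it.code.toByte())
        }
    }
}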
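
The size comparison in the first snippet relies on a property of UTF-8: only ASCII characters encode to a single byte, so the byte count equals the character count exactly when the string is pure ASCII. The following is a minimal sketch of that reasoning using only the Kotlin standard library; the helper names `isAsciiByUtf8Length` and `isAsciiByCharCode` are hypothetical and exist only for illustration.

```kotlin
// Hypothetical helper: true when the UTF-8 encoding uses exactly one byte per Char.
fun isAsciiByUtf8Length(s: String): Boolean = s.encodeToByteArray().size == s.length

// Hypothetical helper: direct check, every Char must be in the range 0..127.
fun isAsciiByCharCode(s: String): Boolean = s.all { it.code <= 0x7F }

fun main() {
    val samples = listOf(
        "GET / HTTP/1.1",   // pure ASCII: 14 chars, 14 bytes
        "",                 // empty string: 0 chars, 0 bytes
        "caf\u00E9",        // U+00E9 takes 2 bytes: 4 chars, 5 bytes
        "\u20AC",           // U+20AC takes 3 bytes: 1 char, 3 bytes
        "\uD83D\uDE00",     // surrogate pair for U+1F600: 2 chars, 4 bytes
    )
    for (s in samples) {
        // Both checks should agree on every input.
        check(isAsciiByUtf8Length(s) == isAsciiByCharCode(s))
        println("${s.length} chars, ${s.encodeToByteArray().size} bytes, ascii=${isAsciiByCharCode(s)}")
    }
}
```

The per-character variant fails at the first offending character and can report it, while the length-based variant encodes the whole string before the check runs.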

SPC-code commented 1 year ago

I've done it differently: https://github.com/SciProgCentre/dataforge-core/blob/2aba1b48dce011906231ba5ab67353f9901cadfa/dataforge-io/src/commonMain/kotlin/space/kscience/dataforge/io/ioMisc.kt#L12-L19

But the important thing is to have this API. The implementation could change in the future.

lppedd commented 1 year ago

Plus an option for extended ASCII would be good to have.

fzhinkin commented 1 year ago

@lppedd could you please elaborate on what you mean by "extended ASCII"?

lppedd commented 1 year ago

@fzhinkin I meant the standard ASCII + the other 128 code points.
But I forgot that the extended part (the additional 128) is not standard, although maybe the general consensus is on the Windows-1252 or ISO 8859-1 charsets.

fzhinkin commented 1 year ago

I believe that such scenarios require an explicit encoding routine that uses Windows-1252 or some other 8-bit encoding. Silently falling back to some default charset is not a great option, as it allows potentially incorrect data to be encoded without anyone noticing the problem. At the moment, there are no particular plans to support charset encodings other than UTF-8.
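
As an illustration of what such an explicit routine might look like in user code, here is a minimal sketch for ISO 8859-1, whose byte values 0x00..0xFF correspond one-to-one to Unicode code points U+0000..U+00FF. The function name `encodeToLatin1Bytes` is hypothetical and is not part of kotlinx-io; the sketch uses only the Kotlin standard library and returns a plain `ByteArray`, which a caller could presumably wrap in a `ByteString` afterwards.

```kotlin
// Hypothetical helper, not a kotlinx-io API: encodes a string as ISO 8859-1 (Latin-1).
// Each Char in U+0000..U+00FF maps to the byte with the same numeric value;
// anything above U+00FF is rejected instead of being silently replaced.
fun String.encodeToLatin1Bytes(): ByteArray {
    val out = ByteArray(length)
    for (i in indices) {
        val code = this[i].code
        require(code <= 0xFF) {
            val hex = code.toString(16).uppercase().padStart(4, '0')
            "Character at index $i cannot be encoded as ISO 8859-1: U+$hex"
        }
        out[i] = code.toByte()
    }
    return out
}

fun main() {
    val bytes = "caf\u00E9".encodeToLatin1Bytes()
    println(bytes.joinToString(" ") { (it.toInt() and 0xFF).toString(16) }) // 63 61 66 e9

    // The euro sign (U+20AC) has no ISO 8859-1 byte, so the call fails.
    val rejected = runCatching { "\u20AC".encodeToLatin1Bytes() }.isFailure
    println(rejected) // true
}
```

The euro sign shows why the charset has to be chosen explicitly: Windows-1252 assigns it the byte 0x80, whereas ISO 8859-1 has no representation for it, so the two "extended ASCII" candidates mentioned above disagree on the same input.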