Proposal for ascii - Githubissues

ivan-pi commented 4 years ago

This module should include functions for character classification and conversion (lower, upper). I have prepared a basic implementation at https://github.com/ivan-pi/fortran-ascii.

The plan is to cover the same functionality as found in the C, C++, and D libraries:

@Zbeekman has already opened an issue (see https://github.com/ivan-pi/fortran-ascii/issues/1) on dealing with different character kinds. The problem is that the ascii and iso_10646 character sets need not be supported by the compilers. Even if they are supported their bitwise representation might be different from the default kind.

I realized while creating these functions, that agreeing upon a style guide #3 and documentation #4 early on would be helpful to improve future pull requests. Some agreement upon unit testing will also be necessary.

cc: @jacobwilliams

ivan-pi commented 4 years ago

One of the simple points to discuss (connected to the style guide #3) is naming conventions. For example for the function that checks whether a character is a letter some of the possible names are:

is_alpha
isalpha (C/C++)
isAlpha (D language, but in Fortran it's the same as the previous suggestion)

Personally, I like the underscored version (it matches the general verbosity of Fortran).

I have also noticed that in C++, they have a separate set of the these function for "wide" strings (e.g. iswalpha, see https://en.cppreference.com/w/c/string/wide) which if I understand correctly are Unicode characters.

certik commented 4 years ago

@ivan-pi let's move the naming conventions discussion into https://github.com/fortran-lang/stdlib/issues/3#issuecomment-566302138, where I just commented.

zbeekman commented 4 years ago

To address character kinds, this is my comment from the ascii proto-type repo:

If I understand correctly you are suggesting I create several copies of these function to operate on the following character kinds:

Well, not exactly, because, as you noted, they're not guaranteed to exist, and when they do exist, "DEFAULT" is often/usually the same kind as "ASCII", so you can't create overloaded functions with arguments that are "ascii" and "default". (In that case you'd have a duplicate interface.)

That's one nice thing about jin2for: It doesn't assume anything and interrogates the numeric kinds from the compiler to then generate the code. So if only one character kind is supported then your code will only have that one kind. I'll cross post this on the new issue you made.

ivan-pi commented 4 years ago

I see what you mean. If "default" and "ascii" weren't the same kind, then I could just create generic interfaces for all three character kinds (with conditional compilation if the compiler supports UCS using CMake).

How likely are we to meet a processor where "ascii" is not equal to "default"? In Modern Fortran explained (2018) EBCDIC is mentioned. It seems to still be in use on some IBM mainframes: https://www.quora.com/Is-EBCDIC-replaced-by-ASCII-or-Unicode

Since the module is supposed to work for ascii characters, I think it is possible to write all the functions in such a way that they work with either "default" (non-ascii) or "ascii" straight of the box (as long as the processor supports ascii) using the achar and iachar conversion routines.

For the ascii subset of unicode characters it should be possible to create overloaded functions because they have a different kind (more bits). Since there are not so many functions in this module I could just do it manually. We can always bring in jin2for later.

zbeekman commented 4 years ago

For the current use case, I don't think it makes sense to use Jin2For either.

Let's start with default character kinds while we consider if there's anything clever that we can do. We can always do some introspection with CMake and then use configure_file() to achieve overloading.

ivan-pi commented 4 years ago

Is it possible to have the operating system somehow simulate "EBCDIC" defaults or maybe it is possible setup a Docker image? I have no means of testing non-"ascii" defaults at the moment.

ivan-pi commented 4 years ago

I think the way the procedures are written now (https://github.com/ivan-pi/fortran-ascii/blob/master/stdlib_ascii.f90) would work with both "default" and "ascii" characters.

ivan-pi commented 4 years ago

Before I submit a pull request, there are a few more issues worth discussing:

Should these procedure be pure or elemental?
How to handle control characters? Just as character constants? Or would it be better to simulate an enumerator using a derived type? A potential issue I see is that, the control characters constants might end up inadvertently polluting someones namespace if he does use stdlib_ascii and forgets to use the only clause. By wrapping them in a derived-type (singleton) you could only access these via something like ascii_control_char%vt (to access the vertical tab).
Naming convention; in the other thread https://github.com/fortran-lang/stdlib/issues/3 the general agreement was to keep the longer - more explanatory names - with the verb separated by an underscore. I have prepared a table of the functions, their proposed name in the Fortran stdlib and their names in other languages:
- I see some potential confusion between the is_white and is_blank functions, also given that they have a different name in C/C++. The is_white returns true for the characters including the space, tab, vertical tab, form feed, carriage return, and linefeed characters. The is_blank (or isblank in C) was standardized in C99 and returns true if the character is the space character ' ' or the tab character. It would be interesting to know, what was the reason to introduce this as a separate function.
- Personally, I am happy with the verbose underscored names, but I thought it is still worth discussing whether or not we would prefer the shorter seven-letter abbreviations.

Purpose	Fortran	C/C++	D	Python*
checks if a character is alphabetic	`is_alpha`	`isalpha`	`isAlpha`	`isalpha`
checks if a character is alphanumeric	`is_alphanum`	`isalnum`	`isAlphaNum`	`isalnum`
checks if a character is in the ASCII character set	`is_ascii`	/	`isASCII`	`isascii`
checks if a character is a control character	`is_control`	`iscntrl`	`isControl`
checks if a character is a digit	`is_digit`	`isdigit`	`isDigit`	`isdigit`
checks if a character is a octal character	`is_octal_digit`	/	`isOctalDigit`
checks if a character is a hexadecimal character	`is_hex_digit`	`isxdigit`	`isHexDigit`
checks if a character is a punctuation character	`is_punctuation`	`ispunct`	`isPunctuation`
checks if a character is a graphical character	`is_graphical`	`isgraph`	`isGraphical`
checks if a character is a printing character	`is_printable`	`isprint`	`isPrintable`	`isprintable`
checks if a character is an uppercase character	`is_upper`	`isupper`	`isUpper`	`isupper`
checks if a character is lowercase	`is_lower`	`islower`	`isLower`	`islower`
checks if a character is a whitespace character	`is_white`	`isspace`	`isWhite`	`isspace`
checks if a character is a blank character (space or tab)	`is_blank`	`isblank`	/
converts a character to lowercase	`to_lower`	`tolower`	`toLower`
converts a character to uppercase	`to_upper`	`toupper`	`toUpper`
				Note*: In Python these are bound methods of the built-in string type

Another issue is what should be the behavior for non-ASCII characters? In D the function returns false for non-ascii characters, and the toX functions do nothing. This is probably a separate issue to discuss, once we work out the best way to also support the UCS character set.

certik commented 4 years ago

Thanks @ivan-pi for submitting a PR with this! Much appreciated. Let's discuss the API.

jvdp1 commented 4 years ago

Thanks @ivan-pi for this nice implementation. Question: should the long cases (implemented in the test) be also present in the module? or do we expect that the user would implement them for their own use?

ivan-pi commented 4 years ago

Perhaps I have not understood your question correctly. The long tests generally loop through all characters (therefore they are "long") and are meant only as unit tests to verify the correctness of the functions. Since the short/long adjectives might lead to some confusion I would not mind if we rename them.

The last test - test_ascii_table - reproduces the table from the C++ functions available at https://en.cppreference.com/w/cpp/string/byte

Since the functions follow the same pattern what might be worth including in the module is an abstract interface:

 abstract interface
        pure logical function validation_func_interface(c)
            character(len=1), intent(in) :: c
        end function
 end interface

although apart from my test case which uses an array of procedure pointers to loop through the character validation routines, I don't really know if their would be any use cases.

jvdp1 commented 4 years ago

Since the short/long adjectives might lead to some confusion I would not mind if we rename them.

@ivan-pi I think they are fine. I was not clear enough. Here is an example of what I meant:

character(len=30)::examplelower,exampleupper
example='abc'
exampleupper=to_upper(examplelower)
print*,trim(exampleupper) !it prints 'ABC'

we can discuss and see later if it could be useful or not.

ivan-pi commented 4 years ago

This would be useful and indeed and I would not mind preparing some examples for users. The way I see this done with other languages/libraries is to add usage examples in the documentation string. We should work this out under issue #4 . These could be integrated into a documentation website, kind of like with Sympy or D (see here. I am sure @milancurcic or @certik have some ideas how to publish the API and documentation as a website in the future.

certik commented 4 years ago

@ivan-pi ideally in the future we could write doctests just like in Python, as I just commented at https://github.com/fortran-lang/stdlib/issues/4#issuecomment-568552286.

milancurcic commented 4 years ago

@ivan-pi Thank you for your work and sorry to be late to this thread, I missed reading it all the way through.

Should these procedure be pure or elemental?

Overall, if a pure procedure can be made also elemental I think there are only benefits to it and not penalties, so we should do it even when use cases are not apparent.

I just checked and confirmed that they can all be made elemental as well. We can do it in a separate PR if the community agrees.

ivan-pi commented 4 years ago

One downside of elemental procedures is that they cannot be used as procedure pointers (https://stackoverflow.com/questions/15225007/elemental-functions-cannot-be-pointed-to-by-procedure-pointers).

I am not sure whether this matters in practical usage cases or not.

milancurcic commented 4 years ago

Ah, okay, I didn't know. That seems like an important restriction to take into account. We may just have to consider elemental on a case by case basis.

In this case, I don't know the answer as I don't have much experience working with text in Fortran (this may change now that we have ascii module :)). I can imagine it being useful to feed an array of characters to to_upper or to_lower. I can't think of a use case of passing them as proc. pointer.

certik commented 4 years ago

We can start with elemental, and since we are still in "experimental", we can remove elemental if we discover issues.

milancurcic commented 4 years ago

Question regarding control characters. They are currently defined as public parameters:

! All control characters in the ASCII table (see www.asciitable.com).
character(len=1), public, parameter :: NUL = achar(z'00') !! Null
character(len=1), public, parameter :: SOH = achar(z'01') !! Start of heading
... ! 30 more parameters ommitted
character(len=1), public, parameter :: DEL = achar(z'7F') !! Delete

First, consider that a user may just do use stdlib_experimental_ascii, which imports everything, including the control characters (even though not recommended). All of them have names that are 2 or 3 characters long. Some of them could easily clash with other variable names because they're short and somewhat common, for example bs, lf, vt, ff, or cr. This scenario is okay (I think) because then the user is forced to do use stdlib_experimental_ascii, only: ..., which is recommended.

However, consider a user that wants to work specifically with control characters. Their only options are:

use stdlib_experimental_ascii (imports everything, not recommended)
use stdlib_experimental_ascii, only: list, every, single, control, character, here, ... (recommended, but tedious)

Alternatives would be to wrap these in a private derived type ascii_control_characters_type, and publicly expose a parameter instance of the type, ascii_control_characters. Then the use would do use stdlib_experimental_ascii, only: ascii_control_characters, and access them by, say, ascii_control_characters % BEL.

So, are we happy with the current API of control characters or is this a concern?

ivan-pi commented 4 years ago

I agree this is a concern.

Your suggestion is in line with one of my previous comments:

How to handle control characters? Just as character constants? Or would it be better to simulate an enumerator using a derived type? A potential issue I see is that, the control characters constants might end up inadvertently polluting someones namespace if he does use stdlib_ascii and forgets to use the only clause. By wrapping them in a derived-type (singleton) you could only access these via something like ascii_control_char%vt (to access the vertical tab).

Personally, I think we should go with your suggestion. In fact in D, they have something similar (https://dlang.org/phobos/std_ascii.html#ControlChar).

I can prepare a new pull request with your suggestion, modify the procedures to be elemental, and also replace the whitechar function used in the stdlib_experimental_io module.

ivan-pi commented 4 years ago

Today I was listing through the book "Migrating to Fortran 90" by James F. Kerrigan and on page 195 the author uses a derived type to encapsulate a set of cursor control strings useful for communication with an ANSI 3.64 compliant video display terminal.

Given this prior art, I think we can go ahead with the solution @milancurcic has suggested.

milancurcic commented 4 years ago

Thanks @ivan-pi. Another minor nit-pick. Should this:

pure logical function is_alphanum(c)
    character(len=1), intent(in) :: c !! The character to test.
    is_alphanum = (c >= '0' .and. c <= '9') .or. (c >= 'a' .and. c <= 'z') &
        .or. (c >= 'A' .and. c <= 'Z')
end function

be written as this?

pure logical function is_alphanum(c)
    character(len=1), intent(in) :: c !! The character to test.
    is_alphanum = is_digit(c) .or. is_alpha(c)
end function

ivan-pi commented 4 years ago

I agree the second version is cleaner and saves a few characters.

It would be interesting to compare the differences at assembly level with regard to different optimization flags.

jvdp1 commented 4 years ago

Should we think here about getting the most optimized code? Certainly some optimizations are needed. But, as a user, I would first focus on the ease of use. If I would need performance, I would most likely implement what I need myself.

Also, as a developer, I prefer the second solution of @milancurcic . If something must be changed, modifiying only is_digit and is_alpha would be enough, and the developer doesn't need to search for the same code in the code.

ivan-pi commented 4 years ago

I tried to create a benchmark to test the two versions above and it is more difficult than I expected 😦 . With no optimization flags and a low number of runs it seems like the second version is a bit slower as is it invokes two function calls. With -O3 even after reading a file with 150 MB (or 80000000 random characters) I cannot reliably say which version is faster as differences in code layout, CPU temperature, etc., create variations.

Let´s go for the second version then. I will make the changes in the next iteration of my PR after we agree what to do with the character constants in #49.

ivan-pi commented 4 years ago

Since we are already discussing implementation details I was wondering how the character classification functions are defined in the C library. They use a different approach, quoting from Wikipedia:

... the character classification routines are not written as comparison tests. In most C libraries, they are written as static table lookups instead of macros or functions.

This can be done in Fortran by setting up a constant array of 127 integers (say 16-bit) for the set of ascii characters (this could be done with a list of binary literals). The bit values are then used to indicate the different properties of a character (alphabetical, digit, punctuation, control, etc.).

For example if the first bit is used to represent whether the character is alphabetical or not, the is_alpha function would be:

elemental logical function is_alpha(c)
character, intent(in) :: c
integer :: ic
ic = iachar(c)
is_alpha = btest(table(ic),0)  ! access ascii character table
end function

I am not sure though what is the behavior of iachar if passed no ascii-characters, are these simply truncated or is the behavior processor-dependent.

Such a table can be easily generated using the "current" functions:

program gen_ascii_table
    use stdlib_experimental_ascii
    implicit none
    integer :: ascii_table(0:127)
    integer :: i
    character(len=1) :: c

    ! initialize all bits to zero
    ascii_table = 0

    do i = 0, 127
        c = achar(i)
        if (is_alpha(c))       ascii_table(i) = ibset(ascii_table(i),0)
        if (is_digit(c))       ascii_table(i) = ibset(ascii_table(i),1)
        if (is_alphanum(c))    ascii_table(i) = ibset(ascii_table(i),2)
        if (is_punctuation(c)) ascii_table(i) = ibset(ascii_table(i),3)
        if (is_control(c))     ascii_table(i) = ibset(ascii_table(i),4)
        if (is_graphical(c))   ascii_table(i) = ibset(ascii_table(i),5)
        if (is_printable(c))   ascii_table(i) = ibset(ascii_table(i),6)
        if (is_white(c))       ascii_table(i) = ibset(ascii_table(i),7)
        if (is_blank(c))       ascii_table(i) = ibset(ascii_table(i),8)
        if (is_lower(c))       ascii_table(i) = ibset(ascii_table(i),9)
        if (is_upper(c))       ascii_table(i) = ibset(ascii_table(i),10)
        if (is_octal_digit(c)) ascii_table(i) = ibset(ascii_table(i),11)
        if (is_hex_digit(c))   ascii_table(i) = ibset(ascii_table(i),12)
    end do
    write(*,'(A,128(I0,:,","))',advance='no') "[",(ascii_table(i),i=0,127)
    write(*,'(a1)') "]"
end program

The table of integers is then:

[16,16,16,16,16,16,16,16,16,400,144,144,144,144,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,16,448,104,104,104,104,104,104,104,104,104,104,104,104,104,104,104,6246,6246,6246,6246,6246,6246,6246,6246,4198,4198,104,104,104,104,104,104,104,5221,5221,5221,5221,5221,5221,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,1125,104,104,104,104,104,104,4709,4709,4709,4709,4709,4709,613,613,613,613,613,613,613,613,613,613,613,613,613,613,613,613,613,613,613,613,104,104,104,104,16]

As you can see different integers correspond to different character properties (e.g. 613 are lowercase letters which are note hex digits, 104 are punctuation characters, 6246 are octal digits...).

Is there any reason we would prefer to go down this bit route?

ivan-pi commented 4 years ago

I am not sure though what is the behavior of iachar if passed no ascii-characters, are these simply truncated or is the behavior processor-dependent.

Trying to compile the program

program test
use stdlib_experimental_ascii, only: is_alpha
implicit none
print *, iachar('ä'), iachar('à'), is_alpha('ä'), is_alpha('à')
end program

I get the following errors with gfortran:

test2.f90:4:16:

 print *, iachar('ä'), iachar('à'), is_alpha('ä'), is_alpha('à')
                1
Error: Argument of IACHAR at (1) must be of length one
test2.f90:4:30:

 print *, iachar('ä'), iachar('à'), is_alpha('ä'), is_alpha('à')
                              1
Error: Argument of IACHAR at (1) must be of length one

However for the is_alpha function which only accepts a character(len=1) the compiler does not flag an error. Any idea if this is supposed to be this way?

zbeekman commented 4 years ago

On my phone so I’ll be brief/terse and can’t easily look at the code.

is_alpha() might be elemental? Non-ascii text takes up multiple bytes, ascii just one byte. The compiler is seeing two (or more) byte sequences and thinks that they are multiple 1-byte ascii chars.

Also, FYI, IIRC: compilers may not be required to handle non-ascii characters by the standard. I seem to recall this to be true from when I implemented Unicode support in JSON-Fortran.

ivan-pi commented 4 years ago

is_alpha() might be elemental? Non-ascii text takes up multiple bytes, ascii just one byte. The compiler is seeing two (or more) byte sequences and thinks that they are multiple 1-byte ascii chars.

If I make the procedures pure instead of elemental, I am still "allowed" to pass non-ascii characters to the non-intrinsic is_alpha routine. Maybe I should check what is with the file encoding. I am working in a German locale.

Also, FYI, IIRC: compilers may not be required to handle non-ascii characters by the standard. I seem to recall this to be true from when I implemented Unicode support in JSON-Fortran.

In that case is the best we can do to simply state in the documentation the behavior is undefined for non-ascii symbols?

zbeekman commented 4 years ago

Sorry, I wasn't entirely clear in my previous comment. My recollection is that compilers are not required to handle non-ascii characters in program source code. But I may be mistaken here.

I seem to recall having to use the backslash notation with GFortran:

-fbackslash
Change the interpretation of backslashes in string literals from a single backslash character to “C-style” escape characters. The following combinations are expanded \a, \b, \f, \n, \r, \t, \v, \\, and \0 to the ASCII characters alert, backspace, form feed, newline, carriage return, horizontal tab, vertical tab, backslash, and NUL, respectively. Additionally, \xnn, \unnnn and \Unnnnnnnn (where each n is a hexadecimal digit) are translated into the Unicode characters corresponding to the specified code points. All other combinations of a character preceded by \ are unexpanded.

https://gcc.gnu.org/onlinedocs/gfortran/Fortran-Dialect-Options.html

zbeekman commented 4 years ago

Also, file encoding issues may be in play here, as you noted.

ivan-pi commented 4 years ago

@dev-zero wrote in https://github.com/fortran-lang/stdlib/pull/32#issuecomment-618932685:

since I've been pointed here, this project might be interesting: https://github.com/lemire/fastvalidate-utf-8 althought I don't see how that could be implemented in Fortran given missing inline assembly.

Thanks for your suggestion! We could also just directly call the C routines.

Over at https://github.com/ivan-pi/fortran-ascii/tree/master I've actually prepared 4 different versions of the character validation routines (three in Fortran, and one directly calling the C routines). I've done some micro-benchmarking and the differences can be up to a factor of 4. C++ still comes out a tiny bit faster for some reason. I'm writing a blog post about it (hopefully I finish it this weekend).

certik commented 4 years ago

I use the following simple routines in C++ to encode and decode unicode strings: https://github.com/certik/terminal/blob/69ee07e5aee2fe4c4bff4fa164364ec049c66069/terminal.h#L430

Here is how to use the utf8_decode_step routine: https://github.com/certik/terminal/blob/69ee07e5aee2fe4c4bff4fa164364ec049c66069/terminal.h#L493. If you are only interested to know if it is a valid string, one can return true / false instead of rising the exception.

I wrote the codepoint_to_utf8 routine myself, but I took the utf8_decode_step one from somewhere else --- I think these might be pretty fast and competitive and are very simple to implement in Fortran I think.

ivan-pi commented 4 years ago

I've done some testing of different implementation approaches of the character validation routines:

direct approach (the one currently in stdlib) using relational operators and the character collating sequence
using select case constructs
using a lookup table and bitmask operations (the way it's typically done in the C/C++ library)
interface to the C standard library

If anyone is interested, the results are available here (scroll up for a description). There is no clear winner (besides C++).

The results do change around 5 % if I switch compiler flags, or even if I change the order of comparison in certain relational operations. I did not check what's going on at assembly level, and I think even the timing routines might skew the results somehow (e.g. the Fortran timings improved when I switched from cpu_time to system_count)

If anyone has some suggestions, how to improve the routines or make the measurements more accurate, please open an issue at my repository: https://github.com/ivan-pi/fortran-ascii I'd also be interested in results from other compilers, operating systems or processors.

milancurcic commented 3 years ago

@ivan-pi can this issue be closed or is the ascii module still an open-ended project?

ivan-pi commented 3 years ago

The specifications of the character validation routines are still missing.

I will see to have them done soon.

aman-godara commented 3 years ago

Hey @ivan-pi, If you need any help with coding or anything else regarding this issue, please feel free to allot me tasks. Thank you!

fortran-lang / stdlib

Proposal for ascii #11