Character points and Counts

HughP commented 6 years ago

@iandoug

In #6 you say

Just trying to persuade PHP to show me the U+... form, think I need to switch to PHP 7 to get that functionality. Or try and find them in the hex editor.

FYI, the tool I use to get my character counts is UnicodeCCount. It was originally created by Bob Hallissy of SIL, and I have recently taken an interest in adding some options. You can find my version here: https://github.com/HughP/UnicodeCCount . It is a Perl script, so $ chmod +x it and then use it on the command line.

iandoug commented 6 years ago

Thanks. Haven't done much in Perl lately because I work mostly in PHP, which is easier to read/debug, if not as powerful for playing with text. Will take a look.

iandoug commented 6 years ago

At the moment that script seems to basically count the chars and print the result out. I see there is a -o option to write to a file. I have my own version which basically does the same thing (i.e. counts chars and prints them out with their Unicode code point).

In these sorts of cases I normally write the files out using the tilde character as a divider (instead of tab or comma) and then let LibreOffice import it; it gives you the option to specify non-standard divider characters. In this case I would probably use the "not" ¬ character, since that's not in the corpus.

Once in a spreadsheet it's easy to play with. It's also possible to swap in replacements for some characters, like ⍽ for space (or simply [space]), and likewise ⇥ or some such for Tab.
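For anyone following along, the count-and-export idea above can be sketched roughly like this in Python. This is not the PHP program or UnicodeCCount itself; the function name `count_rows`, the `VISIBLE` table and the `[LF]`/`[CR]` placeholders are made up for illustration, and the column order (character, code point, count) is an assumption.

```python
from collections import Counter

# Hypothetical display substitutes for characters that would be
# invisible in a spreadsheet cell (names and choices are assumptions).
VISIBLE = {" ": "⍽", "\t": "⇥", "\n": "[LF]", "\r": "[CR]"}


def count_rows(text, delimiter="¬"):
    """Return one delimited row per distinct character, most frequent first."""
    if delimiter in text:
        # The divider only works if it never occurs in the corpus.
        raise ValueError("delimiter occurs in the text; pick another one")
    rows = []
    for char, n in Counter(text).most_common():
        code_point = f"U+{ord(char):04X}"  # e.g. U+0062 for "b"
        shown = VISIBLE.get(char, char)
        rows.append(delimiter.join([shown, code_point, str(n)]))
    return rows


if __name__ == "__main__":
    print("\n".join(count_rows("a b\tb\n")))
```

The rows can be written to a file and imported into LibreOffice with ¬ set as the custom separator.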

I read your wishlist, but most of it went over my head due to my lack of knowledge of linguistics. :-)

PHP is behaving itself for multibyte things now ... and I think tonight I saw why my Perl french-stripper program was behaving strangely: I had not told it that it was dealing with UTF-8 stuff. I did it in Perl instead of PHP because your script calls Perl scripts. I see there are still some sections of French/Dan stuff left, but they are not like the others. Will add tests to remove them.

iandoug commented 6 years ago

Okay, I regenerated your corpus using your bash file, then had some issues playing with it until I discovered that it somehow now has Mac end-of-line markers...

Anyway, here is a quick-and-dirty PHP proggie and the resulting .csv, delimited by the ¬ character. It prints out character counts suitable for a spreadsheet (or wrapping fish, if you like MAD magazine).

There are still French sections in the text. Probably we should get rid of the separate combining diaeresis characters and merge them onto the preceding character where possible (except hook v).
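One possible way to do the merge is Unicode NFC normalization, sketched below in Python. This is only a suggestion, not something either script does as far as this thread shows. It assumes "hook v" means U+028B (ʋ), which has no precomposed form with a diaeresis, so NFC leaves that pair as two code points. The function name `compose` is invented for the example.

```python
import unicodedata


def compose(text):
    """Merge base letter + combining mark into one code point where Unicode defines one."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for sample in ("e\u0308", "\u028b\u0308"):  # e + diaeresis, hook v + diaeresis
        result = compose(sample)
        print(" ".join(f"U+{ord(c):04X}" for c in result))
```

Note that NFC composes every sequence that has a precomposed form, not just the diaeresis ones, so the character counts for other accented letters may shift as well.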

And all those hyphen-minuses :-)

The other thing that bothers me about the corpus (from a keyboard design point of view) is that the source was converted newspapers, and they brought their columns along for the ride. This results in unusually short lines and excess carriage returns, which messes up the statistical analysis of hand and finger use.

character-counter.zip https://github.com/HughP/dnj-corpus/files/2116930/character-counter.zip

HughP commented 6 years ago

What line endings should there be in the corpus?

iandoug commented 6 years ago

Good question. Usually I strip whatever is in the source material and add the usual "\n" to the end; the program (Perl/PHP/whatever) is then supposed to write the appropriate line ending for the system it's running on.

Not sure what you changed, because the previous time I generated the corpus I didn't pick up that issue (or maybe I just didn't notice, because I didn't do anything with the result that would have exposed it). When I counted the characters I did not strip the end of line because I wanted to count everything, and this may have exposed the difference.

HughP commented 6 years ago

Well, what I did was move all line feeds and carriage returns to U+000D. I noticed the problem in vim, where the characters showed up as ^M.

I changed the script to move U+000D and U+000C to U+000A. But perhaps just doing \n would be better, and let Perl decide on the encoding.... I'm still trying to delete empty lines... there seem to be quite a few at the end of the corpus.
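As a rough illustration of that mapping, here is a Python sketch (not the actual bash/Perl step in the repository). The function name `normalize_newlines` is hypothetical, and treating a CR LF pair as a single line break is an added assumption so that Windows-style endings do not turn into double newlines.

```python
import re

# CR LF pairs, lone carriage returns (U+000D) and form feeds (U+000C).
# The pair is listed first so it is replaced by a single line feed.
_LINE_BREAK = re.compile(r"\r\n|\r|\f")


def normalize_newlines(text):
    """Map every matched line break to a single line feed (U+000A)."""
    return _LINE_BREAK.sub("\n", text)


if __name__ == "__main__":
    print(repr(normalize_newlines("one\rtwo\r\nthree\ffour\n")))
```

When reading the corpus in Python, opening the file with `newline=""` keeps the original line endings intact so that a step like this sees them unchanged.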

iandoug commented 6 years ago

I changed the script to move U+000D and U+000C to U+000A. But perhaps just doing \n would be better, and let Perl decide on the encoding.... I'm still trying to delete empty lines... there seem to be quite a few at the end of the corpus.

Um, I see the last line in your script is meant to use a cat trick to remove them... I suspect that the lines are not actually empty and may contain a space (followed by line end).

I'm actually busy morphing all your regex script commands into a PHP program and have done about 70%, I guess. It also removes all the French/Dan lesson stuff, as best I can see. (Removed 21 sections of French.)

I'll see if the cat trick works on the end result; otherwise I can make it part of the PHP program (if the current line is blank and the previous line was blank, don't write the current line). Cat would be easier if it worked.
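The blank-line rule in parentheses can be sketched as follows in Python (the real program is in PHP, so this is only an illustration). The function name `squeeze_blank_lines` is hypothetical. Treating whitespace-only lines as blank, and writing them back out as truly empty lines, are assumptions based on the suspicion above that the "empty" lines contain a space.

```python
def squeeze_blank_lines(lines):
    """Drop a blank line whenever the previous line was also blank."""
    out = []
    previous_blank = False
    for line in lines:
        # A line holding only spaces or tabs counts as blank here.
        blank = line.strip() == ""
        if blank and previous_blank:
            continue  # skip the second and later blanks in a run
        out.append("" if blank else line)
        previous_blank = blank
    return out


if __name__ == "__main__":
    print(squeeze_blank_lines(["a", "", " ", "", "b", " ", ""]))
```

If the cat trick is something like `cat -s`, it only squeezes lines that are completely empty, which would explain why lines containing a stray space survive it.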

L8tr :-)

HughP commented 6 years ago

This seems to have been taken care of in the most recent version of the .bash script.

HughP / dnj-corpus

Character points and Counts #9