Closed kwalcock closed 3 years ago
I wonder if 5 or 6 is too low a value for the single char, especially given that most punctuation is a single char and has already been counted. The period at the end of the sentence counts for one and pairs of quotes and parentheses count as two.
nice thought, this is one of the reasons I love working with you. I didn't even think about it. Maybe instead we should check for only _.isLetter and .length == 1 (if you get my drift)
On Wed, Jun 9, 2021 at 3:38 PM Keith Alcock @.***> wrote:
External Email
I wonder if 5 or 6 is too low a value for the single char, especially given that most punctuation is a single char and has already been counted. The period at the end of the sentence counts for one and pairs of quotes and parentheses count as two.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.
I would be (a little) more comfortable with that, I guess. The previous sentence already has 6.
Quite a few tests are failing. The first one I checked had filtered out this sentence:
Crisis (IPC Phase 3) is widespread and
Emergency (IPC Phase 4) outcomes exist in parts of Unity,
Western Bahr el Ghazal,
This was even after the change to the single letter condition. The numbers probably need to be higher or the kinds of symbols different.
OK, i'll look tomorrow, thanks for point this all out!
On Wed, Jun 9, 2021 at 10:45 PM Keith Alcock @.***> wrote:
External Email
Quite a few tests are failing. The first one I checked had filtered out this sentence:
Crisis (IPC Phase 3) is widespread and Emergency (IPC Phase 4) outcomes exist in parts of Unity, Western Bahr el Ghazal,
This was even after the change to the single letter condition. The numbers probably need to be higher or the kinds of symbols different.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.
I could output the counts for each measurement and then look at their distribution for at least the tests so that they can be made to pass. It might be a good idea to check a real corpus eventually.
For unit test texts, here's a histogram for numNonAlpha:
count | instances |
---|---|
0 | 2 |
1 | 7378 |
2 | 1108 |
3 | 733 |
4 | 443 |
5 | 322 |
6 | 480 |
7 | 330 |
8 | 155 |
9 | 52 |
10 | 53 |
11 | 9 |
12 | 20 |
13 | 10 |
14 | 16 |
15 | 1 |
16 | 17 |
17 | 0 |
18 | 13 |
19 | 1 |
20 | 0 |
21 | 1 |
22 | 0 |
23 | 1 |
24 | 2 |
25 | 0 |
26 | 0 |
27 | 0 |
28 | 0 |
29 | 1 |
30 | 0 |
31 | 0 |
32 | 0 |
33 | 2 |
34 | 0 |
35 | 0 |
36 | 0 |
37 | 0 |
38 | 0 |
39 | 0 |
40 | 0 |
41 | 0 |
42 | 0 |
43 | 0 |
44 | 0 |
45 | 0 |
46 | 1 |
47 | 0 |
48 | 0 |
49 | 0 |
50 | 0 |
51 | 0 |
52 | 0 |
53 | 0 |
54 | 0 |
55 | 0 |
56 | 0 |
57 | 0 |
58 | 0 |
59 | 0 |
60 | 0 |
61 | 0 |
62 | 0 |
63 | 0 |
64 | 2 |
65 | 0 |
Here it is for numSingleAlpha:
count | instances |
---|---|
0 | 10384 |
1 | 666 |
2 | 92 |
3 | 8 |
4 | 1 |
5 | 0 |
6 | 0 |
7 | 0 |
8 | 0 |
9 | 0 |
10 | 0 |
11 | 0 |
12 | 0 |
13 | 0 |
14 | 0 |
15 | 2 |
16 | 0 |
The 33 and 15 are for the one sentence we're trying to avoid.
We know all the one-letter English words, don't we? I and a. Perhaps they should not be counted.
Maybe not count valid numbers and punctuation. For purposes of ruling out the single sentence, numSingleAlpha < 10 is sufficient. We're getting too many false fubars otherwise IMO.
sounds fine to me
On Fri, Jun 11, 2021 at 11:07 AM Keith Alcock @.***> wrote:
External Email
Maybe not count valid numbers and punctuation. For purposes of ruling out the single sentence, numSingleAlpha < 10 is sufficient. We're getting too many false fubars otherwise IMO.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.
This code isn't as elegant as that of @BeckySharp, but it does pass the tests.
Do you think this is OK, @BeckySharp? The numSingleAlpha from the table above looks like a more reliable indicator of a problem that causes the program to hang. Because there are some valid 1-letter words in English that I wanted to take into account, the code changed location a little.
Good, valid point. I flipped a coin. Luckily it's only about three lines of code. I imagined a table detector, graph detector, PDF problem detector, ways to combine them, plugins, and language specializations... that would eventually be easier if it was a class. For today, it's close enough for me.
Document my magic number Do a count once instead of filter.length Try lazy val within method