clulab / eidos

Machine reading system for World Modelers
Apache License 2.0
37 stars 24 forks source link

Be nitpicky about it #1035

Closed kwalcock closed 3 years ago

kwalcock commented 3 years ago

Document my magic number Do a count once instead of filter.length Try lazy val within method

kwalcock commented 3 years ago

I wonder if 5 or 6 is too low a value for the single char, especially given that most punctuation is a single char and has already been counted. The period at the end of the sentence counts for one and pairs of quotes and parentheses count as two.

BeckySharp commented 3 years ago

nice thought, this is one of the reasons I love working with you. I didn't even think about it. Maybe instead we should check for only _.isLetter and .length == 1 (if you get my drift)

On Wed, Jun 9, 2021 at 3:38 PM Keith Alcock @.***> wrote:

External Email

I wonder if 5 or 6 is too low a value for the single char, especially given that most punctuation is a single char and has already been counted. The period at the end of the sentence counts for one and pairs of quotes and parentheses count as two.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

kwalcock commented 3 years ago

I would be (a little) more comfortable with that, I guess. The previous sentence already has 6.

kwalcock commented 3 years ago

Quite a few tests are failing. The first one I checked had filtered out this sentence:

Crisis (IPC Phase 3) is widespread and
Emergency (IPC Phase 4) outcomes exist in parts of Unity,
Western Bahr el Ghazal,

This was even after the change to the single letter condition. The numbers probably need to be higher or the kinds of symbols different.

BeckySharp commented 3 years ago

OK, i'll look tomorrow, thanks for point this all out!

On Wed, Jun 9, 2021 at 10:45 PM Keith Alcock @.***> wrote:

External Email

Quite a few tests are failing. The first one I checked had filtered out this sentence:

Crisis (IPC Phase 3) is widespread and Emergency (IPC Phase 4) outcomes exist in parts of Unity, Western Bahr el Ghazal,

This was even after the change to the single letter condition. The numbers probably need to be higher or the kinds of symbols different.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

kwalcock commented 3 years ago

I could output the counts for each measurement and then look at their distribution for at least the tests so that they can be made to pass. It might be a good idea to check a real corpus eventually.

kwalcock commented 3 years ago

For unit test texts, here's a histogram for numNonAlpha:

count instances
0 2
1 7378
2 1108
3 733
4 443
5 322
6 480
7 330
8 155
9 52
10 53
11 9
12 20
13 10
14 16
15 1
16 17
17 0
18 13
19 1
20 0
21 1
22 0
23 1
24 2
25 0
26 0
27 0
28 0
29 1
30 0
31 0
32 0
33 2
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 1
47 0
48 0
49 0
50 0
51 0
52 0
53 0
54 0
55 0
56 0
57 0
58 0
59 0
60 0
61 0
62 0
63 0
64 2
65 0
kwalcock commented 3 years ago

Here it is for numSingleAlpha:

count instances
0 10384
1 666
2 92
3 8
4 1
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 2
16 0
kwalcock commented 3 years ago

The 33 and 15 are for the one sentence we're trying to avoid.

kwalcock commented 3 years ago

We know all the one-letter English words, don't we? I and a. Perhaps they should not be counted.

kwalcock commented 3 years ago

Maybe not count valid numbers and punctuation. For purposes of ruling out the single sentence, numSingleAlpha < 10 is sufficient. We're getting too many false fubars otherwise IMO.

BeckySharp commented 3 years ago

sounds fine to me

On Fri, Jun 11, 2021 at 11:07 AM Keith Alcock @.***> wrote:

External Email

Maybe not count valid numbers and punctuation. For purposes of ruling out the single sentence, numSingleAlpha < 10 is sufficient. We're getting too many false fubars otherwise IMO.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

kwalcock commented 3 years ago

This code isn't as elegant as that of @BeckySharp, but it does pass the tests.

kwalcock commented 3 years ago

Do you think this is OK, @BeckySharp? The numSingleAlpha from the table above looks like a more reliable indicator of a problem that causes the program to hang. Because there are some valid 1-letter words in English that I wanted to take into account, the code changed location a little.

kwalcock commented 3 years ago

Good, valid point. I flipped a coin. Luckily it's only about three lines of code. I imagined a table detector, graph detector, PDF problem detector, ways to combine them, plugins, and language specializations... that would eventually be easier if it was a class. For today, it's close enough for me.