Wordseer / wordseer

The WordSeer text analysis tool, written in Flask.
http://wordseer.berkeley.edu/

Dependency inconsistencies #125

Closed: keien closed this issue 10 years ago

keien commented 10 years ago

I'm running some basic tests on the personal ads dataset that Aditi sent me, and I found an interesting inconsistency in the dependencies. When I queried the SQL dump directly, I got these as the first few dependencies:

+----------+-------------+--------+--------+--------------+---------------------------------+------------+---------+-----------+---------+-----------+-----------+-------+-------------+
| id       | relation_id | gov_id | dep_id | relationship | string                          | gov        | gov_pos | dep       | dep_pos | dep_count | gov_count | count | information |
+----------+-------------+--------+--------+--------------+---------------------------------+------------+---------+-----------+---------+-----------+-----------+-------+-------------+
|    11312 |           1 |      3 |      2 | det          | det(lot/NN, a/DT)               | lot        | NN      | a         | DT      |        -1 |        -1 |    -1 |           0 |
|    21514 |           2 |      5 |      4 | aux          | aux(open/VB, to/TO)             | open       | VB      | to        | TO      |        -1 |        -1 |    -1 |           0 |
|    31716 |           3 |      7 |      6 | poss         | poss(ad/NN, my/PRP$)            | ad         | NN      | my        | PRP$    |        -1 |        -1 |    -1 |           0 |
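|    41517 |           4 |      5 |      7 | dobj         | dobj(open/VB, ad/NN)            | open       | VB      | ad        | NN      |        -1 |        -1 |    -1 |           0 |
+----------+-------------+--------+--------+--------------+---------------------------------+------------+---------+-----------+---------+-----------+-----------+-------+-------------+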

from the sentence, "Thanks a lot to open my ad. " (it's the first sentence of the first ad document).

Now, when I parse the sentence directly using raw_parse, I get this:

[('root', 'ROOT', '0', 'Thanks', '1'), ('det', 'lot', '3', 'a', '2'), ('dep', 'Thanks', '1', 'lot', '3'), ('aux', 'open', '5', 'to', '4'), ('infmod', 'Thanks', '1', 'open', '5'), ('poss', 'ad', '7', 'my', '6'), ('dobj', 'open', '5', 'ad', '7')]

I know the root dependency gets removed, but that still leaves two extra dependencies that the original doesn't have. Do you know why that is?

abendebury commented 10 years ago

Yes.

The original, like ours, has code to remove dependencies that involve ROOT. ROOT has an index of 0. The original code had an off-by-one error: it removed every dependency in which either word had an index of 1 or less, instead of 0 or less.

So, in this case, the original would have removed the first, third, and fifth dependencies, because each of them involves "Thanks" at index 1. The third and the fifth are the ones missing from the database query.
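
To make the difference concrete, here is a minimal sketch of the two filters applied to the raw_parse output quoted above. It is not the actual WordSeer code: the function name `drop_root_dependencies` and its `max_dropped_index` parameter are hypothetical, chosen only to illustrate the off-by-one behavior described here.

```python
# Dependencies as reported by raw_parse earlier in this thread:
# (relation, governor, governor_index, dependent, dependent_index)
parsed = [
    ('root', 'ROOT', '0', 'Thanks', '1'),
    ('det', 'lot', '3', 'a', '2'),
    ('dep', 'Thanks', '1', 'lot', '3'),
    ('aux', 'open', '5', 'to', '4'),
    ('infmod', 'Thanks', '1', 'open', '5'),
    ('poss', 'ad', '7', 'my', '6'),
    ('dobj', 'open', '5', 'ad', '7'),
]


def drop_root_dependencies(dependencies, max_dropped_index=0):
    """Hypothetical filter: drop any dependency in which either word
    has an index <= max_dropped_index.

    max_dropped_index=0 drops only dependencies that touch ROOT.
    max_dropped_index=1 mimics the off-by-one behavior and also drops
    anything that touches the first word of the sentence.
    """
    kept = []
    for relation, gov, gov_index, dep, dep_index in dependencies:
        if min(int(gov_index), int(dep_index)) <= max_dropped_index:
            continue
        kept.append((relation, gov, gov_index, dep, dep_index))
    return kept


intended = drop_root_dependencies(parsed, max_dropped_index=0)
off_by_one = drop_root_dependencies(parsed, max_dropped_index=1)

print([d[0] for d in intended])
print([d[0] for d in off_by_one])
```

With the threshold at 1, only the det, aux, poss, and dobj relations survive, which matches the four rows in the SQL dump; with the threshold at 0, the dep and infmod relations are kept as well.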

keien commented 10 years ago

Got it, thanks for confirming.