Closed yhamoudi closed 10 years ago
A first implementation of step 5 of Algorithm.md has been done in parsetree_to_triple.py (after the big line of #######). you can test it using test4.py
Implementation of step 1 of algorithm.
Some information (espcially errors) are displayed on stderr
I like your dependency collapsing function!
Before merging this branch, I would like you to write some basic tests for the new features, in tests/test_dependencytree.py
for instance, or in a new file.
Do you think that it is still necessary to "merge words in capital letters that are neighbors in the tree"? They should be taged (by default as PERSON
), and therefore already merged, no?
You restrict yourself to sentences with question word. Keep in mind that we should be abble to handle simple nominal questions (e.g. birth date of Tolkien
).
Unit tests
I will try to make them tomorrow.
Merge
About the merging: according to the (many) examples I have performed until now, merging is one of the main important step of preprocessing (and it's not efficient when it's performed after, using the dependencies). For example, in the current hierarchy, amod is not merged but it's relevant in some cases ('prime minister' for ex) and there is no way to know it from the dependency tree.
So the big question is: what should we merge during preprocessing? Some propositions are:
For the last 2 points, I don't know what is the best way to do. Perhaps we should use an existing dictionnary (such as WordNet) to permorm these steps. If you have other ideas...
Question word
In fact, I think we should restrict ourselves to only one question word for the mid-term :) For example, "Who" questions. It will be difficult enough to take all our next week... (naturally we will study more cases for the final production)
Question word and Unit tests Ok!
Merging I will think about that...
can we change the structure of the repository? I propose the following changes:
Structure: First and second point: these are automatically created by the script for module creation. For consistency with other modules, I think that we should keep these names as is. Third point: ok (but need to find an other folder name, because of last point). Fourth point: the division you proposed seems good.
I'm reading your code, but I have a problem with Who is the author of "Twenty Thousand Leagues Under the Sea"?
for example
oh, didn't seen your previous answer. Perhaps it would be relevant to change the names used by the script?
ok, I think i've a found a first problem with your algo. All the nodes are merged with the first word of the quotation. It's not relevant for Who is the author of "Twenty Thousand Leagues Under the Sea"?
because we want to merge everyone on the common root of the sub-tree (Leagues
, see the tree with test3.py, and not Twenty
). Here we want to keep the prep_of
relation (and I think you destroy it so we loose the subtree)
Yes I was thinking about the same thing. I will rewrite the function.
test1.py...test4.py renamed to demo1.py...demo4.py and placed in /demo folder
(didn't seen you have reorganized files in ppp_nlp_classical but i think the merging is ok)
I will test the new quotation merging function. Perhaps it will be time to merge the pull request after?
It is ok for me. Check the new quotation merging, and the reorganization. If it is also ok for you, you can merge.
parse_tree_to_triple.py can be removed?
Sad thing with sentence When is the national day of China?
, the word the
is not deleted, because we firstly merge the nodes with same tag, and then delete useless words...
Maybe we should do the same-tag merging after, no ?
i think collapseDependency2 (in dependencyAnalysis) can be renamed to collapseDependency and quotation are no longer merged
i have not understood why the
is merged with national day
Yes you can rename collapseDependency. Why do you say "quotations are not merged"? It works for me...
the
is merged because of the namedEntityTag merging, which is performed before the dependency analysis (the
, national
and day
all have the tag DATE
)
ok, it's because the
is tagged DATE
. I don't think it's a problem. If the modules that will process the triples (like wikidata module) are not able to remove a the
it will be worst when there will be a non infinitive verb... Moreover, here it's a problem of NER, not merging
displaygraph.sh doesn't like questions with quotation
Fine. You can rename collapseDependency2 and merge if you want. Concerning displaygraph.sh, you need to put a backslash before quotations: \"
problem with Who wrote "1984"?
the (one?) problem happens if there is only one word in the quotation. Indeed, quoteIndexToNode will not be fill in.
bug has been fixed, i merge the pull request :)