hierarchy analysis - Githubissues

yhamoudi commented 10 years ago

Files below are grouped in documentation folder (in PPP-NLP-classical)
Hierarchy_analysis.md: first proposition of prepocessing of the tree to limit the number of possible dependencies. The current Hierarchy analysis has been tested on about 50 examples (+ the examples of the manual)
Algorithm.md: proposition of overall algorithm
Review.md: my first comments on the relevance of Hierarchy_analysis (on other things...).

yhamoudi commented 10 years ago

A first implementation of step 5 of Algorithm.md has been done in parsetree_to_triple.py (after the big line of #######). you can test it using test4.py

yhamoudi commented 10 years ago

Implementation of step 1 of algorithm.

Some information (espcially errors) are displayed on stderr

Ezibenroc commented 10 years ago

I like your dependency collapsing function!

Ezibenroc commented 10 years ago

Before merging this branch, I would like you to write some basic tests for the new features, in tests/test_dependencytree.py for instance, or in a new file.

Ezibenroc commented 10 years ago

Do you think that it is still necessary to "merge words in capital letters that are neighbors in the tree"? They should be taged (by default as PERSON), and therefore already merged, no?

Ezibenroc commented 10 years ago

You restrict yourself to sentences with question word. Keep in mind that we should be abble to handle simple nominal questions (e.g. birth date of Tolkien).

yhamoudi commented 10 years ago

Unit tests

I will try to make them tomorrow.

Merge

About the merging: according to the (many) examples I have performed until now, merging is one of the main important step of preprocessing (and it's not efficient when it's performed after, using the dependencies). For example, in the current hierarchy, amod is not merged but it's relevant in some cases ('prime minister' for ex) and there is no way to know it from the dependency tree.

So the big question is: what should we merge during preprocessing? Some propositions are:

named entity recognition (done)
quotation (to do)
neighbors in capital letters : they are not always tagged (ex: What is the Great Barrier Reef?, Who is the Greek God of the Sea?)
more generally, we should find a way to merge compounds word (ex: managing director, prime minister)

For the last 2 points, I don't know what is the best way to do. Perhaps we should use an existing dictionnary (such as WordNet) to permorm these steps. If you have other ideas...

Question word

In fact, I think we should restrict ourselves to only one question word for the mid-term :) For example, "Who" questions. It will be difficult enough to take all our next week... (naturally we will study more cases for the final production)

Ezibenroc commented 10 years ago

Question word and Unit tests Ok!

Merging I will think about that...

yhamoudi commented 10 years ago

Hierarchy_analysis : num is no longer a general dependency
Hierarchy_review : Review of Hierarchy_analysis
General_review : Some comments
Triples_analysis : Here we are! Some rules to produce triples from tree. I'm only working with "Who"-questions for the moment (but a lot of these rules can apply to other question types).

yhamoudi commented 10 years ago

can we change the structure of the repository? I propose the following changes:

./ppp_nlp_classical becomes ./src
./tests becomes ./unittests
test1.py ... test4.py, TESTS.md go into ./tests
how can we divide properly parsetree_to_triple.py in many files? For example, can we put all preprocessing functions in a same file, the dependency analysis in another one, the triple production in another one...

Ezibenroc commented 10 years ago

Structure: First and second point: these are automatically created by the script for module creation. For consistency with other modules, I think that we should keep these names as is. Third point: ok (but need to find an other folder name, because of last point). Fourth point: the division you proposed seems good.

yhamoudi commented 10 years ago

I'm reading your code, but I have a problem with Who is the author of "Twenty Thousand Leagues Under the Sea"? for example

yhamoudi commented 10 years ago

oh, didn't seen your previous answer. Perhaps it would be relevant to change the names used by the script?

yhamoudi commented 10 years ago

ok, I think i've a found a first problem with your algo. All the nodes are merged with the first word of the quotation. It's not relevant for Who is the author of "Twenty Thousand Leagues Under the Sea"? because we want to merge everyone on the common root of the sub-tree (Leagues, see the tree with test3.py, and not Twenty). Here we want to keep the prep_ofrelation (and I think you destroy it so we loose the subtree)

Ezibenroc commented 10 years ago

Yes I was thinking about the same thing. I will rewrite the function.

yhamoudi commented 10 years ago

test1.py...test4.py renamed to demo1.py...demo4.py and placed in /demo folder

(didn't seen you have reorganized files in ppp_nlp_classical but i think the merging is ok)

I will test the new quotation merging function. Perhaps it will be time to merge the pull request after?

Ezibenroc commented 10 years ago

It is ok for me. Check the new quotation merging, and the reorganization. If it is also ok for you, you can merge.

yhamoudi commented 10 years ago

parse_tree_to_triple.py can be removed?

Ezibenroc commented 10 years ago

Sad thing with sentence When is the national day of China?, the word the is not deleted, because we firstly merge the nodes with same tag, and then delete useless words... Maybe we should do the same-tag merging after, no ?

yhamoudi commented 10 years ago

i think collapseDependency2 (in dependencyAnalysis) can be renamed to collapseDependency and quotation are no longer merged

yhamoudi commented 10 years ago

i have not understood why the is merged with national day

Ezibenroc commented 10 years ago

Yes you can rename collapseDependency. Why do you say "quotations are not merged"? It works for me... the is merged because of the namedEntityTag merging, which is performed before the dependency analysis (the, national and day all have the tag DATE)

yhamoudi commented 10 years ago

ok, it's because the is tagged DATE. I don't think it's a problem. If the modules that will process the triples (like wikidata module) are not able to remove a the it will be worst when there will be a non infinitive verb... Moreover, here it's a problem of NER, not merging

yhamoudi commented 10 years ago

displaygraph.sh doesn't like questions with quotation

Ezibenroc commented 10 years ago

Fine. You can rename collapseDependency2 and merge if you want. Concerning displaygraph.sh, you need to put a backslash before quotations: \"

yhamoudi commented 10 years ago

problem with Who wrote "1984"?

yhamoudi commented 10 years ago

the (one?) problem happens if there is only one word in the quotation. Indeed, quoteIndexToNode will not be fill in.

yhamoudi commented 10 years ago

bug has been fixed, i merge the pull request :)

ProjetPP / PPP-QuestionParsing-Grammatical

hierarchy analysis #9