Functions to analyse and manipulate dependency trees, as well as conversions between GF and dependency trees. The main use case is UD (Universal Dependencies), but the code is designed to be completely generic with respect to the annotation scheme. This repository replaces the old gf-contrib/ud2gf code. It is also meant to be used by the 'vd' command of GF and to replace the supporting code in gf-core in the future.
The old code tries to apply every possible abstract GF function to every possible combination of GF trees it has found so far, removes any duplicate trees, and iterates this until it can no longer find any new trees.
This is needlessly slow for two reasons:
First issue
Since we keep the old trees around, we keep recreating the same trees in every iteration.
For example:
If we start with the word [small_A]
We can apply PositA to this and now we have these trees: [PositA small_A, small_A]
We can now apply CompAP to the first tree and PositA to the second tree. This generates these "new" trees: [CompAP (PositA small_A), PositA small_A]
However, we already had PositA small_A from the previous iteration, so we now need to deduplicate the list of trees, resulting in: [CompAP (PositA small_A), PositA small_A, small_A]
Now we can apply functions to each of these, and end up with three new trees, only one of which is actually new.
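The duplication above can be sketched with a toy version of the iteration. The tree type and the stand-in functions (PositA, CompAP applied via a single `applyFuns`) are illustrative, not the real GF API:

```haskell
import Data.List (nub)

-- Toy abstract syntax: a tree is a function name applied to subtrees.
data Tree = App String [Tree] deriving (Eq, Show)

-- Hypothetical stand-ins for the abstract GF functions:
-- PositA applies to the adjective, CompAP to the result of PositA.
applyFuns :: Tree -> [Tree]
applyFuns t@(App "small_A" []) = [App "PositA" [t]]
applyFuns t@(App "PositA" _)   = [App "CompAP" [t]]
applyFuns _                    = []

-- Naive iteration: apply every function to every tree found so far and
-- deduplicate. The old trees keep producing the same results each round.
step :: [Tree] -> [Tree]
step ts = nub (ts ++ concatMap applyFuns ts)

main :: IO ()
main = do
  let t0 = [App "small_A" []]
  print (length (step t0))         -- 2 trees after one round
  print (length (step (step t0)))  -- 3 trees; PositA small_A was rebuilt
```

In the second round, `applyFuns` is applied to small_A again, rebuilding PositA small_A only for `nub` to throw it away.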
This results in O(T*F*CT1*...*CTn) per iteration, where T is the current number of trees, F is the number of abstract functions available, and CTi is the number of possible trees for the i-th dependent. The number of iterations should correspond to the maximum depth of the final trees.
Solution
To solve this issue, all we need to do is to only build on top of new trees, since applying functions to the old trees will only produce trees we have already seen, so that work is always redundant.
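This is essentially semi-naive fixpoint iteration: keep a frontier of trees found in the previous round and expand only those. A minimal self-contained sketch, reusing the same hypothetical `applyFuns` stand-in for the abstract functions:

```haskell
import Data.List (nub)

data Tree = App String [Tree] deriving (Eq, Show)

-- Hypothetical stand-ins for the abstract GF functions, as in the text.
applyFuns :: Tree -> [Tree]
applyFuns t@(App "small_A" []) = [App "PositA" [t]]
applyFuns t@(App "PositA" _)   = [App "CompAP" [t]]
applyFuns _                    = []

-- Only the frontier (trees found in the previous round) is expanded;
-- trees in `seen` are kept but never re-expanded, so each tree is
-- built at most once.
saturate :: [Tree] -> [Tree]
saturate = go []
  where
    go seen []       = seen
    go seen frontier =
      let seen' = seen ++ frontier
          fresh = nub [ t | t <- concatMap applyFuns frontier
                          , t `notElem` seen' ]
      in  go seen' fresh

main :: IO ()
main = print (length (saturate [App "small_A" []]))  -- 3
```

Each round now touches only the frontier, so no tree is ever rebuilt from an already-expanded input.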
Second issue
We have now removed most of the duplicate work related to the head of the devtree, but the algorithm still generates a lot of duplicate trees. The reason is that, before trying to match against a function, all possible combinations of head tree and dependent trees were computed, even if the function in question only takes one or two arguments.
Solution
Instead of first expanding all possible combinations of GF trees, we keep the tree shape of the devtree; for each function we first choose a valid head tree and then pick only as many dependent trees as the function needs, and only if valid trees were found for the previous arguments. This can be expressed neatly with a list comprehension.
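A sketch of that list comprehension, under illustrative assumptions: the types, the head-first argument order of `Fun`, and the function name AdjCN are all hypothetical here, not the real ud2gf or RGL definitions:

```haskell
type Cat = String
data Tree = App String [Tree] deriving (Eq, Show)

-- A function name and its argument categories; that the head argument
-- comes first is a convention of this sketch.
data Fun = Fun String [Cat]

-- headCands: candidate (category, tree) pairs for the head word;
-- depCandss: one candidate list per dependent of the devtree node.
-- The comprehension commits to a head tree first, then to one tree per
-- required argument slot; dependents beyond the function's arity are
-- never enumerated, and a slot with no matching candidate prunes the
-- branch before later slots are explored.
applyFun :: Fun -> [(Cat, Tree)] -> [[(Cat, Tree)]] -> [Tree]
applyFun (Fun name (headCat : depCats)) headCands depCandss =
  [ App name (h : args)
  | (hc, h) <- headCands, hc == headCat
  , args    <- pick depCats depCandss ]
  where
    pick []       _             = [[]]
    pick _        []            = []
    pick (c : cs) (cands : css) =
      [ t : ts | (tc, t) <- cands, tc == c, ts <- pick cs css ]
applyFun _ _ _ = []

main :: IO ()
main = do
  let headCands = [("CN", App "dog_N" [])]
      depCandss = [[("AP", App "PositA" [App "small_A" []])]]
      -- Hypothetical modifier function taking a CN head and an AP dependent.
      adjCN     = Fun "AdjCN" ["CN", "AP"]
  mapM_ print (applyFun adjCN headCands depCandss)
```

Because list comprehensions are lazy and nested left to right, a head candidate of the wrong category, or an empty candidate list for an early argument, cuts off the whole branch without ever materialising the full cross product.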
Deduplication
Now, since we no longer produce duplicates, we can remove the deduplication code. This has however not yet been done, out of my cowardice.
There is however still one way to produce duplicates: if the initial PGF.parse on the individual words generated multiple trees, where one is a subtree of another, we may generate the larger tree again.
One idea that might solve this issue, and might also provide other advantages, is to use morphoanalyze instead of parse, so we only get lexical or syncategorematic entries. That would however require reworking the algorithm a bit more.
Performance
With both fixes
With these changes, upto12eng.conllu takes 17.5 seconds for ud2gf on my machine:
$ stack run gf-ud ud2gf grammars/ShallowParse Eng Text -- +RTS -s < upto12eng.conllu
...
   8,316,725,552 bytes allocated in the heap
  15,835,906,496 bytes copied during GC
     222,330,424 bytes maximum residency (36 sample(s))
       3,737,472 bytes maximum slop
             212 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7855 colls,     0 par    8.971s   9.107s     0.0012s    0.0037s
  Gen  1        36 colls,     0 par    5.383s   5.702s     0.1584s    0.2869s

  INIT    time    0.000s  (  0.004s elapsed)
  MUT     time    3.299s  (  3.360s elapsed)
  GC      time   14.354s  ( 14.809s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time   17.654s  ( 18.182s elapsed)

  %GC     time       0.0%  (0.0% elapsed)
With only one fix
With only the first half of this change, it takes around 15-30 minutes to complete:
$ stack run gf-ud ud2gf grammars/ShallowParse Eng Text -- +RTS -s < upto12eng.conllu
...
  1,065,219,555,280 bytes allocated in the heap
     27,001,240,040 bytes copied during GC
        350,466,472 bytes maximum residency (59 sample(s))
          4,709,976 bytes maximum slop
                334 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0   1020317 colls,     0 par   36.792s  39.373s     0.0000s    0.0307s
  Gen  1        59 colls,     0 par   18.062s  19.380s     0.3285s    1.1113s

  INIT    time    0.000s  (  0.004s elapsed)
  MUT     time 1012.890s  (7124.381s elapsed)
  GC      time   54.855s  ( 58.753s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time 1067.745s  (7183.142s elapsed)
Original performance
With neither of the optimizations, it takes longer than I am willing to wait.