DatabaseGroup / tree-similarity

Library for tree similarity algorithms and queries.
MIT License

Usage and binary question #32

Open hohilwik opened 1 month ago

hohilwik commented 1 month ago

I have been trying to understand how to use any single algorithm from the library in a separate file. Do I just include the specific header file? A few sentences of documentation would be greatly appreciated.

I also wanted to ask what the default algorithm is which is used by the ted binary.

Thanks for your great work on this library.

Edit: figured out that the default is APTED

mateuszpawlik commented 1 month ago

Thanks @hohilwik for your interest. I'm happy you like our work.

> I have been trying to understand how to use any single algorithm from the library in a separate file. Do I just include the specific header file? A few sentences of documentation would be greatly appreciated.

@remz1337 kindly ported our library to VCPKG. We have short instructions on how to use it. Once it is installed, you should be able to include the desired algorithm.

The algorithm file is not enough. You also need the parser, cost model, tree data structure, and the label. For example, to include the APTED algorithm in your code you need:

#include "node.h"
#include "string_label.h"
#include "unit_cost_model.h"
#include "bracket_notation_parser.h"
#include "apted_tree_index.h"

If needed, you can implement a custom label and a custom cost model.
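Since bracket_notation_parser.h is among the required headers, input trees are presumably passed as bracket-notation strings. The following is a minimal, library-independent sketch for sanity-checking such strings before handing them to the parser. It assumes the usual form in which each node is written as `{label`, followed by its children, followed by `}` (for example `{a{b}{c}}`). The name `count_bracket_nodes` is hypothetical and not part of tree-similarity, and escaped braces inside labels are not handled.

```cpp
#include <string>

// Hypothetical helper, NOT part of the tree-similarity library.
// Assumes bracket notation such as {a{b}{c}}: every '{' opens one node and
// the matching '}' closes it. Escaped braces inside labels are not handled.
// Returns the number of nodes, or -1 if the string is not exactly one
// balanced tree (unbalanced braces, text outside the root, or empty input).
long count_bracket_nodes(const std::string& tree) {
  long nodes = 0;
  long depth = 0;
  bool root_closed = false;
  for (char c : tree) {
    if (c == '{') {
      if (root_closed) return -1;  // a second root after the first one ended
      ++nodes;
      ++depth;
    } else if (c == '}') {
      if (depth == 0) return -1;   // closing brace without an open node
      --depth;
      if (depth == 0) root_closed = true;
    } else if (depth == 0) {
      return -1;                   // label text outside of any node
    }
  }
  if (depth != 0 || nodes == 0) return -1;
  return nodes;
}
```

Calling `count_bracket_nodes("{a{b}{c}}")` yields 3, while a truncated string such as `"{a{b}"` yields -1. A check like this can help filter malformed lines out of a large input file, but it is not a substitute for the library's own parser.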

You can also have a look at our CMakeLists.txt if you don't want to use VCPKG.

Let me know if you need more information.

thuetter commented 1 month ago

Thanks for your request @hohilwik!

You can also check out our work on JSON similarity. Our experimental evaluation (see https://github.com/DatabaseGroup/jedi-experiments) uses this repository as an external library and therefore provides a (hopefully simple) example of how to include and use the tree-similarity library without VCPKG.

hohilwik commented 1 month ago

Thanks a lot @mateuszpawlik @thuetter

I am actually using the library for an NLP task with millions of small trees (20-400 nodes), and everything else was too slow. I came across this library through your paper. Thanks a lot for releasing your code along with it.

For now, I have been modifying command_line/main.cc to give myself better functionality. I could send a pull request if you are interested in that. I mostly work with C and am not very familiar with C++ templates, so there is not much error handling. It just adds some functionality to the ted binary: choosing the algorithm via an argument, and a line-by-line mode that goes through many trees in two input files. I used /test/ted/test_ted.cc as a reference.

I am currently using touzet_kr_set_algorithm for this, but if you have suggestions for any other algorithm which might be better, that would also be great.

mateuszpawlik commented 1 month ago

A pull request would be great 🙂 I should warn you, though, that we don't have much time to work on this repository 😐