anhaidgroup / py_entitymatching

BSD 3-Clause "New" or "Revised" License

Utilising the underlying dask machinery #122

Closed habospace closed 5 years ago

habospace commented 5 years ago

It looks like many of the computationally intensive tasks (blocking mechanisms, feature value extraction) have been reimplemented using dask (in the dask submodule), but they're not available when installing the package via pip. Have they been omitted deliberately due to the lack of testing (as suggested by the docstrings of the functions), or is there a way (for instance, an installation flag) to install the dask utilities?

pjmartinkus commented 5 years ago

At the moment, these functions have not been properly tested, which is why they are not directly callable in the current version of the code: none of these commands are imported in the __init__.py file, so they cannot be called directly.

However, there is a workaround if you install the code from the source distribution. First clone the repository; I'd recommend cloning the latest release branch, called rel_0_3_2. Then add an import line to the __init__.py file for each command you want to be able to use. The example below exposes the dask attribute equivalence blocker and the dask decision tree matcher:

# dask functions
from py_entitymatching.dask.dask_attr_equiv_blocker import DaskAttrEquivalenceBlocker
from py_entitymatching.dask.dask_dtmatcher import DaskDTMatcher

Once these lines have been added to the __init__.py file, you can install the package from your edited source distribution. Go to the main directory, where the setup.py file is located. I'd recommend first installing the requirements from the requirements.txt file. Note that the dask command code has some dependencies that are not listed there, so you will need to install those as well; additionally, you will need cython to build the package:

pip install -r requirements.txt
pip install dask cython toolz

After this is done, you can install the package. The first line below builds the cython files and the second installs py_entitymatching:

python setup.py build_ext --inplace
python setup.py install

Now that everything is installed, you should be able to use the dask commands you imported in the __init__.py file earlier. For example, to use the dask attribute equivalence blocker:

import py_entitymatching as em
A = em.read_csv_metadata('./py_entitymatching/tests/test_datasets/A.csv', key='ID')
B = em.read_csv_metadata('./py_entitymatching/tests/test_datasets/B.csv', key='ID')
dask_ab = em.DaskAttrEquivalenceBlocker()
C = dask_ab.block_tables(A, B, 'zipcode', 'zipcode')

I hope this helps. If anything here is unclear or you have any remaining questions, please ask and I will try to clear things up as best that I can.

habospace commented 5 years ago

Thank you,

I followed your instructions and finally managed to install the dask modules. Although you already said that those dask modules haven't been properly tested yet (and you may well be aware of what I am about to say), I didn't see any performance improvement over the original modules when testing the dask versions of the same modules.

pjmartinkus commented 5 years ago

Hi, I just wanted to follow up on this to provide a little more explanation about the dask versions of the commands. They generally require specific arguments to be set in order to actually use the dask aspects of the command.

For example, take the dask version of the attribute equivalence blocker (DaskAttrEquivalenceBlocker). When calling the block_tables command, you must set the arguments n_ltable_chunks and n_rtable_chunks, because the default value for each is 1, which behaves exactly like the regular attribute equivalence blocker. If you set these arguments to something else, say 4, the command will split each table into 4 chunks and run the blocker on each pair of chunks in parallel using dask. The other dask commands generally have similar arguments that specify how the task should be split up to run in parallel.

Another important aspect regarding the effectiveness of the dask commands is that they only pay off once the tables are large enough to warrant the extra overhead of splitting the tables and running tasks in parallel. On smaller tables, the regular versions of the commands will be faster than the dask versions, even when the arguments are set properly.

You may already be fully aware of all this, but due to the lack of documentation about these commands, we thought it would be best to provide some extra explanation about how they work here in case you or others find it helpful.