Chunker module: refactor and test

jiqicn commented 1 year ago

This PR is for refactoring the code in the chunker.py file and adding unit tests for some of the functions.

To review this PR, it would be nice to focus on the two functions, get_chunk and get_chunk_rank. For these two functions, logic is implemented by myself. The rest of the functions and classes are either wrappers of existing third-party packages or very simple. For the same reason, my unit tests are also focused on the two functions mentioned above.

In general, I would like the reviewer to have a look at the code to check if it follows the best practices, and if the unit tests are sufficient. In particular, there are two things that can be improved from my viewpoint:

The performance of the fit_transform_reduced function of the TopicModel class. The runtime performance of the get_chunk_topic function looks like this:

After profiling, the bottleneck here is found to be the fit_transform_reduced function. However, according to the documentation of the BERTopic package (here), there seems to be little to do from our side.

Dependency parsing in the get_chunk function. For now, the function is implemented in a very simple way, where headwords are detected as those whose dependency is a conjunction. As tests show, it satisfied all the pre-defined cases that we can think about, but there must be some improvement here.

jiqicn commented 1 year ago

Hi @carschno, I resolved all the conversations above to make things clean, but feel free to unresolved any of them if you think that is necessary.

Also, would you be able to have a look at the new tests added with commits 209966a73acce84daaedb9ccd5b845b4d2253c83 and a991b4b025807f1cc0cdf73f471f9ee54d985b55? Those are unit tests of the TopicModel class and some integration tests. Before, I felt that it wasn't necessary to test some methods and the TopicMode class extensively because most of them simply invoke third-party libraries and have very simple logic. However, in the end, I decided to include them to make the testing more comprehensive. Please don't hesitate to let me know if you won't have the time for this task. Many thanks!

carschno commented 1 year ago

Also, would you be able to have a look at the new tests added with commits 209966a and a991b4b? Those are unit tests of the TopicModel class and some integration tests. Before, I felt that it wasn't necessary to test some methods and the TopicMode class extensively because most of them simply invoke third-party libraries and have very simple logic. However, in the end, I decided to include them to make the testing more comprehensive. Please don't hesitate to let me know if you won't have the time for this task. Many thanks!

I have added some minor style ideas, but they look good to me! I think it is a good idea to add the tests in this stage, even they do not seem strictly necessary now. When they become necessary, it will be much more work to add the tests.

jiqicn commented 1 year ago

Also, would you be able to have a look at the new tests added with commits 209966a and a991b4b? Those are unit tests of the TopicModel class and some integration tests. Before, I felt that it wasn't necessary to test some methods and the TopicMode class extensively because most of them simply invoke third-party libraries and have very simple logic. However, in the end, I decided to include them to make the testing more comprehensive. Please don't hesitate to let me know if you won't have the time for this task. Many thanks!

I have added some minor style ideas, but they look good to me! I think it is a good idea to add the tests in this stage, even they do not seem strictly necessary now. When they become necessary, it will be much more work to add the tests.

Many thanks @carschno!

jiqicn commented 1 year ago

I would prefer to merge and close this PR.