Closed jiqicn closed 1 year ago
Hi @carschno, I resolved all the conversations above to keep things clean, but feel free to unresolve any of them if you think that is necessary.
Also, would you be able to have a look at the new tests added with commits 209966a73acce84daaedb9ccd5b845b4d2253c83 and a991b4b025807f1cc0cdf73f471f9ee54d985b55? Those are unit tests of the TopicModel class and some integration tests. Initially, I felt that it wasn't necessary to test some methods and the TopicModel class extensively, because most of them simply invoke third-party libraries and have very simple logic. In the end, however, I decided to include them to make the testing more comprehensive. Please don't hesitate to let me know if you don't have time for this task. Many thanks!
I have added some minor style ideas, but the tests look good to me! I think it is a good idea to add the tests at this stage, even if they do not seem strictly necessary now. Once they become necessary, it will be much more work to add them.
Many thanks @carschno!
I would prefer to merge and close this PR.
SonarCloud Quality Gate failed.
0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells
57.1% Coverage
0.0% Duplication
This PR is for refactoring the code in the chunker.py file and adding unit tests for some of the functions.
To review this PR, it would be nice to focus on the two functions `get_chunk` and `get_chunk_rank`. The logic of these two functions is implemented by myself. The rest of the functions and classes are either wrappers of existing third-party packages or very simple. For the same reason, my unit tests also focus on these two functions.

In general, I would like the reviewer to check whether the code follows best practices and whether the unit tests are sufficient. In particular, there are two things that can be improved from my viewpoint:

- The `fit_transform_reduced` function of the `TopicModel` class. The runtime performance of the `get_chunk_topic` function looks like this: [image]. After profiling, the bottleneck was found to be the `fit_transform_reduced` function. However, according to the documentation of the BERTopic package (here), there seems to be little we can do on our side.
- The `get_chunk` function. For now, the function is implemented in a very simple way, where headwords are detected as those whose dependency is a conjunction. As the tests show, it satisfies all the pre-defined cases we could think of, but there must be room for improvement here.
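To make the headword rule easier to discuss, here is a minimal sketch of the idea, not the actual code in chunker.py. It assumes tokens carry a spaCy-style dependency label and that "conjunction" means the `conj` label; the `Token` class, the `find_headwords` name, and the hand-labelled parse are all hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a parsed token (not the project's real type)."""
    text: str
    dep: str  # dependency label; "conj" is assumed to mark a conjunct


def find_headwords(tokens, conj_label="conj"):
    """Return the texts of tokens whose dependency label equals conj_label."""
    return [tok.text for tok in tokens if tok.dep == conj_label]


# Hand-labelled parse of "fresh apples and pears"; a real parser may differ.
tokens = [
    Token("fresh", "amod"),
    Token("apples", "ROOT"),
    Token("and", "cc"),
    Token("pears", "conj"),
]
print(find_headwords(tokens))
```

In this sketch only "pears" is selected, while the first conjunct "apples" is not, which is one kind of case where a label-only rule may need refinement. If the intended label is the coordinating conjunction itself, `conj_label` can be set accordingly.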