Open PipaFlores opened 1 week ago
It might be less dramatic than I portray it to be. In the coverage HTML file for _bertopic.py (html cov zip file below), most of the pipeline is green, and the _extract_topics() part is red (lines 479-510). While I have no clue how the red/blue lines are actually defined, it might be that only that part is not covered (it involves topic reduction (if needed) -> vectorizer/c-TF-IDF -> representation models).
Thanks for opening up a separate issue and detailing the underlying problem. Coverage is difficult with a modular framework, but this really should have been captured by the tests since it is core functionality. I have limited time currently (most of my time is spent answering issues and reviewing PRs) but might shift the balance in the future.
Feature request
As identified in PR #2191, the current unit tests do not cover the process of fitting a model. In other words, they do not test the implementation of fit_transform(). Consequently, current and future features that are performed at the fit_transform() level are not tested in any systematic way. We realized this when debugging topic reduction by declaring an nr_topics value for the model before fitting. However, this issue might involve all the core features, and most of the optional ones.
Currently, in conftest.py, the tests define the models and fit them for further testing in the other units. https://github.com/MaartenGr/BERTopic/blob/c3ec85dec8eeac704b30812dfed4ac8cd7d13561/tests/conftest.py#L50C1-L55C17
As such, some improvement is required for the tests to cover the fit_transform() method, the core of the library.
Motivation
This is required to systematically test the internal consistency of all features and the overall work pipeline.
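To make the request more concrete, here is a rough sketch of what a test at the fit_transform() level could check. It is only an illustration, not how the existing suite is organized: check_fit_output and StandInModel are hypothetical names, and the stand-in is there only so the snippet runs without BERTopic installed. The sketch assumes that fit_transform() returns the topic assignments as its first value and that -1 labels outliers.

```python
from typing import Optional, Sequence


def check_fit_output(topics: Sequence[int], n_docs: int,
                     nr_topics: Optional[int] = None) -> int:
    """Hypothetical helper: validate topic assignments from fit_transform().

    Returns the number of distinct non-outlier topics.
    """
    # Every document should receive exactly one topic assignment.
    assert len(topics) == n_docs, "expected one topic per document"
    # Assumption: -1 marks outlier documents and is not counted as a topic here.
    found = {t for t in topics if t != -1}
    if nr_topics is not None:
        # After topic reduction there should be no more topics than requested.
        assert len(found) <= nr_topics, "topic reduction did not reach nr_topics"
    return len(found)


class StandInModel:
    """Hypothetical stand-in for a topic model; not part of BERTopic."""

    def __init__(self, nr_topics: int):
        self.nr_topics = nr_topics

    def fit_transform(self, docs):
        # Assign documents to topics round-robin; second value is unused here.
        topics = [i % self.nr_topics for i in range(len(docs))]
        return topics, None


def test_fit_transform_respects_nr_topics():
    docs = [f"document {i}" for i in range(10)]
    # In the real test suite this would be the actual model,
    # e.g. BERTopic(nr_topics=3), fitted on a real corpus.
    model = StandInModel(nr_topics=3)
    topics, _ = model.fit_transform(docs)
    check_fit_output(topics, len(docs), nr_topics=3)
```

With pytest, the same check could be parametrized over the model fixtures from conftest.py, so that each model variant is fitted through fit_transform() inside a test rather than only in the fixture setup.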
Your contribution
I can't tackle this issue yet due to limited time, since I will need to familiarize myself more with the pytest framework first. I will come back to it in the future, but I am leaving the issue open as a reminder, and in case someone else is up for the challenge.