JetBrains-Research / pubtrends

Scientific literature explorer. Runs a PubMed or Semantic Scholar search and lets the user explore the high-level structure of the resulting papers
Apache License 2.0

Division by zero #293

Open olegs opened 2 years ago

olegs commented 2 years ago

To reproduce, use the predefined "brain computer interface" search from PubMed.

[2021-10-14 08:34:35,747: INFO/ForkPoolWorker-1] Generating evolution topics descriptions
[2021-10-14 08:34:35,833: WARNING/ForkPoolWorker-1] /home/user/pysrc/papers/analysis/topics.py:116: RuntimeWarning: invalid value encountered in true_divide
  tokens_freqs_per_comp = tokens_freqs_per_comp / tokens_freqs_norm
[2021-10-14 08:34:35,833: WARNING/ForkPoolWorker-1] /home/user/pysrc/papers/analysis/topics.py:123: RuntimeWarning: divide by zero encountered in log
  adjusted_distance = distance.T * np.log(tokens_freqs_total)
olegs commented 2 years ago

@ctrltz is it possible to use np.log1p to avoid this problem?
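For reference, a minimal sketch of how the two functions differ at zero. The `tokens_freqs_total` array here is made up for illustration and is not taken from pubtrends. Note that `np.log1p(x)` computes `log(1 + x)`, so it removes the `-inf` at zero but also shifts every other value.

```python
import numpy as np

# Hypothetical total token frequencies; the zero entry is what triggers
# "divide by zero encountered in log".
tokens_freqs_total = np.array([0.0, 1.0, 9.0])

# Silence the RuntimeWarning only for this demonstration.
with np.errstate(divide="ignore"):
    plain_log = np.log(tokens_freqs_total)  # first entry is -inf

# log(1 + x) is finite for every non-negative x, including 0.
shifted_log = np.log1p(tokens_freqs_total)

print(plain_log)
print(shifted_log)
```

Because the values change, swapping in `np.log1p` is a stabilizing workaround rather than an exact replacement for `np.log`.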

ctrltz commented 2 years ago

Sure, but if tokens_freqs_total equals 0, I think it means that the whole corpus_counts contains only zeros, so one might also handle this case explicitly, like:

if not corpus_counts.sum():
    return ...  # placeholder: return empty descriptions here

I did not keep evolution in mind when working on the topics description; thanks for pointing it out.

olegs commented 2 years ago

Also, tokens_freqs_norm may be zero. What is the correct fix for this?

ctrltz commented 2 years ago

As far as I understand, it means that some of the components have no corpus terms to be analyzed, so it would be correct to return an empty description for the respective components.
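One possible sketch of that idea: divide only where the norm is non-zero and report which components are empty. The helper name `normalize_components`, the one-row-per-component layout, and the use of an L2 norm are all assumptions for illustration, not the actual code in topics.py.

```python
import numpy as np


def normalize_components(tokens_freqs_per_comp):
    """Row-normalize frequencies, leaving all-zero components as zeros.

    Hypothetical helper: assumes one row per component and an L2 norm.
    Returns the normalized array and a boolean mask of empty components.
    """
    freqs = np.asarray(tokens_freqs_per_comp, dtype=float)
    tokens_freqs_norm = np.sqrt((freqs ** 2).sum(axis=1, keepdims=True))
    nonzero = tokens_freqs_norm > 0  # shape (n_components, 1)
    # Division is skipped where the norm is zero, so no NaN and no warning.
    normalized = np.divide(
        freqs, tokens_freqs_norm, out=np.zeros_like(freqs), where=nonzero
    )
    return normalized, ~nonzero.ravel()


freqs = np.array([[3.0, 4.0], [0.0, 0.0]])
normalized, empty = normalize_components(freqs)
print(normalized)
print(empty)
```

Components flagged in `empty` could then be given an empty description instead of being passed on to the rest of the computation.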

It might be simpler to plug in np.log1p for now to ensure stability, and I can think about it a bit more in the coming days.

NB: I have also fixed the previous comment in case you have used it already.