elshize / irkit

Information Retrieval tools intended for academic research.
https://elshize.github.io/irkit/
MIT License

irk-score fails with memory corruption #49

Closed elshize closed 5 years ago

elshize commented 5 years ago

Index

moa:/data/index/irkit/cw09b-nospam

Command

~/irkit/build/bin/irk-score

Log

[2018-11-05 17:30:36.058] [score] [info] Initiating scoring using 8 threads
[2018-11-05 17:30:36.810] [score] [info] Calculating max score
[2018-11-05 17:45:45.419] [score] [info] Max score: 8.48841e+07; Min score: -1.09753e+07
malloc(): memory corruption
Aborted (core dumped)
elshize commented 5 years ago

Unexpectedly (?), many scores are negative:

$ ~/irkit/build/bin/irk-postings obama --score *bm25 | head
570 -0.163575
662 -0.592886
663 -0.652508
664 -0.858155
665 -0.865461
666 -0.860576
667 -0.812432
668 -0.865461
669 -0.841578
elshize commented 5 years ago

It seems to be a problem with building the index instead: the average document size is negative!

{
    "avg_document_size": -21.23560605242698,
    "documents": 37512555,
    "max_document_size": 219400,
    "occurrences": 29268169232,
    "skip_block_size": 64
}
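
A quick consistency check on these numbers, as a sketch only. It assumes that the total of all document sizes equals the occurrences field and that the total was accumulated in a 32-bit signed integer; both function names are hypothetical and not part of irkit. Under those assumptions, 29268169232 wraps to -796601840, and dividing by 37512555 documents gives about -21.2356, which matches the reported avg_document_size.

```cpp
#include <cstdint>

// Hypothetical helper: truncate a 64-bit total to 32 bits, mimicking what
// happens when the running sum is held in a 32-bit int. The conversion is
// modulo 2^32 (guaranteed since C++20, and the behaviour of mainstream
// compilers before that).
std::int32_t wrap_to_int32(std::int64_t total)
{
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(total));
}

// Hypothetical helper: the average document size one would obtain if the
// sum had wrapped around before the division.
double average_from_wrapped_sum(std::int64_t total, std::int64_t documents)
{
    return static_cast<double>(wrap_to_int32(total))
        / static_cast<double>(documents);
}
```

Calling average_from_wrapped_sum(29268169232, 37512555) reproduces the negative average, which points at a 32-bit overflow somewhere in index building rather than at the scorer.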
elshize commented 5 years ago

The problem is most likely with:

int64_t sum_doc_size = std::reduce(
    std::execution::par_unseq, sizes.begin(), sizes.end(), 0);

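The literal 0 is an int, so std::reduce accumulates and returns an int (typically 32-bit), and the sum overflows before it is ever stored in sum_doc_size. The initial value must be 64-bit to avoid overflow.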
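
One way the fix could look, as a minimal sketch rather than the actual patch. The function name and the 32-bit element type of sizes are assumptions. The sketch omits the execution policy to stay dependency-free; the policy overload deduces its result type from the initial value in the same way. Because std::reduce is allowed to combine two elements with each other, the binary operation also widens both operands explicitly.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper; `sizes` stands in for the per-document size vector,
// assumed here to hold 32-bit integers.
std::int64_t sum_document_sizes(const std::vector<std::int32_t>& sizes)
{
    return std::reduce(
        sizes.begin(),
        sizes.end(),
        std::int64_t{0},  // 64-bit initial value, so the result type is 64-bit
        // Widen both operands: std::reduce may add two elements directly,
        // and int + int would still be computed in 32 bits.
        [](std::int64_t lhs, std::int64_t rhs) { return lhs + rhs; });
}
```

With document sizes as small as the ones in this index, the 64-bit initial value alone would already avoid the overflow; the widening lambda only guards against element pairs whose sum does not fit in 32 bits.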

elshize commented 5 years ago

This is a good opportunity to take care of #14

elshize commented 5 years ago

Suspicion confirmed.