JMMackenzie / IOQP

Impact Ordered Query Processing
Apache License 2.0

Incorrect Document Identifiers #3

Closed JMMackenzie closed 3 years ago

JMMackenzie commented 3 years ago

Query processing does not return the expected document identifiers.

For example, consider 1087114:varicose complications what vein

We expect the results to be (note I have mapped these back to raw docids, not "trec" docids):

1087114 Q0 7104604 1 481 --> DOCID:1413
1087114 Q0 3039969 2 437 --> DOCID:46631
1087114 Q0 7279404 3 436 --> DOCID:1229
1087114 Q0 2795191 4 427 --> DOCID:1396
1087114 Q0 7278731 5 424 --> DOCID:6786

However, we get back:

1087114 Q0 412 1 481 ioqp
1087114 Q0 45630 2 437 ioqp
1087114 Q0 228 3 436 ioqp
1087114 Q0 395 4 427 ioqp
1087114 Q0 5785 5 424 ioqp

This happens with both determine_topk_chunks() and determine_topk(), and both uncompressed and compressed indexes.

Note that the docids are all off by 1001 for this specific query.
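
A quick way to check the offset is to subtract each returned docid from the expected one. This is only a sketch over the five pairs listed above; the helper name `offsets` is made up for illustration and is not part of IOQP.

```rust
/// Difference between each expected docid and the docid actually returned.
/// Hypothetical helper, not part of IOQP.
fn offsets(expected: &[u32], returned: &[u32]) -> Vec<u32> {
    expected.iter().zip(returned).map(|(e, r)| e - r).collect()
}

fn main() {
    // Raw docids from the expected run and from the ioqp run for query 1087114.
    let expected = [1413, 46631, 1229, 1396, 6786];
    let returned = [412, 45630, 228, 395, 5785];
    // Every pair differs by the same constant.
    println!("{:?}", offsets(&expected, &returned)); // [1001, 1001, 1001, 1001, 1001]
}
```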

Another example (1000000:real come where insulin): all docids are again off by 1001 here, although the two score-300 documents at ranks 5 and 6 come back in swapped order.

1000000 Q0 3639919 1 309 --> DOCID:867793
1000000 Q0 1298148 2 301 --> DOCID:867921
1000000 Q0 3782789 3 301 --> DOCID:867911
1000000 Q0 2584019 4 300 --> DOCID:902944
1000000 Q0 784657 5 300 --> DOCID:832423
1000000 Q0 1975289 6 300 --> DOCID:832419

However, we get back:

1000000 Q0 866792 1 309 ioqp
1000000 Q0 866920 2 301 ioqp
1000000 Q0 866910 3 301 ioqp
1000000 Q0 901943 4 300 ioqp
1000000 Q0 831418 5 300 ioqp
1000000 Q0 831422 6 300 ioqp

So where is this 1001 coming from? Are we accidentally "compacting" empty documents somewhere? Or perhaps a bug when we're reading from CIFF?

We are also sometimes getting duplicate document identifiers back. Something may not be quite right with the heap functionality, though that should perhaps be a separate issue.

JMMackenzie commented 3 years ago

I can also confirm that document identifiers below 1001 seem to be correct.

For example: 1079086 Q0 504 18 358 ioqp matches the expected 1079086 Q0 3660777 19 358 --> DOCID:504

682365 Q0 992 30 265 ioqp matches the expected 682365 Q0 6216320 29 265 --> DOCID:992

JMMackenzie commented 3 years ago

This was fixed in e3ac99. The cause was that

self.accumulators[k..]
             .iter()
             .enumerate()
             .for_each(|(doc_id, &score)| {

does not yield the index into the full accumulator slice, but a local counter that restarts at 0, because enumerate counts positions within the iterator over the sub-slice. So we assumed the first identifier out of this loop was k, but it was actually 0.
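
For anyone hitting the same thing, here is a minimal standalone sketch of the behaviour. It is not the code from e3ac99: the function names and the `u16` score type are assumptions made for illustration, and enumerating before skipping is just one possible correction (adding `k` to the local index would work too).

```rust
/// Buggy pattern: `enumerate` counts from 0 over the sub-slice, so the
/// yielded index is local to `accumulators[k..]`, not a document identifier.
fn local_indices(accumulators: &[u16], k: usize) -> Vec<(usize, u16)> {
    accumulators[k..]
        .iter()
        .enumerate()
        .map(|(doc_id, &score)| (doc_id, score))
        .collect()
}

/// One possible correction: enumerate the whole slice, then skip the first
/// `k` entries, so each index is the position in the full accumulator table.
fn global_indices(accumulators: &[u16], k: usize) -> Vec<(usize, u16)> {
    accumulators
        .iter()
        .enumerate()
        .skip(k)
        .map(|(doc_id, &score)| (doc_id, score))
        .collect()
}

fn main() {
    let accumulators: Vec<u16> = vec![10, 20, 30, 40, 50];
    let k = 2;
    // Local counter: the first index is 0 even though the slice starts at k.
    println!("{:?}", local_indices(&accumulators, k)); // [(0, 30), (1, 40), (2, 50)]
    // Global index: the first index is k.
    println!("{:?}", global_indices(&accumulators, k)); // [(2, 30), (3, 40), (4, 50)]
}
```

With the buggy pattern every identifier produced by the loop is too small by exactly `k`, which is consistent with the constant offset seen in the runs above.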