Re-evaluate LLM approach on value correctness

Results from the new benchmark comparing actual min/typ/max field values:

num *EQUAL* *VALUES*:
                                           total     
                           tabular_parse:  278 (100%)   37   35    9   31   29    1   39    7    6   25   30   29
    ocr_text2_claude-3-5-sonnet-20240620:  236 ( 85%)   30   31    6   29   21    0   34    7    6   20   26   26
                       ocr_text2_llama-3:  180 ( 65%)   24   21    8   21   22    0   26    5    5   15   16   17
                       text2_gpt-4o-mini:  157 ( 56%)   20   25    2   18   16    0   23    1    1    8   21   22
                   ocr_text2_gpt-4o-mini:  149 ( 54%)   23   21    3   22   14    0   15    5    3   10   16   17

tabular_parse is the reference: it uses no LLM and has been carefully hand-crafted, so we can assume that most of its values are correct.

A wrongly extracted value is much worse than a missing one. A missing value yields NaN power values, which are easy to spot, whereas a wrong value silently produces a plausible-looking but incorrect result in the power calc.
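
To make the distinction concrete, here is a minimal sketch of how extracted values could be classified against the reference. The function name `compare_to_reference`, the field names and the numbers are all hypothetical and are not taken from the fetlib benchmark code.

```python
import math

def compare_to_reference(reference, extracted, rel_tol=1e-6):
    """Count equal, missing and wrong values relative to the reference."""
    counts = {"equal": 0, "missing": 0, "wrong": 0}
    for field, ref_value in reference.items():
        value = extracted.get(field)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            counts["missing"] += 1  # shows up downstream as a NaN power value
        elif math.isclose(value, ref_value, rel_tol=rel_tol):
            counts["equal"] += 1
        else:
            counts["wrong"] += 1    # silent error: the dangerous case
    return counts

# Hypothetical example data (not real datasheet values)
reference = {"Qg_typ": 52.0, "Qg_max": 68.0, "tRise_typ": 12.0}
extracted = {"Qg_typ": 52.0, "Qg_max": 12.0}  # Qg_max wrong, tRise_typ missing

result = compare_to_reference(reference, extracted)
print(result)
```

Reporting the "wrong" count separately from the "equal" count would show how many silent errors each extraction method introduces, not just how many values it recovers.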

Analysis shows that the LLM either takes values from neighbouring fields or returns values that appear completely random.

The converterapi pdf2txt (or pdf2ocr2txt?) step seems to extract table contents column-wise rather than row-wise, which might explain the neighbour confusion.
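
A small sketch of why column-wise extraction could cause this kind of confusion. The table contents are made up, and whether the converter really flattens tables this way is an assumption that still needs to be confirmed.

```python
# Hypothetical datasheet table: field name, then min / typ / max
table = [
    ["Qg",  "40", "52", "68"],
    ["Qgd", "8",  "12", "16"],
]

def flatten_row_wise(table):
    """Emit cells row by row: each field name is followed by its own values."""
    return [cell for row in table for cell in row]

def flatten_column_wise(table):
    """Emit cells column by column: values of different fields interleave."""
    return [cell for column in zip(*table) for cell in column]

row_wise = flatten_row_wise(table)
column_wise = flatten_column_wise(table)

print(" ".join(row_wise))     # Qg 40 52 68 Qgd 8 12 16
print(" ".join(column_wise))  # Qg Qgd 40 8 52 12 68 16
```

In the column-wise text, the values of `Qg` and `Qgd` alternate, so a model reading the text linearly can easily assign a value to the neighbouring field.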

Might the random values come from LLM exhaustion and non-deterministic effects?
