WenjinW / LATIN-Prompt

Reproducing Question #1

Closed ajskdlf64 closed 1 year ago

ajskdlf64 commented 1 year ago

Hey guys!

Very interesting topic and high quality paper from the research team.

After reading the paper and reproducing it with reference to the repository, I had a few questions that led me to raise an issue.

  1. Is it correct that you used the bounding boxes of lines and the bounding boxes of texts in the original OCR provided by the RRC leaderboard? If so, was the performance worse when you used the bounding boxes of texts?

  2. I was wondering if the ANLS values in your paper were calculated with your own code or the results you submitted to the RRC Leaderboard?

  3. If the answer to 1 is LINE, is the following overall process correct: use LINE OCR TEXT and TEXT_BOXES to get LAYOUT_RECOVER via the SPACE_LAYOUT function, then combine it with PROMPT_TASK to query the GPT-3.5 API?

Thanks.

(P.S. There seems to be a typo in the title of the README.md :) Promot )

WenjinW commented 1 year ago

Thank you for your affirmation of our work and here are the explanations for related issues:

  1. We used the bounding boxes of lines and the bounding boxes of texts (not words) in the original OCR information provided by the RRC leaderboard. We have not tested the performance of using text boxes. Compared with line text boxes, adjacent text boxes are more likely to overlap, which may affect the accuracy of calculating the number of spaces between adjacent texts. (Recently, we have tried different OCR tools and achieved better results, which will be provided in the new version of the paper.)
  2. We used our own implementation to calculate the ANLS value on the validation set and used the results submitted to the RRC Leaderboard on the test set.
  3. Yes, we used LINE OCR TEXT and LINE TEXT_BOXES to get LAYOUT_RECOVER using SPACE_LAYOUT function and PROMPT_TASK to request GPT-3.5 API.

We hope the above content is helpful to you. If you still have any questions, please feel free to discuss them further.
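For readers following the thread, the step confirmed in point 3 (line-level OCR text plus line boxes turned into a space-aligned document) can be sketched roughly as below. This is only an illustration under stated assumptions, not the repository's SPACE_LAYOUT implementation: the function name `recover_layout`, the `(x0, y0, x1, y1)` box format, the row tolerance, and the fixed character width are all hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixels; not confirmed by the thread.
Box = Tuple[float, float, float, float]


def recover_layout(texts: List[str], boxes: List[Box],
                   char_width: float = 10.0, row_tol: float = 5.0) -> str:
    """Place OCR line texts on a character grid using spaces and newlines."""
    # Order segments top-to-bottom, then left-to-right.
    items = sorted(zip(texts, boxes), key=lambda tb: (tb[1][1], tb[1][0]))

    # Group segments whose top edge is within row_tol of the row's first segment.
    rows: List[List[Tuple[str, Box]]] = []
    for text, box in items:
        if rows and abs(box[1] - rows[-1][0][1][1]) <= row_tol:
            rows[-1].append((text, box))
        else:
            rows.append([(text, box)])

    lines = []
    for row in rows:
        row.sort(key=lambda tb: tb[1][0])
        line = ""
        for text, box in row:
            # Target column: the left edge expressed in characters.
            col = int(round(box[0] / char_width))
            # Pad up to the target column; keep at least one space between segments.
            pad = max(col - len(line), 1 if line else 0)
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    demo = recover_layout(
        ["Name", "Alice", "Age"],
        [(0, 0, 40, 10), (100, 1, 150, 11), (0, 20, 30, 30)],
    )
    print(demo)
```

The overlap concern raised in point 1 shows up here as the `max(..., 1)` guard: when two boxes overlap horizontally, the computed padding would be zero or negative, so the spacing between neighbouring texts can no longer be read off the coordinates.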
ajskdlf64 commented 1 year ago

Dear Researchers

First of all, I would like to express my deepest gratitude for your quick response to my question.

I would like to ask some additional questions as I still have some unanswered questions.

  1. In your answer you mentioned "texts (not words)". Could you please elaborate on what you mean by "not words"?

  2. I don't have a Claude account, so I'm having a bit of trouble cloning the repository and reproducing the results. Is there any way to reproduce them directly without using Claude?

  3. Below is an example of the prompts I was able to reproduce and generate. When I submitted the result of calling GPT using the prompts below, the score was 0.58, which is a big gap from the score written in the paper. Do you think there is anything wrong with the reproduction or what should I pay attention to when reproducing?

Thank you for reading this long question.

Example 1

You are asked to answer questions asked on a document image.
The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a set of contiguous text tokens present in the document.
Document:
                                                     AUG 3 0 1977
                    Mr. Robert B. Choate
                    Chairman
                    Council on Children, Media
                      and Merchandising
                    1346 Connecticut Avenue, N.W.
                    Washington, D. C. 20036
                    Dear Bob:
                    Thank you for your July 11, 1977 letter which raises several
                    interesting issues. I will respond to them in the order listed.
                    Item 1
                    You correctly point out that, under recent court decisions, the require-
                    ments of the Federal Advisory Committee Act--including the requirement
                    that meetings be announced in advance and open to the public--do not
                    apply to groups such as the Food and Nutrition Board. 'I agree with you
                    that FDA should be extremely careful about entering into agreements by
                    which the work product of such groups could automatically become the
                    basis for a rule or other action by FDA. Our recent practice indicates
                    that we will review such arrangements with caution. For example, we
                    have discouraged USP's efforts to become our "sole source" supplier of
                    patient package inserts for drugs, in part because their processes are
                    not publicly accessible in the same way ours are. We have made it clear
                    that while we would welcome their ideas, we would first seek public
                    comment on any USP-generated draft on which we proposed to reply.
                    We have also encouraged the Cosmetic, Toiletry, and Fragrance Association
                     to ensure that the processes--including virtually all meetings--of the
                    ingredient review program be open to public attendance and participation.
                    These efforts have met with some success. We also have made it clear
                    that FDA will not automatically accept the judgments of the panel as the
                    basis for regulatory action on specific ingredients.
                    Furthermore, our regulations governing participation of FDA employees
                    in outside standard-setting activities impose conditions designed to
                    enhance public access to the workings of the groups on which we can be
                    represented. See 21 CFR 10.95 (enclosed).
       : ..
       . ...
                       Source: https://www.industrydocuments.tesf/.

Question: What is the name of the sender?

Directly extract the answer of the question from the document with as few words as possible.

Answer:

Example 2

You are asked to answer questions asked on a document image.
The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a set of contiguous text tokens present in the document.
Document:
                     THE VISITING NURSE ASSOCIATION OF GREATER ST. LOUIS
                                STATEMENT OF OPERATIONS
                 FOR THE MONTH OF FEBRUARY 1977 COMPARED TO FEBRUARY 1976
                                                               Budget
                                     Actual          Budget    Over       % of
                                 1976       1977     1977      (Under)  Variance
      Income
        1 Home Visits          $243, 022  $342, 830
        2 Equipment Rental      18, 659
                                                   $329, 365$ 13, 465     4. 09
                                           25, 403   20, 475  4, 928      24. 07
        3 Miscellaneous            185        357      335      22
        4 Medicare Allowance    (25, 983)  (61, 249)
                                                                           6. 51
                                                    (35, 771)(25, 478)    71. 23
      9 TOTAL OPERATING INCOME $235, 883  $307 , 341$314 , 404$ ( 7, 063)( 2. 25)
      Expenses
        10 Salaries            $147 , 027 $184, 650$184, 207    443       0. 24
       11 Payroll Taxes          8, 580    10, 774  12, 360              (12. 83)
        12 Employee Benefits     6, 812
                                                             ( 1, 586)
                                           14, 352   14, 520
                                                     17, 280
                                                               168)
        13 Transportation        14, 096   13 , 538          ( 3, 742)
                                                                         ( 1. 16)
                                 6, 018     3, 415
                                                                         (21. 66)
        14 Supplies-Administrative                   6, 940
                                 6, 140     9 , 487  11, 920
                                                             ( 3, 525)
                                                             ( 2, 433)
                                                                         (50. 79)
                -Medical                                                 (20.41)
        16 Patient's Rental Equip.16, 686  25, 302
                                           10, 920
                                                     20, 240
                                                     10, 409
                                                               5, 062     25 . 01
       17 Occupancy              8, 692                         511       4. 91
        18 Telephone             3, 153     2, 918              665)
        19 Postage               1 , 698    1, 146
                                                      3, 583
                                                      1, 458
                                                                         (18.56)
                                                                 312)    (21.40)
        20 Auditing & Professional1, 000    2, 392    2 , 135    257      12.04
        21 Data Processing Equipment
        22 Conferences, Conventions
                                 2, 933     3, 678    3, 968    290)     ( 7.31)
           and Meetings          1, 472     2, 859    3, 333    474)     (14.22)
        23 Depreciation-Furniture
           and Fixtures            842        604    1, 550      946)    (61.03)
        24 Insurance               719      3, 661    4, 166    505 )    (12. 12)
        25 Community Education     -0-        68
        26 Other
                                                             ( 4, 098)   (98. 37)
                                 1 , 219
                                                      4, 166
                                            1, 790    1, 249     541      43. 31
        27 Organization Dues       392        400      500       100)    (20.00)
      39 TOTAL OPERATING EXPENSE$227, 479 $291, 954$303, 984 $ (12, 030) ( 3.96)
      40 NET INCOME (LOSS)
         FROM OPERATING        $ 8,404    $ 15, 387$ 10, 420$ 4, 967      47. 67
      50 United Way Income     $ 27, 917  $ 18, 713$ 27, 917 $ ( 9, 204)
      51 Indigent Care          28 , 601
                                                                         (32.97)
                                           27 , 917
                                          $ (9, 204)
                                                     27, 917    -0-
      52 Net Difference        $ ( 684)            $   -0-  $ ( 9, 204)
      53 NET INCOME OR (LOSS)  $ 7, 720   $ 6, 183 $ 10, 420 $ ( 4, 237) (40.66)
                   Source: https://www.industrydocur

Question: What financial statement is it?

Directly extract the answer of the question from the document with as few words as possible.

Answer:

Example 3

You are asked to answer questions asked on a document image.
The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a set of contiguous text tokens present in the document.
Document:
                                Food Table for "The Way to a Man's Heart"
                Food        Amount      Cal       Pro      Fat       CHO      SF      MF       PF      Chol       Fe       Alc
                M, F, P       1 oz       55       8.1      2, 1      0.2     0.8      0.8     0.2         39     1. ]
                LF Dairy    1 serv      210       9.4     10 .0     21.0     5.8      2. 6    0.6         33     0.4
                Eggs         3/wk        35       2.8      2. 4      0. 2    0. 7     1.0     0.3        108     0.5
                Fats+Oils   1 TB         94       0.3     10.3       0.5     1 .7     3.4     4.8         1
                Breads,     1 serv       70       2. 7     0 . 7    13.7     0 .1     0 . 2   0:3         tr     1.0
                Cereals
                Fruits       1 serv      54       0.5      0.2      13.9                      0. 2               0.6
                Vegetables 1 serv        35       1.6      0.9      5.9     0.2      0. 1     0.6                0. 7
                Desserts,   1 serv      124      1.0       2.2     18.0     0 . 4     0.8     1. 2               0.2      5.5
                Bev, Swts
                                         Source: https://www.industrydocuments.ucsf.edu/docs/qnmf0227

Question: How much is the iron (Fe) content in one serving of vegetables?

Directly extract the answer of the question from the document with as few words as possible.

Answer:

Example 4

You are asked to answer questions asked on a document image.
The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a set of contiguous text tokens present in the document.
Document:
                                    Dale D. Hoskins, Ph.D., Department of Biochemistry,
                                  was advanced at the beginning of the year from Associate
                                  Scientist to Scientist. Dr. Hoskins received his doctorate
                                  from the University of Colorado School of Medicine in
                                  1960 and joined the Center in 1961. He received his M.S.
                                  in biochemistry from Oregon State University in 1955.
                                   E. Rene Casillas, Ph.D., was promoted from postdoctorate
                                  fellow in the Department of Biochemistry to Assistant Sci-
                                  entist in January. Dr. Casillas came to the Center in 1968
                                  from Oregon State University, where he received his Ph.D.,
                                  and where he was assistant in the Department of Bio-
                                  chemistry and Biophysics.
         APPOINTMENTS               Mary Bell, Ph.D., has been promoted from Assistant
                                  Scientist to Associate Scientist in the Department of Cutane-
                1971
                                 ous Biology, effective in May. Dr. Bell joined the Center
                                  staff in 1964 directly from Yale University, where she com-
                                 pleted a U.S. Public Health Service Predoctoral Training
                                  Fellowship in Anatomy.
                                    Marjorie LaSalle, Ph.D., who has been an Assistant
                                  Scientist at the Center since 1963, this month was named
                                  Associate Scientist in Hematology. A graduate of Oregon
                                  State University, she earned her M.S., in Microbiology at
                                  the University of Oregon Medical School and her Ph. D.
                                  from Stanford.
                                    John A. Resko, Ph.D., was promoted to Scientist in the
                                  Department of Reproductive Physiology and Behavior,
                                  effective in May. Before coming to the Center as an Assist-
                                 ant Scientist in 1964, he was a postdoctoral fellow at the
                                  University of Utah. His major interest was steroid bio-
                                  chemistry. He received his doctorate in physiology from the
                                  University of Illinois in 1963.
       32

Question: Who joined the Center in 1963?

Directly extract the answer of the question from the document with as few words as possible.

Answer:

Example 5

You are asked to answer questions asked on a document image.
The answers to questions are short text spans taken verbatim from the document. This means that the answers comprise a set of contiguous text tokens present in the document.
Document:
               Mr. Oley M. Cummer
               Born June 9, 1889
               Attended Fort Collins Grammar School from 1895 to 1905
                        Colorado A. & M. Prep from 1905 to 1907
                        Colorado A. & M. from 1907 to 1911 and graduated with a B.S. degree,
                        majoring in Mechanical Engineering
               Permanent Employment date: July 1, 1911
               1908-1911 - Machinist during school -   Fort Collins
               July 1911-1914 - Student Technician -
               Sept. 1941-Oct. 1915 - General Foreman -
               Oct. 1915 to June 1916 - Asst. Supt.  - Windsor
               June 1916 to Sept. 1916 -- Worked for Spreckels for G.W.S.Co. checking pulp drier
                                         installation
               Sept. 1916 to Feb. 17 - Operated Pulp Drier - Gering, Nebraska
               Feb. 1917 to July 1918 - Consultant & operator of pulp engineering & installer
                                        out of Denver
               July 1918 to May 1920 - Traveling - Asst. to Schaffer, Gen. Supt.
               May 1920 to May 1926 - Supt., Brush - (Pennant 1920)
               May 1926 to May 1931 - Supt., Ovid
               May 1931 to Aug. 1936 - Asst. Supt., Scottsbluff
               Aug. 1936 to June 1940 - Supt. , Wheatland
               June 1940 to June 1943 - Supt. , Lyman
               Date of transfer to Scottsbluff as Superintendent not listed.
               Mr. Cummer was due to retire July 1, 1954, but died June 1, 1954.
                      Source: https://www.industrydocuments.ucsf.edu/docs/zphk0226

Question: When was Mr. Oley M. Cummer born?

Directly extract the answer of the question from the document with as few words as possible.

Answer:

WenjinW commented 1 year ago

Here are the explanations for your questions:

  1. The OCR results of DocVQA provided by the RRC leaderboard contain both line level and word level information. We adopt the line level information in LAYOUT_RECOVER.
  2. For users without a Claude account, we have provided an implementation based on Azure OpenAI GPT-3.5-turbo here. We have added examples with GPT-3.5-turbo here. For more details about OpenAI, please refer to here.
  3. The prompts provided above all look correct. Two points are worth noting:
     a) When using GPT-3.5-turbo, please set the temperature to 0. For other settings, refer here.
     b) You can use the anls.py we provide to calculate the score. Because network problems are common when accessing the API, we regularly store the predicted results, and on restart the experiment reads the stored result file and continues predicting the remaining samples. Therefore, when starting a new experiment, please delete the old result file or give each experiment a different experiment name.

Hope the above content is helpful to you and please feel free to discuss them further.
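As a reference point for the scoring step, ANLS (Average Normalized Levenshtein Similarity) as commonly defined for DocVQA can be sketched as follows. This is a minimal sketch, not the repository's anls.py: the function names are hypothetical, and the lowercasing and whitespace stripping are assumptions about normalization.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def anls(predictions: List[str], references: List[List[str]],
         threshold: float = 0.5) -> float:
    """Average, over questions, of the best similarity against any reference answer."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            # Similarities whose normalized distance reaches the threshold count as 0.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions) if predictions else 0.0
```

Because answers are scored by string similarity, extra words around an otherwise correct span lower the score, which is one reason the prompt asks for the answer "with as few words as possible".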

ajskdlf64 commented 1 year ago

Thanks for the update! When I run the script below, I get the following error, what is the solution?

And one more question, is there a process to infer the prompt using gpt and post-process it afterwards?

bash script/claude_eval.sh 0 gpt-35 docvqa task_instruction_space
Traceback (most recent call last):
  File "examples/claude_docvqa.py", line 22, in <module>
    from utils import claude, space_layout, openai_api
  File "./utils/claude.py", line 6, in <module>
    c = anthropic.Client(os.environ["ANTHROPIC_API_KEY"])
  File "/root/anaconda3/envs/cream/lib/python3.8/os.py", line 673, in __getitem__
    raise KeyError(key) from None
KeyError: 'ANTHROPIC_API_KEY'

WenjinW commented 1 year ago

The reason for this error is that the environment variable ANTHROPIC_API_KEY is not set. We have updated the code. Now, you only need to set the environment variables OPENAI_API_KEY and OPENAI_API_BASE to use GPT-3.5-turbo for inference normally.

Hope the above content is helpful to you and please feel free to discuss them further.
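The traceback above comes from reading `os.environ["ANTHROPIC_API_KEY"]` at import time, so the script fails even when only GPT-3.5-turbo is requested. One way to avoid that class of error is to look up credentials only for the backend actually in use. The sketch below illustrates the idea; `require_env` is a hypothetical helper and is not taken from the updated repository code.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def require_env(names: Iterable[str],
                env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the requested variables, or raise one error listing all that are missing."""
    env = os.environ if env is None else env
    names = list(names)
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return {name: env[name] for name in names}


def credentials_for(backend: str,
                    env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Look up only the variables the chosen backend needs (names from the thread)."""
    if backend == "claude":
        return require_env(["ANTHROPIC_API_KEY"], env)
    if backend == "gpt-35":
        return require_env(["OPENAI_API_KEY", "OPENAI_API_BASE"], env)
    raise ValueError(f"Unknown backend: {backend}")
```

Calling `credentials_for("gpt-35")` inside the function that issues the request, rather than at module import, means a missing Anthropic key never blocks a GPT-3.5 run.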

ajskdlf64 commented 1 year ago

I am experiencing difficulties in reproducing your paper using the official Python OpenAI library. The reproduction process is not functioning correctly. I have a few questions and concerns regarding this matter:

  1. We found that the experiment cannot be reproduced with the official Python OpenAI library. Can you provide any insight into this discrepancy?

  2. According to the OpenAI API Reference, the "completions" endpoint is only applicable to older models such as "text-davinci-003." The model you utilized, "gpt-3.5-turbo," should use "chatCompletions" instead of "completions." Could you clarify why this difference exists?

  3. Additionally, it appears that the correct model name is "gpt-3.5-turbo" (with a hyphen), not "gpt-35-turbo" as mentioned in the paper. Moreover, the argument to use in the chat completion should be "model" instead of "engine," as indicated in the latest version of the library. Can you confirm these changes?

  4. I attempted to reproduce the provided example using "gpt-3.5-turbo" and the official OpenAI library. However, the performance I achieved on the test dataset was 0.58. Could you offer any insights into why my results differ from yours?

  5. Have you attempted to reproduce the experiment using the official OpenAI library? I'm curious to know if the "LATIN-prompt" used in your paper might be overfitted to the MS Azure OpenAI model.

PS.1) Reproducing the execution by running the script in the main branch is challenging due to dependencies like "wandb." It would be greatly appreciated if you could address this aspect as well.

PS.2) My goal is to successfully reproduce your research and achieve a score of 0.81 for submission to RRC.

Your innovative and groundbreaking work has left a strong impression on me. I truly value your prompt responses and kind explanations.

Thanks.

WenjinW commented 1 year ago

  1. I think the main reason for your difficulties lies in the differences between Azure OpenAI and official OpenAI. Due to regional restrictions, we cannot obtain the official OpenAI API Key. We are trying to get the corresponding services, but unfortunately, so far, we can only develop based on Azure OpenAI. The differences between Azure OpenAI and official OpenAI can be found here.
  2. gpt-35-turbo is the model corresponding to official OpenAI's gpt-3.5-turbo provided in Azure OpenAI. Details can be found here.
  3. We use "completions" instead of "chatCompletions" because our task does not involve dialogue.
  4. We specify the engine instead of the model because we use Azure OpenAI. The usage differences between Azure OpenAI and official OpenAI can be found here.
  5. Commenting out lines 18, 174, 242, and 267 in this file can avoid the influence of wandb. We will consider removing wandb later to avoid its impact.
  6. If you want to use the official OpenAI, we recommend using the "completions" interface and the gpt-3.5-turbo model. Please note that OpenAI recently updated the gpt-3.5-turbo model. You can compare the differences between gpt-3.5-turbo-0301 and gpt-3.5-turbo-0613. LATIN-Prompt has different effects on different language models. However, 0.58 seems a bit too low. If convenient, please provide your openai_completion function based on the official OpenAI implementation.

Hope the above content is helpful to you and please feel free to discuss them further.
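The naming differences in points 2 and 4 can be summarized in a small sketch. It assumes the pre-1.0 `openai` Python library, in which Azure calls identified a deployment with `engine` and calls to the official API used `model`. The helper `completion_kwargs` is hypothetical, and it only illustrates the keyword difference: it makes no network call and says nothing about which endpoint accepts which model.

```python
from typing import Any, Dict


def completion_kwargs(backend: str, prompt: str) -> Dict[str, Any]:
    """Build request keyword arguments for the chosen backend (illustrative only)."""
    # temperature=0 follows the recommendation given earlier in this thread.
    kwargs: Dict[str, Any] = {"prompt": prompt, "temperature": 0}
    if backend == "azure":
        # Azure OpenAI: the deployment is named, here "gpt-35-turbo" as in the thread.
        kwargs["engine"] = "gpt-35-turbo"
    elif backend == "openai":
        # Official OpenAI API: the model is named directly.
        kwargs["model"] = "gpt-3.5-turbo"
    else:
        raise ValueError(f"Unknown backend: {backend}")
    return kwargs


if __name__ == "__main__":
    print(completion_kwargs("azure", "Question: ...\n\nAnswer:"))
    print(completion_kwargs("openai", "Question: ...\n\nAnswer:"))
```

In that library generation the resulting dictionary would be unpacked into the create call, for example `openai.Completion.create(**kwargs)`.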

ajskdlf64 commented 1 year ago

I understood this to mean to use the official OPENAI API, as a combination of "gpt-3.5-turbo" and "completions".

However, that combination is no longer supported by the official API.

I've implemented openai_completion and openai_chat_completion like below by referring to LATIN's code.

When using "gpt-3.5-turbo", openai_completion throws an error, while openai_chat_completion runs but yields a score of 0.58.

import openai
openai.api_key = "---"

def openai_completion(prompt):

    while True:
        try:
            response = openai.Completion.create(
                model="gpt-3.5-turbo",
                prompt=prompt,
                max_tokens=200,
                temperature=0,
                stop="\n\n",
            )
            text = response['choices'][0]['text']
            break
        except TypeError:
            print(f"TypeError, maybe the prompt is too long: {len(prompt)}. Reducing the prompt length.")
            if len(prompt) > 4000:
                question_idx = prompt.rfind("\n\nQuestion:")
                prompt = prompt[:4000-len(prompt[question_idx:])] + prompt[question_idx:]
            else:
                question_idx = prompt.rfind("\n\nQuestion:")
                prompt = prompt[:question_idx-300] + prompt[question_idx:]
            continue

    return text

def openai_chat_completion(prompt):

    messages = [{
        "role": "user",
        "content": prompt,
    }]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=200,
        temperature=0,
        stop="\n\n",
    )
    text = response['choices'][0]['message']['content']

    return text

completions

openai_completion(prompt="Write a tagline for an ice cream shop. ")
InvalidRequestError: This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?

chatCompletions

openai_chat_completion(prompt="Write a tagline for an ice cream shop. ")
'"Scoops of happiness in every cone!"'

WenjinW commented 1 year ago

We are trying to adapt LATIN-Prompt to the latest chatCompletions interface and analyze the differences between it and the completions interface. An alternative approach to verifying the effect of LATIN-Prompt is to use older models like text-davinci-003.

Thank you for your support of our project. We will give you feedback as soon as we complete the adaptation.

WenjinW commented 1 year ago

@ajskdlf64 Hello, we recently tried chatCompletions with LATIN-Prompt and the result was around 0.58. We also conducted some preliminary experiments and found that the system message has a significant impact on the results. We believe the performance difference between chatCompletions and completions is mainly due to the system message, which our method does not take into account. We are analyzing how placing the prompt in the system message differs from placing it in the user content. You may want to experiment along these lines as well.
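One way to run such an experiment is to compare two message layouts built from the same prompt text: everything in the user turn, or the task description moved into the system message. The sketch below is an assumption about how that comparison could be set up, not a fix confirmed by the authors. `build_messages` is a hypothetical helper, and the prompt wording is copied from the examples earlier in this thread.

```python
from typing import Dict, List

TASK_DESCRIPTION = (
    "You are asked to answer questions asked on a document image.\n"
    "The answers to questions are short text spans taken verbatim from the document. "
    "This means that the answers comprise a set of contiguous text tokens present "
    "in the document."
)
ANSWER_INSTRUCTION = (
    "Directly extract the answer of the question from the document "
    "with as few words as possible."
)


def build_messages(document: str, question: str,
                   use_system: bool = True) -> List[Dict[str, str]]:
    """Return chat messages with the task description in the system or the user turn."""
    body = (
        f"Document:\n{document}\n\n"
        f"Question: {question}\n\n"
        f"{ANSWER_INSTRUCTION}\n\n"
        "Answer:"
    )
    if use_system:
        return [
            {"role": "system", "content": TASK_DESCRIPTION},
            {"role": "user", "content": body},
        ]
    # Single-turn variant: same text as the completion-style prompt.
    return [{"role": "user", "content": TASK_DESCRIPTION + "\n" + body}]
```

Scoring both variants on the same validation subset, with temperature 0, would show how much of the gap is attributable to message placement rather than to the model version.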

ajskdlf64 commented 1 year ago

I commend the authors for their dedicated research and passion. I have indeed noticed that the performance can vary depending on the version and usage of OpenAI's API.

I hope this line of research continues and leads to even stronger papers.

Thank you!