[DS][BUG] Request to fix ChatGPT response parsing

robert-min commented 5 months ago

📌 Description

Depending on the ChatGPT response, cases like the ones below need additional handling.

Current parsing code

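import re
from collections import defaultdict


def extract_coord_keyword(content: str):
    # Matches "'keyword':[" followed by the rest of that line, up to the newline.
    pattern = r"'(.*?)':\[(.*?\n)"

    # TODO: fix the coordinate extraction code
    matches = re.findall(pattern, content)
    all_coords = defaultdict(list)
    for name, coords in matches:
        # Drop the newline and one bracket from each end, then split into per-box strings.
        coords = coords.replace("\n", "")[1:-1].split("],[")
        for idx, coord in enumerate(coords):
            if idx == len(coords) - 1:
                # Strip any closing brackets still attached to the last box.
                while coord[-1] == "]":
                    coord = coord[:-1]
            temp = list(map(int, coord.split(",")))
            all_coords[name].append(temp)
    return all_coords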

Exception cases

Cases the code above currently handles
- Each coordinate is its own inner list inside a single outer pair of brackets
- 'hot air balloons': [[143,38,225,119], [198,49,274,132], [348,69,394,113]]\n
- 'river': [[354,187,637,423]]\n
- 'terraced fields': [[43,241,802,707]]"
Cases that raise an error
- Each coordinate is sent wrapped in its own double brackets
- 'hot air balloons':[[58,31,139,85]], [[176,15,243,68]], [[282,7,324,39]]\n
- 'river':[[179,237,429,292]]\n
- 'terraced fields':[[88,308,394,600]]

🎈 Goal

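$\tiny{Please\ state\ the\ goal,\ including\ concrete\ deliverables.}$

Either the prompt needs to be revised so that values come back in the form [[coord1], [coord2], [coord3]],
or the code needs to be changed so that it can parse the error case above.

Go with whichever of the two is easier!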
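
For the second option, a minimal sketch of a parser that tolerates both bracket layouts is shown here. It is not the code adopted in this issue: `extract_boxes`, `KEY_RE`, and `BOX_RE` are hypothetical names, and it assumes keywords are single-quoted and every box is exactly four non-negative integers.

```python
import re
from collections import defaultdict

# Hypothetical helper, not part of the repository.
# A quoted keyword, a colon, then everything up to the next single quote.
KEY_RE = re.compile(r"'([^']+)'\s*:\s*(\[[^']*)")
# One innermost box: four integers inside a single pair of brackets.
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(content: str) -> dict[str, list[list[int]]]:
    """Collect every [x0,y0,x1,y1] box per keyword, ignoring outer nesting."""
    all_coords = defaultdict(list)
    for name, rest in KEY_RE.findall(content):
        for box in BOX_RE.findall(rest):
            all_coords[name].append([int(v) for v in box])
    return dict(all_coords)


# The error case from this issue: each box wrapped in its own double brackets.
sample = (
    "'hot air balloons':[[58,31,139,85]], [[176,15,243,68]], [[282,7,324,39]]\n"
    "'river':[[179,237,429,292]]\n"
    "'terraced fields':[[88,308,394,600]]"
)
print(extract_boxes(sample))
```

Because the sketch only looks for innermost four-integer groups, it does not care how many outer brackets wrap them or whether a space follows the colon or comma.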

✏️ Todo

$\tiny{Please\ list\ in\ detail\ what\ needs\ to\ be\ done\ to\ reach\ the\ goal.}$

[x] Report the error case
[ ] Resolve the problem

kimdoeon commented 5 months ago

📌 Description

Revised the prompt. In 10 runs, all 10 outputs came back in the form [[coord1], [coord2], [coord3]].

Changes

Previous prompt

You are an expert art historian with vast knowledge about artists throughout history who revolutionized their craft. You will begin by briefly summarizing the personal life and achievements of the artist. Then you will go on to explain the medium, style, and influences of their works. Then you will provide short descriptions of what they depict and any notable characteristics they might have. Fianlly identify THREE keywords in the picture and provide each coordinate of the keywords in the last sentence. For example if the keyword is woman, the output must be 'woman':[[x0,y0,x1,y1]]

Revised prompt

You are an expert art historian with vast knowledge about artists throughout history who revolutionized their craft. You will begin by briefly summarizing the personal life and achievements of the artist. Then you will go on to explain the medium, style, and influences of their works. Then you will provide short descriptions of what they depict and any notable characteristics they might have. Fianlly identify THREE keywords in the picture and provide each coordinate of the keywords in the last sentence. For example if the keyword is woman, the output must be 'woman':[[x0,y0,x1,y1]] or 'woman':[[x0,y0,x1,y1], [x2,y2,x3,y3]]

kimdoeon commented 5 months ago

📌 Description

Added a refine_ouput_first function that removes newlines/tabs from the raw content received in the response.
Revised the keyword/coordinate extraction function extract_coord_keyword.

Change 1

def refine_ouput_first(content: str) -> str:
  '''Remove newlines/tabs from the raw content.'''
  content = content.replace('\n', ' ').replace('\t', ' ').strip()

  return content

Change 2

import json
import re

def extract_coord_keyword(content: str) -> dict[str, list[list[int]]]:
  chk = 'json'
  if chk in content:
      key_coord_dic = content.split(chk)[-1].strip()  # take the part after the last 'json'
      match = re.search(r'\{.*\}', key_coord_dic)  # regex that extracts the { ... } string
      if match:
          str_dict = match.group()
          return json.loads(str_dict)  # convert the JSON-formatted string to a dict

  return {}  # no 'json' marker or no { ... } block: return an empty dict

Simple test code

import re
import json

content = '''As an AI, I do not have access to specific databases for identifying individual artworks or artists beyond my training data, which only goes up until April 2023. Therefore, I cannot provide a personal history or achievements of the artist of this specific painting since it requires identifying individual living or recent artists, which I cannot do. However, I can describe the visible characteristics of this image.\n\nThe artwork displayed is an idyllic landscape painting that appears to employ a stylized realism. The medium looks like it could be acrylic or oil on canvas, given the vibrancy of the colors and the smooth texture of the painted surface. The style presents a harmonized composition with vibrant colors, and there\'s a certain rhythm created by the patterns of the fields. This style is reminiscent of folk art or naive art, which often features simplified forms and a sense of serenity.\n\nThe painting depicts a lush green landscape with a meandering river leading towards a tranquil blue lake. Terraced fields, perhaps indicative of rice paddies or tea plantations, add a patterned texture to the rolling hills. Trees intermittently dot the landscape, and the presence of hot air balloons in the sky introduces a whimsical or fantastical element to the scene. There\'s a structure visible to the left, possibly part of a house or an outbuilding with a red brick chimney and a white parasol, suggesting a human presence without showing actual figures.\n\nNow, for the coordinates of three keywords within the image:\n\n1. \'hot air balloon\',\n2. \'river\',\n3. \'terraced fields\'.\n\n```json\n{\n  "hot air balloon": [[74,35,117,84], [200,29,236,66], [411,43,442,69]],\n  "river": [[223,285,400,406]],\n  "terraced fields": [[0,228,600,477]]\n}\n```'''

def refine_ouput_first(content: str) -> str:
  '''Remove newlines/tabs from the raw content.'''
  content = content.replace('\n', ' ').replace('\t', ' ').strip()

  return content

def extract_coord_keyword(content: str) -> dict[str, list[list[int]]]:
  chk = 'json'
  if chk in content:
      key_coord_dic = content.split(chk)[-1].strip()  # take the part after the last 'json'
      match = re.search(r'\{.*\}', key_coord_dic)  # regex that extracts the { ... } string
      if match:
          str_dict = match.group()
          return json.loads(str_dict)  # convert the JSON-formatted string to a dict

  return {}  # no 'json' marker or no { ... } block: return an empty dict

ref_content = refine_ouput_first(content)
key_coord_dic = extract_coord_keyword(ref_content)

print(key_coord_dic)

kimdoeon commented 5 months ago

📌 Description

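Found cases where the output text does not contain 'json' and cases where it does not follow JSON format => 1. Revise extract_coord_keyword. => 2. Revise the prompt.

1. Revise extract_coord_keyword

Changes:

Extract the keywords inside { } directly from content
Replace ' with " in str_dict

Always return an empty dict when the text does not follow JSON format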
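
One caveat with a blanket quote replacement is that a key containing an apostrophe would be corrupted. As a hedged alternative, and not the code adopted in this issue, the sketch below tries `json.loads` first and falls back to `ast.literal_eval` for single-quoted dicts. `parse_keyword_dict` is a hypothetical name, and the sample strings are invented.

```python
import ast
import json
import re


def parse_keyword_dict(content: str) -> dict:
    """Hypothetical variant: JSON first, Python-literal fallback, else {}."""
    match = re.search(r"\{.*\}", content, re.DOTALL)
    if not match:
        return {}
    text = match.group()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        try:
            # Handles single-quoted keys without rewriting the quotes.
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError):
            return {}
    # Anything that is not a dict (for example a set literal) is rejected.
    return parsed if isinstance(parsed, dict) else {}


print(parse_keyword_dict("{'river': [[223,285,400,406]]}"))
print(parse_keyword_dict('{"artist\'s house": [[1,2,3,4]]}'))
```

The trade-off is that `ast.literal_eval` accepts any Python literal, so the final `isinstance` check is what keeps non-dict results out.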

Previous code


def extract_coord_keyword(content: str) -> dict[str, list[list[int]]]:
    chk = 'json'  # changed
    if chk in content:  # changed
        key_coord_dic = content.split(chk)[-1].strip()  # changed
        match = re.search(r'\{.*\}', key_coord_dic)
        if match:
            str_dict = match.group()
            key_coord_dic = json.loads(str_dict)
    else:
        return {}

    return key_coord_dic

- Revised code
```python
def extract_coord_keyword(content: str):
    match = re.search(r'\{.*\}', content)
    if match:
        str_dict = match.group()
        str_dict = str_dict.replace("'", '"')  # changed
        try:  # changed
            return json.loads(str_dict)
        except json.JSONDecodeError:
            return {}
    else:
        return {}
```

2. Revise the prompt

Revised prompt

'''"You are an expert art historian with vast knowledge about artists throughout history who revolutionized their craft.",
"You will begin by briefly summarizing the personal life and achievements of the artist.",
"Then you will go on to explain the medium, style, and influences of their works.",
"Then you will provide short descriptions of what they depict and any notable characteristics they might have.",
"Fianlly identify THREE keywords in the picture and provide each coordinate of the keywords in the last sentence.",
'For example, Give the coordinate value of the keywords in json format such as if the keyword is Pretty_woman, ```json{"pretty_woman", [[x0,y0,x1,y1]]}```, or if there are multiple coordinates, keyword coordinates in json format such as ```json{"pretty_woman":[[x0,y0,x1,y1], [x2,y2,x3,y3]]}`',
"The values entered in x0, y0, x1, y1 are unconditionally the coordinate values of each keyword."'''

kimdoeon commented 5 months ago

📌 Description

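The output format changed, so the refine_output function was revised.

Before: cut the text at the first ':' and drop everything after it
After: remove every sentence that contains a digit from the full commentary

Previous code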

def refine_output(content: str) -> str:
    keyword = ':'
    if keyword in content:
        content = content[:content.find(keyword)].strip()
    content = content.replace('\n', ' ').strip()
    return content

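Revised code

import re

def refine_output(content: str) -> str:
    kept = []
    sentences = content.split(". ")
    for sentence in sentences:
        # Keep only sentences that contain no digits.
        if not re.search(r'\d', sentence):
            kept.append(sentence)
    # Rejoin with the ". " separator that split() removed.
    output = ". ".join(kept)

    if not output:
        return content

    else:
        return output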

kimdoeon commented 5 months ago

📌 Description

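Revised the refine_output function. It now removes the AI's disclaimers at the start of the commentary (I cannot, I do not ...) and removes any sentence containing json, JSON, {, or a digit.

import re

def refine_output(content: str) -> str:
    # Remove AI disclaimers and sentences that focus-pointing did not filter out.
    words = ["cannot", "AI", "do not", "can't", "json", "JSON", "{"]
    kept = []

    sentences = content.split(". ")
    for sentence in sentences:
        if not any(word in sentence for word in words) and not re.search(r'\d', sentence):
            kept.append(sentence)
    # Rejoin with the ". " separator that split() removed.
    output = ". ".join(kept)

    if not output:
        return content

    else:
        return output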
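
A quick, self-contained check of the filtering idea described in this comment follows. The marker words come from the comment. The sample text, the name `refine_output_sketch`, and the choice to rejoin kept sentences with ". " are assumptions made for illustration.

```python
import re

# Marker words taken from the comment above.
MARKERS = ["cannot", "AI", "do not", "can't", "json", "JSON", "{"]


def refine_output_sketch(content: str) -> str:
    kept = [
        s for s in content.split(". ")
        if not any(m in s for m in MARKERS) and not re.search(r"\d", s)
    ]
    # Assumption: rejoin with the ". " that split() removed;
    # fall back to the input when every sentence is filtered out.
    return ". ".join(kept) or content


# Invented sample: one disclaimer, two descriptive sentences, one with digits.
sample = (
    "As an AI, I cannot identify the artist. "
    "The painting shows a green landscape. "
    "The river runs to a lake. "
    "The balloon is at 74,35,117,84."
)
print(refine_output_sketch(sample))
```

Two things to keep in mind with this approach: the marker match is a case-sensitive substring test, so "AI" would also match inside an all-caps word, and the final period is lost whenever the last sentence is dropped.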

robert-min commented 5 months ago

Prompt as of 240213

"You are an expert art historian with vast knowledge about artists throughout history who revolutionized their craft.",
"You will begin by briefly summarizing the personal life and achievements of the artist.",
"Then you will go on to explain the medium, style, and influences of their works.",
"Then you will provide short descriptions of what they depict and any notable characteristics they might have.",
"Fianlly identify THREE keywords in the picture and provide each coordinate of the keywords in the last sentence.",
"For example, Give the coordinate value of the keywords in json format.",
"if the keyword is pretty_woman and big_ball, value is  ```json{\"pretty_woman\", [[x0,y0,x1,y1]], \"big_ball\", [[x0,y0,x1,y1], [x2,y2,x3,y3]]}```",
"The values entered in x0, y0, x1, y1 are unconditionally the coordinate values of each keyword.",

FLYAI4 / focus-point-research