boostcampaitech7/level2-cv-datacentric-cv-07

🏆 Multilingual Receipt OCR

[👀Model](#final-model) | [🤔Issues](https://github.com/boostcampaitech7/level2-objectdetection-cv-07/issues) | [🚀External Data](#external-data---cord)

Introduction

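It is easy to focus on model architecture and algorithms, but in practice data quality matters as much as the model itself. In this competition, we take a Data-Centric AI approach to an OCR task: detecting text in multilingual (Chinese, Japanese, Thai, Vietnamese) receipt images.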

Goal : Develop a model that detects text regions in multilingual receipt images
Data : JPG images containing text, annotated in UFO format (400 training images, 120 test images)
Metric : DetEval (Final Precision, Final Recall, Final F1-Score)
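
As a quick sanity check on the metric, the sketch below assumes that the final F1-score is the harmonic mean of the final precision and recall, which is the usual definition. `f1_score` is a hypothetical helper written for this README, not part of the DetEval code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported in the Project Overview section.
print(round(f1_score(0.9427, 0.8801), 4))  # 0.9103
```

Under that assumption, the reported precision and recall reproduce the reported F1 to four decimal places.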

Project Overview

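In the early stage, we ran EDA and analyzed the baseline code to build a basic understanding of the data and the model. We then ran experiments with external and synthetic data, data cleansing, and augmentation to improve the model's generalization. Finally, we applied a 5-fold ensemble to obtain our best result.
The final submission achieved precision 0.9427, recall 0.8801, and F1 0.9103, placing 4th on the leaderboard.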
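
The 5-fold setup can be pictured as follows. This is a minimal sketch, assuming the 400 training images are shuffled once and partitioned into five equal validation folds. `make_folds`, the seed, and the placeholder image IDs are illustrative assumptions; the actual split and the way the five models' predictions were ensembled are not described in this README.

```python
import random

def make_folds(items, k=5, seed=42):
    """Shuffle items once and yield (train, val) lists for each of k folds."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k near-equal partitions
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val

image_ids = [f"img_{n:03d}" for n in range(400)]  # placeholder IDs, not real file names
splits = list(make_folds(image_ids))
for fold_idx, (train, val) in enumerate(splits):
    print(fold_idx, len(train), len(val))  # each fold: 320 train, 80 val
```

Each image appears in exactly one validation fold, so one model can be trained per fold and the five models combined at inference time.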

Model

The baseline model is EAST (An Efficient and Accurate Scene Text Detector; Zhou et al., 2017), with a VGG-16 backbone (Visual Geometry Group, 16 layers; Simonyan and Zisserman, 2015) pretrained on ImageNet.

Data

dataset
├── chinese_receipt
│   ├── img # train and test images
│   └── ufo # annotation files for the train and test images (UFO format)
├── japanese_receipt
│   ├── img # train and test images
│   └── ufo # annotation files for the train and test images (UFO format)
├── thai_receipt
│   ├── img # train and test images
│   └── ufo # annotation files for the train and test images (UFO format)
└── vietnamese_receipt
    ├── img # train and test images
    └── ufo # annotation files for the train and test images (UFO format)
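
To make the layout concrete, the sketch below builds the image and annotation directories for each language and counts the annotated words in a UFO-style annotation. It assumes the UFO JSON nests as `images` -> file name -> `words` -> word ID -> `points`. That nesting, the sample data, and the helper names are assumptions for illustration, so check them against the actual annotation files.

```python
from pathlib import Path

LANGUAGES = ["chinese", "japanese", "thai", "vietnamese"]

def language_dirs(root):
    """Map each language to its (img, ufo) directories under root."""
    root = Path(root)
    return {
        lang: (root / f"{lang}_receipt" / "img", root / f"{lang}_receipt" / "ufo")
        for lang in LANGUAGES
    }

def count_words(ufo):
    """Count word boxes per image in a UFO-style dict (assumed schema)."""
    return {name: len(info.get("words", {})) for name, info in ufo["images"].items()}

# Hand-written sample in the assumed schema, not real data.
sample = {
    "images": {
        "receipt_0001.jpg": {
            "words": {
                "0001": {"points": [[10, 10], [90, 10], [90, 30], [10, 30]]},
                "0002": {"points": [[10, 40], [60, 40], [60, 60], [10, 60]]},
            }
        }
    }
}

dirs = language_dirs("dataset")
print(dirs["thai"][0])       # image directory for Thai receipts
print(count_words(sample))   # number of word boxes per image
```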

User Guide

cd code # move into the code directory
python train.py # train the model
python validate.py # load the trained weights and run validation
python test.py # load the weights with the highest validation score and run inference on the test set

File Tree

├── .github
├── external-data
│   ├── cord-data
│   └── synthetic-data
├── code
│   └── model code
└── README.md

External Data - CORD

License and Data Attribution

This project uses CORD (Consolidated Receipt Dataset for Post-OCR Parsing). The dataset is provided under the CORD license terms, and we adhere to these terms within this repository.

Attribution

Dataset Name: Consolidated Receipt Dataset for Post-OCR Parsing (CORD)
Provider: NAVER AI Lab
License: This dataset is provided under the terms specified in the CORD documentation.

For full details on the CORD license and permissions, please refer to the official CORD documentation.

Environment Setting

System Information		Tools and Libraries
Category	Details	Category	Details
Operating System	Linux 5.4.0	Git	2.25.1
Python	3.10.13	Conda	23.9.0
GPU	Tesla V100-SXM2-32GB	Tmux	3.0a
CUDA	12.2

Supported by Naver BoostCamp AI Tech.

👥 Team Members of LuckyVicky


🍀이동진	🍀정지환	🍀유정선	🍀신승철	🍀김소정	🍀서정연
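Server management, failure analysis, ensembling	Data preprocessing, augmentation	EDA, data preprocessing, augmentation	Data preprocessing, augmentation	Data synthesis, scheduling, documentation	Training on external datasets, Git management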