Enhancement/update paper parse logics

Issue URL

N/A

Change overview

Add stricter rules to the mmd parsing logic
Complete the update_paper_data.py implementation

How to test

[!Important] The following test CANNOT BE DONE unless you are an authorized member of cvpaper challenge with access to our organization's AWS Secrets Manager.

1. Prerequisite

Prepare API keys for the OpenAI API and Qdrant Cloud

[!Important] Unlike #33, you need to prepare your own Qdrant Cloud cluster to test the paper parse & upload logic.

Create your own environments/.env by referring to environments/.env.sample, and specify the secret environment variables

You might be able to get AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY by asking @gatheluck, who manages the AWS account I used for the Crux application
You can get AWS_DEFAULT_REGION and DYNAMODB_TABLE_NAME from here

OPENAI_API_KEY=<PLEASE_WRITE_YOUR_OPENAI_API_KEY_HERE>

# AWS service settings
AWS_ACCESS_KEY_ID=<PLEASE_WRITE_YOUR_AWS_ACCESS_KEY_ID_HERE>
AWS_SECRET_ACCESS_KEY=<PLEASE_WRITE_YOUR_AWS_SECRET_ACCESS_KEY_HERE>
AWS_DEFAULT_REGION=<PLEASE_WRITE_YOUR_AWS_DEFAULT_REGION_HERE>
DYNAMODB_TABLE_NAME=<PLEASE_WRITE_YOUR_DYNAMODB_TABLE_NAME_HERE>

# Qdrant cloud settings
QDRANT_CLOUD_URL=<PLEASE_WRITE_YOUR_QDRANT_CLOUD_URL_HERE>
QDRANT_API_KEY=<PLEASE_WRITE_YOUR_QDRANT_API_KEY_HERE>
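
Before running anything that costs money, it may help to confirm that every variable listed above is actually filled in. The following is a minimal sketch, not part of this PR: `REQUIRED_VARS` and `find_missing_vars` are hypothetical names, and the check assumes unfilled values still look like the `<PLEASE_WRITE_..._HERE>` placeholders.

```python
import os

# Variable names copied from the .env listing above.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "DYNAMODB_TABLE_NAME",
    "QDRANT_CLOUD_URL",
    "QDRANT_API_KEY",
]


def find_missing_vars(env=None, required=REQUIRED_VARS):
    """Return the names that are unset, empty, or still a <...> placeholder.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    if env is None:
        env = os.environ
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        # Treat "<PLEASE_WRITE_..._HERE>" style values as not filled in.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing
```

Calling `find_missing_vars()` after the variables have been loaded into the process environment returns an empty list when everything is set.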

Download the Mathpix-format paper data, which has been reviewed and corrected for errors, from here, and place it as shown below.

├─ data/
│    ├─ papers/
│    │    ├─ CVPR2023/
│    │    │    ├─ 0000_GFPose_Learning_3D_Human_Pose_Prior_With_Gradient_Fields/
│    │    │    │    └─ Ci_GFPose_Learning_3D_Human_Pose_Prior_With_Gradient_Fields_CVPR_2023_paper_mathpix.txt
│    │    │    │
│    │    │    ├── 0001_CXTrack_Improving_3D_Point_Cloud_Tracking_With_Contextual_Information/
│    │    │    │    └─ Xu_CXTrack_Improving_3D_Point_Cloud_Tracking_With_Contextual_Information_CVPR_2023_paper_mathpix.txt
│    │    │    │
│    │    │    ...
│    │    │    └── 2352_Curvature-Balanced_Feature_Manifold_Learning_for_Long-Tailed_Classification/
│    │    │         └─ Ma_Curvature-Balanced_Feature_Manifold_Learning_for_Long-Tailed_Classification_CVPR_2023_paper_mathpix.txt
│    │    │
│    │    └── ICCV2023/
│    │         ├─ 0000_Towards_Attack-tolerant_Federated_Learning_via_Critical_Parameter_Analysis/
│    │         │    └─ Han_Towards_Attack-tolerant_Federated_Learning_via_Critical_Parameter_Analysis_ICCV_2023_paper_mathpix.txt
│    │         │
│    │         ├── 0001_Stochastic_Segmentation_with_Conditional_Categorical_Diffusion_Models/
│    │         │    └─ Zbinden_Stochastic_Segmentation_with_Conditional_Categorical_Diffusion_Models_ICCV_2023_paper_mathpix.txt
│    │         │
│    │         ...
│    │         └── 2155_PreSTU_Pre-Training_for_Scene-Text_Understanding/
│    │              └─ Kil_PreSTU_Pre-Training_for_Scene-Text_Understanding_ICCV_2023_paper_mathpix.txt
│    │
│    └ README.md
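
Since the upload takes hours, a quick check of the directory layout before starting can save a failed run. The sketch below is an illustration only and is not the validation used by the script in this PR: `find_layout_problems` is a hypothetical helper, and it assumes each paper directory should contain exactly one `*_mathpix.txt` file, as in the tree above.

```python
from pathlib import Path


def find_layout_problems(root):
    """Return human-readable problems found under `root` (e.g. data/papers).

    Expected layout (assumed from the tree above):
        <root>/<conference>/<paper_dir>/<something>_mathpix.txt
    """
    problems = []
    root = Path(root)
    # First level: one directory per conference (CVPR2023, ICCV2023, ...).
    for conf_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Second level: one directory per paper.
        for paper_dir in sorted(p for p in conf_dir.iterdir() if p.is_dir()):
            files = list(paper_dir.glob("*_mathpix.txt"))
            if len(files) != 1:
                problems.append(
                    f"{paper_dir}: expected 1 *_mathpix.txt file, found {len(files)}"
                )
    return problems
```

For example, `find_layout_problems("data/papers")` returns an empty list when every paper directory holds exactly one Mathpix file.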

2. Local Test

[OPTIONAL] Remove the existing containers

# Move to the directory that has `docker-compose.yaml`
~/Crux$ cd environments/cpu

# Remove the existing docker containers
~/Crux/environments/cpu$ docker compose down

[OPTIONAL] Boot up containers without using cache

# Re-build docker images without using cache
~/Crux/environments/cpu$ docker compose build --no-cache

# Boot up docker containers
~/Crux/environments/cpu$ docker compose up -d

Run the parse & upload script

~/crux-backend$ poetry run python src/scripts/upload_paper_data.py -p data/papers

Note for reviewers

[!Note] Parsing finishes in about a minute, but uploading the embedding vectors to Qdrant Cloud takes a few hours.

[!Caution] Running the test incurs charges for the OpenAI API's embedding model.

cvpaperchallenge / Crux

Enhancement/update paper parse logics #34