cvpaperchallenge / Crux

Crux is a suite of LLM-empowered summarization and retrieval services for academic activity. Crux is developed by XCCV group of cvpaper.challenge.
MIT License

Enhancement/update paper parse logics #34

Closed YoshikiKubotani closed 4 months ago

YoshikiKubotani commented 4 months ago

Issue URL

N/A

Change overview

How to test

[!Important] The following test CANNOT BE DONE unless you are an authorized member of cvpaper.challenge with access to our organization's AWS Secrets Manager.

1. Prerequisite

  1. Prepare API keys for the OpenAI API and Qdrant Cloud

[!Important] Unlike #33, you need to prepare your own Qdrant Cloud cluster to test the paper parse & upload logic.

  2. Create your own environments/.env by referring to environments/.env.sample, and specify the secret environment variables

    • You may be able to obtain AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY by asking @gatheluck, who manages the AWS account I used for the Crux application
    • You can get AWS_DEFAULT_REGION and DYNAMODB_TABLE_NAME from here
    OPENAI_API_KEY=<PLEASE_WRITE_YOUR_OPEN_API_KEY_HERE>
    
    # AWS service settings
    AWS_ACCESS_KEY_ID=<PLEASE_WRITE_YOUR_AWS_ACCESS_KEY_ID_HERE>
    AWS_SECRET_ACCESS_KEY=<PLEASE_WRITE_YOUR_AWS_SECRET_ACCESS_KEY_HERE>
    AWS_DEFAULT_REGION=<PLEASE_WRITE_YOUR_AWS_DEFAULT_REGION_HERE>
    DYNAMODB_TABLE_NAME=<PLEASE_WRITE_YOUR_DYNAMODB_TABLE_NAME_HERE>
    
    # Qdrant cloud settings
    QDRANT_CLOUD_URL=<PLEASE_WRITE_YOUR_QDRANT_CLOUD_URL_HERE>
    QDRANT_API_KEY=<PLEASE_WRITE_YOUR_QDRANT_API_KEY_HERE>
  3. Download the Mathpix-format paper data, which has been reviewed and corrected for errors, from here, and place the files as shown below.

    ├─ data/
    │    ├─ papers/
    │    │    ├─ CVPR2023/
    │    │    │    ├─ 0000_GFPose_Learning_3D_Human_Pose_Prior_With_Gradient_Fields/
    │    │    │    │    └─ Ci_GFPose_Learning_3D_Human_Pose_Prior_With_Gradient_Fields_CVPR_2023_paper_mathpix.txt
    │    │    │    │
    │    │    │    ├─ 0001_CXTrack_Improving_3D_Point_Cloud_Tracking_With_Contextual_Information/
    │    │    │    │    └─ Xu_CXTrack_Improving_3D_Point_Cloud_Tracking_With_Contextual_Information_CVPR_2023_paper_mathpix.txt
    │    │    │    │
    │    │    │    ...
    │    │    │    └─ 2352_Curvature-Balanced_Feature_Manifold_Learning_for_Long-Tailed_Classification/
    │    │    │         └─ Ma_Curvature-Balanced_Feature_Manifold_Learning_for_Long-Tailed_Classification_CVPR_2023_paper_mathpix.txt
    │    │    │
    │    │    └─ ICCV2023/
    │    │         ├─ 0000_Towards_Attack-tolerant_Federated_Learning_via_Critical_Parameter_Analysis/
    │    │         │    └─ Han_Towards_Attack-tolerant_Federated_Learning_via_Critical_Parameter_Analysis_ICCV_2023_paper_mathpix.txt
    │    │         │
    │    │         ├─ 0001_Stochastic_Segmentation_with_Conditional_Categorical_Diffusion_Models/
    │    │         │    └─ Zbinden_Stochastic_Segmentation_with_Conditional_Categorical_Diffusion_Models_ICCV_2023_paper_mathpix.txt
    │    │         │
    │    │         ...
    │    │         └─ 2155_PreSTU_Pre-Training_for_Scene-Text_Understanding/
    │    │              └─ Kil_PreSTU_Pre-Training_for_Scene-Text_Understanding_ICCV_2023_paper_mathpix.txt
    │    │
    │    └─ README.md
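
Before moving on to the local test, it may help to confirm that every variable in environments/.env has actually been filled in. The following is a minimal sketch, not part of the Crux codebase: the variable names come from the sample above, while the helper names `parse_env` and `find_unset_keys` are hypothetical, and it assumes a value still wrapped in `<...>` is an unedited placeholder.

```python
from pathlib import Path

# Variable names taken from the .env sample shown in the prerequisite steps.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "DYNAMODB_TABLE_NAME",
    "QDRANT_CLOUD_URL",
    "QDRANT_API_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def find_unset_keys(text: str) -> list[str]:
    """Return required keys that are missing, empty, or still a <PLACEHOLDER>."""
    values = parse_env(text)
    unset = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        # Assumption: an unedited sample value looks like <PLEASE_WRITE_..._HERE>.
        if not value or (value.startswith("<") and value.endswith(">")):
            unset.append(key)
    return unset


if __name__ == "__main__":
    env_path = Path("environments/.env")
    if not env_path.is_file():
        print(f"{env_path} not found")
    else:
        unset = find_unset_keys(env_path.read_text(encoding="utf-8"))
        print("All variables are set." if not unset else f"Still unset: {', '.join(unset)}")
```

The check only looks at whether a value is present; it does not verify that the keys are valid for OpenAI, AWS, or Qdrant Cloud.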

2. Local Test

  1. [OPTIONAL] Remove the existing containers

    # Move to the directory that has `docker-compose.yaml`
    ~/Crux$ cd environments/cpu
    
    # Remove the existing docker containers
    ~/Crux/environments/cpu$ docker compose down
  2. [OPTIONAL] Boot up containers without using cache

    # Re-build docker images without using cache
    ~/Crux/environments/cpu$ docker compose build --no-cache
    
    # Boot up docker containers
    ~/Crux/environments/cpu$ docker compose up -d
  3. Run the parse & upload script

    ~/crux-backend$ poetry run python src/scripts/upload_paper_data.py -p data/papers
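
Because the upload runs for a long time, a quick pre-flight check of the data directory before launching the script can catch misplaced files early. This is a rough sketch under stated assumptions: the function name `check_layout` is hypothetical, and it assumes each paper directory should contain exactly one `*_mathpix.txt` file, as in the tree shown in the prerequisites. It does not reproduce whatever validation `upload_paper_data.py` itself performs.

```python
from pathlib import Path


def check_layout(papers_root: Path) -> dict[str, list[str]]:
    """Map each conference directory to the paper directories that do not
    contain exactly one *_mathpix.txt file."""
    problems = {}
    for conf_dir in sorted(p for p in papers_root.iterdir() if p.is_dir()):
        bad = [
            paper_dir.name
            for paper_dir in sorted(p for p in conf_dir.iterdir() if p.is_dir())
            if len(list(paper_dir.glob("*_mathpix.txt"))) != 1
        ]
        problems[conf_dir.name] = bad
    return problems


if __name__ == "__main__":
    root = Path("data/papers")  # same path passed to the script via -p
    if not root.is_dir():
        print(f"{root} not found")
    else:
        for conference, bad_dirs in check_layout(root).items():
            status = "OK" if not bad_dirs else f"{len(bad_dirs)} problem(s): {bad_dirs[:5]}"
            print(f"{conference}: {status}")
```

An empty list for a conference means every paper directory under it matched the expected shape.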

Note for reviewers

[!Note] Parsing finishes in about a minute, but uploading the embedding vectors to Qdrant Cloud takes a few hours.

[!Caution] Running the test incurs charges for the OpenAI API's embedding model.
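
To get a feel for the size of that charge before running the test, the amount of text to be embedded can be estimated roughly. The sketch below is only a back-of-the-envelope aid with several assumptions: it uses the common rule of thumb of about 4 characters per token instead of a real tokenizer, it assumes the full text of every `*_mathpix.txt` file is embedded once (the actual script may chunk, filter, or overlap text), and the price per million tokens is a parameter you must look up yourself. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: roughly 4 characters per token for English text (not an exact tokenizer).
CHARS_PER_TOKEN = 4


def estimate_tokens(papers_root: Path) -> int:
    """Approximate the total token count of all *_mathpix.txt files under papers_root."""
    total_chars = sum(
        len(path.read_text(encoding="utf-8", errors="ignore"))
        for path in papers_root.rglob("*_mathpix.txt")
    )
    return total_chars // CHARS_PER_TOKEN


def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into a cost, given a price you supply."""
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    root = Path("data/papers")
    if root.is_dir():
        tokens = estimate_tokens(root)
        print(f"Approximate tokens to embed: {tokens:,}")
        # Pass the current price of the embedding model you use, for example:
        # print(estimate_cost_usd(tokens, usd_per_million_tokens=...))
    else:
        print(f"{root} not found")
```

Treat the result as an order-of-magnitude figure only; the real bill depends on the model, the tokenizer, and how the script splits each paper.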