LoveFishoO / 2024-KDD-WhoIsWho

Solution for 2024 KDD-WhoIsWho Top2
GNU General Public License v3.0

Introduction

Team: LoveFishO

Rank: 2

LoveFishO: Algorithm Engineer from NingBo

Report: https://openreview.net/pdf?id=oxOEqVH4tI

Architecture

[image: architecture diagram]

Prerequisites

Hardware device

Getting Started

Dir description

IND-WhoIsWho: Store raw data

out_data: Store intermediate data generated by the program

output: Store all result data
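The layout above can be checked before running any script. The following is a minimal sketch, assuming the three directories sit at the repository root; the helper name `ensure_layout` is hypothetical and not part of the repo.

```python
from pathlib import Path

RAW_DIR = "IND-WhoIsWho"                 # raw data, downloaded manually
GENERATED_DIRS = ("out_data", "output")  # written by the scripts


def ensure_layout(repo_root="."):
    """Create the generated-data directories if they are missing.

    Returns True when the raw-data directory already exists, False otherwise.
    ASSUMPTION: all three directories live directly under `repo_root`.
    """
    root = Path(repo_root)
    for name in GENERATED_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return (root / RAW_DIR).is_dir()
```

A False return value means the raw data still has to be downloaded into IND-WhoIsWho.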

Download the data from HERE. The access code is kvcq.

Clone this repo

git clone https://github.com/LoveFishoO/2024-KDD-WhoIsWho.git
cd 2024-KDD-WhoIsWho/

Installation

pip install -r requirements.txt

Run

Encode

Includes three embedding models:

  1. multilingual-e5-large-instruct (title, abstract, venue)
  2. voyage-large-2-instruct (title, abstract, venue)
  3. bge-m3 (orgs)

python3 encode.py --api_key "The api key of voyageai"
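The model-to-field assignment in the list above can be written down as a small lookup. This is only a sketch of how the text for each model might be collected: the record layout (a dict with "title", "abstract", "venue" and an "authors" list carrying "org") is an assumption about the raw data, the "; " separator is arbitrary, and `texts_for_model` is a hypothetical helper, not a function from encode.py.

```python
# Model -> encoded fields, as listed in this README.
MODEL_FIELDS = {
    "multilingual-e5-large-instruct": ("title", "abstract", "venue"),
    "voyage-large-2-instruct": ("title", "abstract", "venue"),
    "bge-m3": ("orgs",),
}


def texts_for_model(paper, model):
    """Return {field: text} for the fields that `model` encodes.

    ASSUMPTION: `paper` is a dict with optional "title", "abstract" and
    "venue" strings and an "authors" list of {"name": ..., "org": ...} dicts.
    Missing or empty values become empty strings.
    """
    out = {}
    for field in MODEL_FIELDS[model]:
        if field == "orgs":
            orgs = [a.get("org") or "" for a in paper.get("authors") or []]
            # Join the non-empty organisation strings (separator is an assumption).
            out[field] = "; ".join(o for o in orgs if o)
        else:
            out[field] = paper.get(field) or ""
    return out
```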

To reproduce the results, I recommend downloading the embedding data from HERE. The code is jd9u.

Run LGB & Save feature dataframe

cd LGB

For E5-Instruct

python3 ./e5_instruct_lgb.py

For Voyage-Instruct

python3 ./voyage_lgb.py

The feature data is saved in ./out_data

Build Graph data

cd ..
cd GCN

For E5-Instruct embedding + E5-Instruct LGB features

python3 ./build_graph.py \
    --title_embeddings_dir ../out_data/e5_instruct_title_data.pkl \
    --abstract_embeddings_dir ../out_data/e5_instruct_abstract_data.pkl \
    --venue_embeddings_dir ../out_data/e5_instruct_venue_data.pkl \
    --train_feats_dir ../out_data/e5_instruct_train.csv \
    --test_feats_dir ../out_data/e5_instruct_test.csv  \
    --save_train_dir ../out_data/e5_instruct_graph_train.pkl \
    --save_test_dir ../out_data/e5_instruct_graph_test.pkl

For E5-Instruct embedding + Voyage LGB features

python3 ./build_graph.py \
    --title_embeddings_dir ../out_data/e5_instruct_title_data.pkl \
    --abstract_embeddings_dir ../out_data/e5_instruct_abstract_data.pkl \
    --venue_embeddings_dir ../out_data/e5_instruct_venue_data.pkl \
    --train_feats_dir ../out_data/voyage_train.csv \
    --test_feats_dir ../out_data/voyage_test.csv  \
    --save_train_dir ../out_data/e5_instruct_embed_voyage_feats_graph_train.pkl \
    --save_test_dir ../out_data/e5_instruct_embed_voyage_feats_graph_test.pkl

For Voyage embedding + Voyage LGB features

python3 ./build_graph.py \
    --title_embeddings_dir ../out_data/voyage_title_data.pkl \
    --abstract_embeddings_dir ../out_data/voyage_abstract_data.pkl \
    --venue_embeddings_dir ../out_data/voyage_venue_data.pkl \
    --train_feats_dir ../out_data/voyage_train.csv \
    --test_feats_dir ../out_data/voyage_test.csv  \
    --save_train_dir ../out_data/voyage_graph_train.pkl \
    --save_test_dir ../out_data/voyage_graph_test.pkl

For Voyage embedding + E5-Instruct LGB features

python3 ./build_graph.py \
    --title_embeddings_dir ../out_data/voyage_title_data.pkl \
    --abstract_embeddings_dir ../out_data/voyage_abstract_data.pkl \
    --venue_embeddings_dir ../out_data/voyage_venue_data.pkl \
    --train_feats_dir ../out_data/e5_instruct_train.csv \
    --test_feats_dir ../out_data/e5_instruct_test.csv  \
    --save_train_dir ../out_data/voyage_embed_e5_instruct_feats_graph_train.pkl \
    --save_test_dir ../out_data/voyage_embed_e5_instruct_feats_graph_test.pkl
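The four commands above follow one naming pattern: the embedding source picks the three embedding files, the feature source picks the two CSV files, and the output name is either the shared source name or `<embed>_embed_<feats>_feats`. The sketch below generates the same argument lists from that pattern. The flags and paths are copied from the commands above; the helper names `graph_tag` and `build_graph_command` are hypothetical.

```python
def graph_tag(embed, feats):
    """Prefix used for the graph files of one embedding/feature combination."""
    return embed if embed == feats else f"{embed}_embed_{feats}_feats"


def build_graph_command(embed, feats, data_dir="../out_data"):
    """Argument list for build_graph.py, meant to be run from the GCN directory."""
    tag = graph_tag(embed, feats)
    cmd = ["python3", "./build_graph.py"]
    for field in ("title", "abstract", "venue"):
        cmd += [f"--{field}_embeddings_dir", f"{data_dir}/{embed}_{field}_data.pkl"]
    cmd += [
        "--train_feats_dir", f"{data_dir}/{feats}_train.csv",
        "--test_feats_dir", f"{data_dir}/{feats}_test.csv",
        "--save_train_dir", f"{data_dir}/{tag}_graph_train.pkl",
        "--save_test_dir", f"{data_dir}/{tag}_graph_test.pkl",
    ]
    return cmd


SOURCES = ("e5_instruct", "voyage")
# All four embedding x feature combinations.
ALL_COMMANDS = [build_graph_command(e, f) for e in SOURCES for f in SOURCES]
```

Each entry of `ALL_COMMANDS` can be passed to `subprocess.run(cmd, check=True)`; passing a list avoids the shell line-continuation handling entirely.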

Run GCN

For E5-Instruct embedding + E5-Instruct LGB features

python3 ./train.py \
    --train_dir ../out_data/e5_instruct_graph_train.pkl \
    --test_dir ../out_data/e5_instruct_graph_test.pkl \
    --save_result_dir ../output/e5_instruct_gcn.json

For E5-Instruct embedding + Voyage LGB features

python3 ./train.py \
    --train_dir ../out_data/e5_instruct_embed_voyage_feats_graph_train.pkl \
    --test_dir ../out_data/e5_instruct_embed_voyage_feats_graph_test.pkl \
    --save_result_dir ../output/e5_instruct_embed_voyage_feats_gcn.json

For Voyage embedding + Voyage LGB features

python3 ./train.py \
    --train_dir ../out_data/voyage_graph_train.pkl \
    --test_dir ../out_data/voyage_graph_test.pkl \
    --save_result_dir ../output/voyage_gcn.json

For Voyage embedding + E5-Instruct LGB features

python3 ./train.py \
    --train_dir ../out_data/voyage_embed_e5_instruct_feats_graph_train.pkl  \
    --test_dir ../out_data/voyage_embed_e5_instruct_feats_graph_test.pkl \
    --save_result_dir ../output/voyage_embed_e5_instruct_feats_gcn.json

Note: please train the model on a CPU.
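One way to make sure a training run stays on the CPU is to hide the GPUs from the process. This is a sketch, assuming train.py uses a CUDA-based framework that honours the `CUDA_VISIBLE_DEVICES` environment variable; the README does not say how train.py selects its device. The helper names `cpu_only_env` and `train_command` are hypothetical, while the flags and paths follow the commands above.

```python
import os


def cpu_only_env(base=None):
    """Copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty string: no GPU is visible to CUDA
    return env


def train_command(tag, data_dir="../out_data", out_dir="../output"):
    """Argument list for train.py; `tag` is the graph-file prefix, e.g. "voyage"."""
    return [
        "python3", "./train.py",
        "--train_dir", f"{data_dir}/{tag}_graph_train.pkl",
        "--test_dir", f"{data_dir}/{tag}_graph_test.pkl",
        "--save_result_dir", f"{out_dir}/{tag}_gcn.json",
    ]


# Usage, from the GCN directory:
#   import subprocess
#   subprocess.run(train_command("voyage"), env=cpu_only_env(), check=True)
```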

Inference

LGB

cd LGB

For E5-Instruct

python3 ./inference.py \
    --model e5_instruct \
    --test_path ../out_data/e5_instruct_lgb_test.csv \
    --test_author_path ../IND-WhoIsWho/ind_test_author_submit.json \
    --result_path ../output/e5_instruct_lgb.json \
    --model_dir ./lgb_model/

For Voyage-Instruct

python3 ./inference.py \
    --model voyage \
    --test_path ../out_data/voyage_lgb_test.csv \
    --test_author_path ../IND-WhoIsWho/ind_test_author_submit.json \
    --result_path ../output/voyage_lgb.json \
    --model_dir ./lgb_model/

GCN

For E5-Instruct embedding + E5-Instruct LGB features

python3 ./inference.py \
    --test_dir ../out_data/e5_instruct_graph_test.pkl \
    --model_path ./graph_model/e5_instruct_gcn_model.pt \
    --save_result_dir ../output/e5_instruct_gcn.json

For E5-Instruct embedding + Voyage LGB features

python3 ./inference.py \
    --test_dir ../out_data/e5_instruct_embed_voyage_feats_graph_test.pkl \
    --model_path ./graph_model/e5_instruct_embed_voyage_feats_gcn_model.pt \
    --save_result_dir ../output/e5_instruct_embed_voyage_feats_gcn.json

For Voyage embedding + Voyage LGB features

python3 ./inference.py \
    --test_dir ../out_data/voyage_graph_test.pkl \
    --model_path ../out_data/voyage_gcn_model.pt \
    --save_result_dir ../output/voyage_gcn.json

For Voyage embedding + E5-Instruct LGB features

python3 ./inference.py \
    --test_dir ../out_data/voyage_embed_e5_instruct_feats_graph_test.pkl  \
    --model_path ../out_data/voyage_embed_e5_instruct_feats_gcn_model.pt \
    --save_result_dir ../output/voyage_embed_e5_instruct_feats_gcn.json

Ensemble

cd ..
python3 ensemble.py
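The README does not describe what ensemble.py does internally. As an illustration only, a common way to combine several result files is a weighted average of the scores. The sketch below assumes each result is a nested dict of the form {author_id: {paper_id: score}} and that all results cover the same authors and papers. Both the format and the function `ensemble_scores` are assumptions, not the repo's implementation.

```python
def ensemble_scores(results, weights=None):
    """Weighted average of per-(author, paper) scores across result dicts.

    ASSUMPTION: every element of `results` has the shape
    {author_id: {paper_id: score}} with identical keys.
    """
    if not results:
        raise ValueError("need at least one result dict")
    if weights is None:
        weights = [1.0] * len(results)
    if len(weights) != len(results):
        raise ValueError("one weight per result dict is required")
    total = float(sum(weights))
    merged = {}
    for author, papers in results[0].items():
        merged[author] = {
            paper: sum(w * r[author][paper] for w, r in zip(weights, results)) / total
            for paper in papers
        }
    return merged
```

With equal weights this reduces to the plain mean of the model scores.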

Results on Test Set

Method                                                      AUC
LGB-Voyage                                                  0.81433
LGB-E5-Instruct                                             0.81827
GCN-E5-Instruct                                             0.78082
LGB(E5-Instruct/Voyage) x 2 + GCN(E5-Instruct/Voyage) x 4   0.82486
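For readers unfamiliar with the metric, the sketch below computes a plain AUC as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as half. This is the textbook definition only. How the competition aggregates AUC across authors is not stated in this README, so the function is not claimed to reproduce the numbers above.

```python
def auc(labels, scores):
    """Plain AUC: probability that a positive example outscores a negative one.

    `labels` holds 0/1 values; ties between a positive and a negative count 0.5.
    Quadratic in the number of examples, which is fine for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```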

Note: