PKU-Alignment / align-anything

Align Anything: Training All-modality Model with Feedback
Apache License 2.0

Add Benchmarks #47

Open ChangranXU opened 4 weeks ago

ChangranXU commented 4 weeks ago

Description

Add AGIEval, C-Eval, TMMLU, and SST-2 benchmarks with vLLM.
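
For reviewers' context, the new benchmarks run generation through vLLM's offline batch API. Below is a minimal, hypothetical sketch of that multiple-choice flow; the model name, prompt, and answer parsing are placeholders, not this PR's actual code:

```python
# Minimal sketch of vLLM-based multiple-choice evaluation.
# Illustrative only: the model name, prompt, and answer parsing are placeholders
# and do not mirror align-anything's evaluation framework.
from vllm import LLM, SamplingParams

prompts = [
    "Question: 1 + 1 = ?\nA. 1\nB. 2\nC. 3\nD. 4\nAnswer:",
]
gold = ["B"]

llm = LLM(model="Qwen/Qwen2-7B-Instruct")               # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=8)  # greedy, short answers

outputs = llm.generate(prompts, params)
predictions = [out.outputs[0].text.strip()[:1].upper() for out in outputs]

accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(f"accuracy: {accuracy:.2%}")
```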

Motivation and Context

Following the documentation, I added more benchmarks as requested.

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Checklist

Go over all the following points, and put an x in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!

Reindulger commented 4 weeks ago

Our evaluation framework has just been updated, so please adjust the benchmarks a little to fit the new framework.

ChangranXU commented 4 weeks ago

Our evaluation framework has just been updated, so please adjust the benchmarks a little to fit the new framework.

Sure, I have made the modification.

Moved all YAML files to align_anything/configs/evaluation/benchmarks.

Thanks for your advice.

XuyaoWang commented 4 weeks ago

@ChangranXU It shows here that there are unresolved file conflicts in your current PR.

ChangranXU commented 4 weeks ago

@ChangranXU It shows here that there are unresolved file conflicts in your current PR.

It is the main file. As required by the documentation, it has to be modified after adding the additional benchmarks.

ChangranXU commented 4 weeks ago

@ChangranXU It shows here that there are unresolved file conflicts in your current PR.

It conflicts because I modified the main branch of my fork last night and did not sync it today. I have made the correction locally and pushed the commit.

zmsn-2077 commented 3 weeks ago

Hi @ChangranXU, if you want us to review the PR content again, please comment promptly and tag @Reindulger again.

ChangranXU commented 2 weeks ago

Hi @ChangranXU, if you want us to review the PR content again, please comment promptly and tag @Reindulger again.

@Reindulger all newly added benchmarks have been validated.

ChangranXU commented 1 week ago

@Reindulger @zmsn-2077 All benchmarks have been validated on all tasks. The only problem is:

C-Eval | https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam | ['accountant', 'advanced_mathematics', 'art_studies', 'basic_medicine', 'business_administration', 'chinese_language_and_literature', 'civil_servant', 'clinical_medicine', 'college_chemistry', 'college_economics', 'college_physics', 'college_programming', 'computer_architecture', 'computer_network', 'discrete_mathematics', 'education_science', 'electrical_engineer', 'environmental_impact_assessment_engineer', 'fire_engineer', 'high_school_biology', 'high_school_chemistry', 'high_school_chinese', 'high_school_geography', 'high_school_history', 'high_school_mathematics', 'high_school_physics', 'high_school_politics', 'ideological_and_moral_cultivation', 'law', 'legal_professional', 'logic', 'mao_zedong_thought', 'marxism', 'metrology_engineer', 'middle_school_biology', 'middle_school_chemistry', 'middle_school_geography', 'middle_school_history', 'middle_school_mathematics', 'middle_school_physics', 'middle_school_politics', 'modern_chinese_history', 'operating_system', 'physician', 'plant_protection', 'probability_and_statistics', 'professional_tour_guide', 'sports_science', 'tax_accountant', 'teacher_qualification', 'urban_and_rural_planner', 'veterinary_medicine'] | test split without label

XuyaoWang commented 2 days ago

The official CEval dataset on Hugging Face includes labels. CEval uses the validation split for evaluation, as mentioned in their official repository. Furthermore, if you have any questions or requests, please don't hesitate to tag @XuyaoWang. I'll do my best to respond promptly.
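
A quick way to confirm which splits carry labels, assuming the mirrored dataset keeps the official subject configs and the answer column:

```python
# Inspect C-Eval splits for one subject and count labeled examples.
# Assumes the mirror follows the official ceval/ceval-exam layout
# (dev/val/test splits, `answer` column left empty on the test split).
from datasets import load_dataset

ds = load_dataset("zacharyxxxxcr/ceval-exam", "accountant")

for split in ds:
    answers = ds[split]["answer"]
    labeled = sum(1 for a in answers if a)  # empty string means no label
    print(f"{split}: {len(answers)} examples, {labeled} labeled")
```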

ChangranXU commented 2 days ago

The official CEval dataset on Hugging Face includes labels. CEval uses the validation split for evaluation, as mentioned in their official repository. Furthermore, if you have any questions or requests, please don't hesitate to tag @XuyaoWang. I'll do my best to respond promptly.

@XuyaoWang Yes, I follow the official repo and use the validation and dev splits. I raised the question only because the val and dev splits are very small compared to other benchmarks.

XuyaoWang commented 2 days ago

@XuyaoWang Yes, I follow the official repo and use the validation and dev splits. I raised the question only because the val and dev splits are very small compared to other benchmarks.

Thanks for your contribution. Does this mean that you have completed the integration of all benchmarks? If so, once you rebase onto the current main branch, we can proceed with the code review.

ChangranXU commented 2 hours ago

Thanks for your contribution. Does this mean that you have completed the integration of all benchmarks? If so, once you rebase onto the current main branch, we can proceed with the code review.

@XuyaoWang Done.