Experiment plan for final results

GirinMan / HYU-Graduation-Project-Quantization

Repository for the graduation project of the Department of Computer Science, Hanyang University.

Apache License 2.0


Experiment plan for final results #22

Open GirinMan opened 1 year ago

GirinMan commented 1 year ago

Experiment plan

Which tuning method is the most memory-efficient?
Which quantization method gives the highest accuracy at inference time?

Comparison groups for memory usage during finetuning

Full finetuning
LoRA tuning
llm.int8() + LoRA tuning

Comparison groups for accuracy at inference time

Full finetuning + dynamic quantization
LoRA + dynamic quantization
llm.int8() + LoRA tuning

Task types

GLUE: MRPC, RTE, CoLA, STS-B
SAMSum (summarization)

Prompt construction per task

https://www.promptingguide.ai/introduction/examples

vsj951 commented 1 year ago

The GLUE benchmark allows at most 2 submissions per day, and the results can be checked right away.

GirinMan commented 1 year ago

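The GLUE benchmark allows at most 2 submissions per day, and the results can be checked right away.

@vsj951 Then with 2 submissions per person we should be able to submit 6 times a day in total... Could you find an example submission format (probably a JSON file) and try a test submission?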

vsj951 commented 1 year ago

Result of submitting the sample file: [screenshot: Cap 2023-04-08 19-27-15-148]

The third submission was blocked: [screenshot: Cap 2023-04-08 19-38-00-415]

plaire48 commented 1 year ago

MRPC (classification)

Template

Classify Text1 and Text2 into equivalent or not_equivalent.
Text1:
Text2:
Answer:

Example

Classify Text1 and Text2 into equivalent or not_equivalent. 
Text1: "PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So ."
Text2: "Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So ."
Answer:

RTE (classification)

Template

Classify Text1 and Text2 into entailment or not_entailment.
Text1:
Text2:
Answer:

Example

Classify Text1 and Text2 into entailment or not_entailment. 
Text1: "No Weapons of Mass Destruction Found in Iraq Yet."
Text2: "Weapons of Mass Destruction Found in Iraq."
Answer:

CoLA (classification)

Template

Classify the text into acceptable or unacceptable.
Text:
Answer:

Example

Classify the text into acceptable or unacceptable. 
Text: "Bill whistled past the house."
Answer:

STS-B (similarity)

Template

Write how similar text1 and text2 are with a real number between 0 and 5.
Text1:
Text2:
Answer:

Example

Write how similar text1 and text2 are with a real number between 0 and 5.
Text1: "A plane is taking off."
Text2: "An air plane is taking off."
Answer:

SAMSum (summarization)

Template

Summarize the text in one sentence.
Text:
Answer:

Example

Summarize the text in one sentence.
Text: "Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: I'll bring you tomorrow :-)"
Answer:
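
The templates above could be filled in programmatically along these lines. This is only a minimal sketch: the `TEMPLATES` dictionary, its keys, and the `build_prompt` helper are hypothetical names for illustration and are not part of the project code.

```python
# Prompt templates copied from the comment above. The dictionary keys and
# the build_prompt helper are hypothetical names chosen for this sketch.
TEMPLATES = {
    "mrpc": (
        "Classify Text1 and Text2 into equivalent or not_equivalent.\n"
        'Text1: "{text1}"\n'
        'Text2: "{text2}"\n'
        "Answer:"
    ),
    "rte": (
        "Classify Text1 and Text2 into entailment or not_entailment.\n"
        'Text1: "{text1}"\n'
        'Text2: "{text2}"\n'
        "Answer:"
    ),
    "cola": (
        "Classify the text into acceptable or unacceptable.\n"
        'Text: "{text}"\n'
        "Answer:"
    ),
    "stsb": (
        "Write how similar text1 and text2 are with a real number between 0 and 5.\n"
        'Text1: "{text1}"\n'
        'Text2: "{text2}"\n'
        "Answer:"
    ),
    "samsum": (
        "Summarize the text in one sentence.\n"
        'Text: "{text}"\n'
        "Answer:"
    ),
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill the template registered for `task` with the given text fields."""
    return TEMPLATES[task].format(**fields)


if __name__ == "__main__":
    print(build_prompt("cola", text="Bill whistled past the house."))
```

Keeping all templates in one table makes it easy to run every comparison group on exactly the same prompts.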

GirinMan commented 1 year ago

Content to include in the final presentation

Main topic: make it possible to finetune billion-scale LLMs effectively on consumer GPUs!
How the backward pass of llm.int8() and bitsandbytes works, and why it is ill-suited to finetuning
Benefits of freezing the original model when applying LoRA tuning
What sets bitsandbytes apart from PyTorch dynamic quantization: it uses the int8 compute support available on NVIDIA GPUs from the Turing architecture (RTX 20 series) onward
Effect of model size as presented in the original LLM.int8() paper
Difference in memory savings between the 1.3B and 13B models
How much memory the optimizer state uses
Memory saved by quantization -> does it allow a larger batch size?
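
To get a first feel for the 1.3B vs. 13B and optimizer-state questions above, a back-of-the-envelope estimate might look like the following. It is a rough sketch under loud assumptions, not a measurement: full finetuning is assumed to use fp32 weights with Adam (two optimizer states per parameter), the frozen base model is assumed to be stored in fp16 or int8, the LoRA adapter is assumed to be 0.1% of the base model size, and activations are ignored entirely. The function names are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimate_gib(frozen_params, frozen_bytes, trainable_params,
                 trainable_bytes=4, optimizer_states=2):
    """Rough lower bound in GiB: weights, gradients and optimizer states only.

    Activations, temporary buffers and framework overhead are ignored.
    """
    frozen = frozen_params * frozen_bytes
    # Each trainable parameter stores its weight, its gradient and
    # `optimizer_states` extra values (2 for Adam: first and second moment).
    trainable = trainable_params * trainable_bytes * (2 + optimizer_states)
    return (frozen + trainable) / GIB


def compare(n_params, lora_fraction=0.001):
    """Return estimates for the three finetuning setups in the plan.

    `lora_fraction` is an assumed adapter size relative to the base model.
    """
    lora_params = int(n_params * lora_fraction)
    return {
        "full_fp32": estimate_gib(0, 0, n_params),
        "lora_fp16_base": estimate_gib(n_params, 2, lora_params),
        "int8_base_lora": estimate_gib(n_params, 1, lora_params),
    }


if __name__ == "__main__":
    for n in (1_300_000_000, 13_000_000_000):
        print(n, {k: round(v, 2) for k, v in compare(n).items()})
```

Under these assumptions most of the full-finetuning footprint comes from gradients and optimizer states, which is the part LoRA removes; the actual numbers still need to be measured on the GPU.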

vsj951 commented 1 year ago

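Benefits of freezing the original model when applying LoRA tuning

1. Lower memory usage during finetuning by reducing the number of trainable parameters

In full finetuning every parameter of the original model is trained, so memory usage is very high. With LoRA tuning the original model's parameters are not trained; only the LoRA adapter parameters are, and there are far fewer of them than in the original model. Gradients and optimizer states are therefore needed only for the adapter, which greatly reduces memory usage during finetuning.

2. Easier sharing of finetuned models

Because the weights of the original model are frozen, the only difference between models finetuned on different tasks is the adapter weights. A finetuned model can therefore be used by saving and loading only the trained adapter weights instead of the whole model.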
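
The size gap between an adapter and the layer it adapts can be illustrated with a small count. This sketch assumes the standard LoRA formulation, in which a frozen weight of shape d_out x d_in gets a trainable low-rank update B·A with B of shape d_out x r and A of shape r x d_in. The hidden size 2048 and rank 8 are illustrative values, not the project's settings.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters of a dense linear layer weight (bias ignored)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of a LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d, r = 2048, 8  # illustrative values only
    full = full_param_count(d, d)
    lora = lora_param_count(d, d, r)
    print(f"full: {full}, lora: {lora}, ratio: {lora / full:.4%}")
```

Only the adapter parameters have to be stored per task, which is why sharing a finetuned model reduces to sharing a small file next to the common base model.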

vsj951 commented 1 year ago

Limitations for training and finetuning mentioned in the LLM.int8() paper

Last paragraph of Section 6, Discussion and Limitations

A final limitation is that we focus on inference but do not study training or finetuning. We provide an initial analysis of Int8 finetuning and training at scale in Appendix E. Int8 training at scale requires complex trade-offs between quantization precision, training speed, and engineering complexity and represents a very difficult problem. We again leave this to future work.

Appendix E: Int8 training results

[screenshot: Cap 2023-04-24 21-02-00-095]
Training was run on 209M and 1.1B models (starting from freshly initialized weights, not from a pretrained model).
Modules with 8-bit quantization applied: FFN (feed-forward network), Linear (attention projection layer), Attention.
Decomp is the fraction of parameters to which mixed-precision decomposition is applied; PPL (perplexity) is the model quality metric (lower is better).
FFN: almost no difference in quality from the unquantized baseline model.
Linear: almost no difference for the 209M model, but a meaningful gap appears for the 1.1B model.
Attention: quality drops considerably; applying mixed-precision decomposition recovers some of it, but not to the level of the original model.

Appendix F Table 9: Int8 finetuning results (LLM.int8() vs. other 8-bit quantization methods)

[screenshot: Cap 2023-04-24 21-02-29-063]
Target model: RoBERTa-large.
Only the feed-forward layers are quantized to 8 bits; mixed-precision decomposition is not applied, only vector-wise quantization.
LLM.int8() performs better than the other methods.

Appendix F Table 10: Int8 finetuning results (comparison under varying quantization settings)

[screenshot: Cap 2023-04-24 21-39-47-533]
Applying quantization to Linear (attention projection layer) causes a meaningful drop in performance.
Performance tends to improve as the Decomp (mixed-precision decomposition) ratio increases.

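Personal opinion

I think the Limitations section means that LLM.int8() was designed with a focus on inference and that its performance in training and finetuning was not considered in the process. In other words, rather than being unsuitable for training and finetuning, its performance there has not been verified and needs further study.
Looking at the experimental results, applying llm.int8() to training and finetuning does not seem that bad. To judge whether it is good or bad, factors other than accuracy, such as training speed, should also be analyzed.

*Note

attention projection layer: the linear layers in the attention layer that map the input vector to the Q, K, and V vectors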

vsj951 commented 1 year ago

Effect of model size as presented in the original LLM.int8() paper

Very large deep learning models produce outliers frequently. Applying conventional quantization methods that do not handle these outliers degrades accuracy considerably. llm.int8() applies mixed-precision decomposition: outliers are computed in full precision and values in the normal range are quantized, so there is almost no loss of accuracy.
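
The idea described above can be sketched in plain Python. This is a simplified illustration, not the bitsandbytes implementation: feature dimensions containing a value with magnitude at or above a threshold are treated as outliers and multiplied in floating point, while the remaining dimensions go through row-wise and column-wise absmax int8 quantization. The threshold of 6.0 and all function names are assumptions made for this sketch.

```python
def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` to 127."""
    m = max((abs(v) for v in values), default=0.0)
    return 127.0 / m if m > 0 else 1.0


def outlier_dims(X, threshold):
    """Feature dimensions (columns of X) holding at least one outlier."""
    return [k for k in range(len(X[0]))
            if any(abs(row[k]) >= threshold for row in X)]


def matmul_mixed(X, W, threshold=6.0):
    """Compute X @ W with int8 for regular dims and floats for outlier dims."""
    n, h, m = len(X), len(W), len(W[0])
    outliers = outlier_dims(X, threshold)
    regular = [k for k in range(h) if k not in outliers]
    # Vector-wise scales: one per row of X, one per column of W.
    sx = [absmax_scale([X[i][k] for k in regular]) for i in range(n)]
    sw = [absmax_scale([W[k][j] for k in regular]) for j in range(m)]
    # Quantize to integers in [-127, 127].
    Xq = [[round(X[i][k] * sx[i]) for k in regular] for i in range(n)]
    Wq = [[round(W[k][j] * sw[j]) for j in range(m)] for k in regular]
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Integer accumulation, then dequantize with both scales.
            acc = sum(Xq[i][t] * Wq[t][j] for t in range(len(regular)))
            # Outlier dimensions stay in floating point.
            fp = sum(X[i][k] * W[k][j] for k in outliers)
            out[i][j] = acc / (sx[i] * sw[j]) + fp
    return out


if __name__ == "__main__":
    X = [[0.5, -0.2, 8.0, 0.1], [0.3, 0.4, -7.5, -0.6]]
    W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.25], [0.1, -0.7]]
    print(matmul_mixed(X, W))
```

Because the outlier column is excluded before the scales are computed, the large values no longer stretch the quantization range of the remaining small values, which is the reason the decomposition preserves accuracy.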

vsj951 commented 1 year ago

Zero-shot accuracy of GPT-3 and OPT by model size

[screenshots: Cap 2023-04-25 00-32-52-351, Cap 2023-04-25 00-34-50-702]

Multi-shot accuracy of GPT-3 and OPT by model size

[screenshots: Cap 2023-04-25 00-35-04-434, Cap 2023-04-25 00-34-54-862]

*Reference: https://arxiv.org/pdf/2205.01068.pdf