Refactor data-related code & Add BOS-token option

bokyeong1015 commented 6 months ago

Description

Refactor data-related code
Add BOS-token option (to support Gemma)
- related: transformers/issues/29250, lm-evaluation-harness/pull/1465

The PPL evaluation now writes two CSV files (results/$MODEL_NAME/ppl)
- ppl.csv: BOS token is added only to the first segment; previous implementation (add_bos_to_every=False)
- ppl_bos.csv: BOS token is added to every segment; new implementation (add_bos_to_every=True)
- https://github.com/Nota-NetsPresso/shortened-llm/blob/f1c913188207dea0f5a90d4957786abe3addf2b0/src/dataset.py#L38-L61
- Example scripts for Gemma will be uploaded in a separate PR
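
For readers who want a quick picture of the two modes, here is a minimal sketch of how the segments could differ. It is not the code in src/dataset.py: the function name `segment_with_bos`, its signature, and the choice to drop an incomplete trailing segment are assumptions made for illustration only; `add_bos_to_every` is the only name taken from this PR.

```python
from typing import List


def segment_with_bos(
    token_ids: List[int],
    seq_len: int,
    bos_id: int,
    add_bos_to_every: bool = False,
) -> List[List[int]]:
    """Split a flat token stream into fixed-length segments (hypothetical helper).

    add_bos_to_every=False: BOS is prepended once to the whole stream, so only
    the first segment starts with BOS.
    add_bos_to_every=True: every segment starts with BOS and carries
    seq_len - 1 content tokens.
    An incomplete trailing segment is dropped (an assumption of this sketch).
    """
    if seq_len < 2:
        raise ValueError("seq_len must be at least 2 to hold BOS plus content")

    if add_bos_to_every:
        body_len = seq_len - 1  # one slot per segment is reserved for BOS
        n_segments = len(token_ids) // body_len
        return [
            [bos_id] + list(token_ids[i * body_len:(i + 1) * body_len])
            for i in range(n_segments)
        ]

    stream = [bos_id] + list(token_ids)  # BOS appears once, at the very start
    n_segments = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_segments)]


if __name__ == "__main__":
    ids = list(range(10, 20))  # ten dummy token ids
    print(segment_with_bos(ids, seq_len=4, bos_id=1, add_bos_to_every=False))
    print(segment_with_bos(ids, seq_len=4, bos_id=1, add_bos_to_every=True))
```

With add_bos_to_every=True each segment holds one fewer content token, so the two settings cover the text with a different number of segments and their PPL values are not directly comparable.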

lifelongeeek commented 6 months ago

@bokyeong1015 Most of the changes in this PR LGTM. I have two suggestions:

As a kind of unit test, could you provide an example setting and result that reproduce the metrics reported in the paper or technical reports?
Why don't we make add_bos_to_every=True the default? My understanding is that add_bos_to_every=False is kept to reproduce the past paper results; however, this setting causes inconsistent evaluation across the samples in a batch (i.e., BOS is added only to the first sample).

bokyeong1015 commented 6 months ago

Example scripts and results are provided. Also, the reproducibility of the current PR version has been carefully checked.
Good comment. The current repo is primarily aimed at reproducing the paper; however, I strongly agree with your feedback and will consider it for future updates :)