AlignmentResearch / tuned-lens

Tools for understanding how transformer predictions are built layer-by-layer
https://tuned-lens.readthedocs.io/en/latest/
MIT License
437 stars, 47 forks

Moved data shuffling to just before tokenization step #110

levmckinney closed 1 year ago

levmckinney commented 1 year ago

Shuffling the full dataset just before tokenization is particularly useful for datasets like togethercomputer/RedPajama-Data-1T-Sample that do not come pre-shuffled. The local shuffling done by the dataloader is often insufficient for datasets this large.
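
To illustrate why a local shuffle can fall short, here is a minimal, stdlib-only sketch. It is not the code from this PR: the toy corpus, `buffer_shuffle`, `global_shuffle`, and the whitespace `tokenize` are all hypothetical stand-ins, and the buffer size of 16 is an arbitrary assumption. It compares a fixed-size shuffle buffer against a full shuffle applied just before tokenization, on data that arrives sorted by source.

```python
import random
from typing import Iterable, Iterator, List


def buffer_shuffle(items: Iterable[str], buffer_size: int, rng: random.Random) -> Iterator[str]:
    """Local shuffle: an item can only be emitted while it sits in a small buffer."""
    buffer: List[str] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before emitting anything
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item...
        buffer[idx] = item       # ...and put the new item in its slot
    rng.shuffle(buffer)
    yield from buffer            # drain what is left at the end


def global_shuffle(items: Iterable[str], rng: random.Random) -> List[str]:
    """Global shuffle: every item can land anywhere in the output."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


def tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a real tokenizer."""
    return text.split()


def frac_source_a(docs: List[str], n: int = 500) -> float:
    """Fraction of the first n documents that come from source_a."""
    return sum(doc.startswith("source_a") for doc in docs[:n]) / n


# Toy corpus that is NOT pre-shuffled: all of source_a, then all of source_b.
corpus = [f"source_a doc {i}" for i in range(1000)]
corpus += [f"source_b doc {i}" for i in range(1000)]

rng = random.Random(0)
local_order = list(buffer_shuffle(corpus, buffer_size=16, rng=rng))
global_order = global_shuffle(corpus, rng)

# Shuffle first, then tokenize, so the tokenized stream is already well mixed.
tokenized = [tokenize(doc) for doc in global_order]

local_frac = frac_source_a(local_order)    # exactly 1.0: the buffer never reaches source_b
global_frac = frac_source_a(global_order)  # close to 0.5: both sources are mixed
print(f"local: {local_frac:.2f}, global: {global_frac:.2f}")
```

With the local shuffle, the first 500 documents all come from `source_a`, because the 16-item buffer has only seen the first few hundred inputs by then. With the Hugging Face `datasets` library, the analogous full shuffle would be `Dataset.shuffle(seed=...)`; whether this PR uses that exact call is not stated here.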

codecov[bot] commented 1 year ago

Codecov Report

Merging #110 (7b3172c) into main (51e988b) will decrease coverage by 0.05%. Report is 6 commits behind head on main. The diff coverage is 71.42%.

Warning: Current head 7b3172c differs from the pull request's most recent head 21db0bf. Consider uploading reports for commit 21db0bf to get more accurate results.

@@            Coverage Diff             @@
##             main     #110      +/-   ##
==========================================
- Coverage   81.08%   81.04%   -0.05%     
==========================================
  Files          32       32              
  Lines        2141     2147       +6     
==========================================
+ Hits         1736     1740       +4     
- Misses        405      407       +2     
Files Changed                        Coverage Δ
tuned_lens/scripts/ingredients.py    85.56% <71.42%> (-0.63%)