This pull request should hopefully solve #60 and correct some issues with the nats_to_bpb ratio calculation.
First, I believe #60 was actually happening during the computation of the nats_to_bpb ratio and not during the actual tokenization step. Tokenization and nats_to_bpb ratio computation have now been combined into a single function, hopefully resolving #60. This combination also helps correct a bias in the nats_to_bpb ratio calculation that was previously caused by discarding the final batch.
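For reviewers, here is a rough sketch of the idea, not the code in this PR. It assumes the ratio is defined as (tokens per byte) / ln 2, so that multiplying a per-token loss in nats by it gives bits per byte. The names `tokenize_with_ratio`, `batched`, and `toy_tokenize` are hypothetical and only serve the illustration. The point is that token and byte counts are accumulated in the same pass as tokenization, and the final, shorter batch is counted rather than dropped.

```python
import math
from typing import Callable, Iterator, List, Sequence, Tuple


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive batches; the last one may be shorter and is kept."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def tokenize_with_ratio(
    texts: Sequence[str],
    tokenize: Callable[[str], List[str]],  # hypothetical tokenizer interface
    batch_size: int = 2,
) -> Tuple[List[List[str]], float]:
    """Tokenize every text and compute the nats-to-bpb ratio in one pass."""
    tokenized: List[List[str]] = []
    total_tokens = 0
    total_bytes = 0
    for batch in batched(texts, batch_size):
        for text in batch:
            tokens = tokenize(text)
            tokenized.append(tokens)
            total_tokens += len(tokens)
            total_bytes += len(text.encode("utf-8"))
    # Assumed definition: nats/token * tokens/byte = nats/byte,
    # and dividing by ln(2) converts nats to bits.
    nats_to_bpb = (total_tokens / total_bytes) / math.log(2)
    return tokenized, nats_to_bpb


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for the example: split on whitespace."""
    return text.split()


texts = ["ab cd", "ef", "gh ij kl"]  # 3 texts, batch_size=2 -> partial last batch
tokenized, ratio = tokenize_with_ratio(texts, toy_tokenize, batch_size=2)
```

With three texts and a batch size of two, the last batch holds a single text, and its tokens and bytes still contribute to the ratio.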