AnswerDotAI / bert24

Adding the function to compute the actual number of non-padding tokens #78

Closed NohTow closed 3 days ago

NohTow commented 4 days ago

Changes

This PR adds a custom token-counting method to compute the actual number of non-padding tokens. This is helpful for getting a more accurate estimate of the number of tokens actually processed, and for precisely setting the training-token limits for our different mixtures.
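For illustration, here is a minimal sketch of what such a counting function could look like, assuming the collator emits an `attention_mask` with 1s on real tokens and 0s on padding (the function name is mine, not necessarily what the PR uses):

```python
import torch

def get_num_tokens_in_batch_unpadded(batch: dict) -> int:
    # Count only non-padding positions: the attention_mask is 1 for
    # real tokens and 0 for padding, so its sum is the true token count.
    return int(batch["attention_mask"].sum())

# Example: a batch of 2 sequences padded to length 5 has 8 real tokens.
batch = {"attention_mask": torch.tensor([[1, 1, 1, 0, 0],
                                         [1, 1, 1, 1, 1]])}
assert get_num_tokens_in_batch_unpadded(batch) == 8
```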

I am not sure the implementation is the cleanest possible. I wanted to add it directly to the build_dataloader function, but that would have required passing the full configuration to it instead of just the corresponding data config, plus adding a parameter to specify whether we were building the eval or train data loader. Also, I am only adding it when the padding parameter is set to unpadded, which is not the case for MosaicBERT. Maybe we want to use this function in every case anyway; feel free to tell me and I'll make the changes.
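As a sketch of the wiring described above, assuming the repo uses Composer's `DataSpec` to override batch token counting (the helper name and the exact padding check are illustrative, not the PR's actual code):

```python
from composer.core import DataSpec

def wrap_train_dataloader(dataloader, padding: str):
    # Hypothetical: attach the custom counter only in the unpadded
    # setting; MosaicBERT-style padded runs keep the default counting.
    if padding == "unpadded":
        return DataSpec(
            dataloader,
            get_num_tokens_in_batch=get_num_tokens_in_batch_unpadded,
        )
    return dataloader
```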

Tests

I did not run the tests, as this only changes the token counting (which is not tested at the moment).

warner-benjamin commented 4 days ago

Can you turn this into an option that can be toggled on/off via the config, set to false by default? We need to make sure we can match the token count of existing ablations.

NohTow commented 3 days ago

I added this as a parameter of build_dataloader, read as a top-level option named count_padding_tokens in the config file. The default value, both for the function parameter (when it is not set, e.g. when creating the eval data loader) and for the config option (when it is not specified), is True, i.e., the previous behavior of counting the padding tokens.
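A sketch of how that plumbing might look; everything except the count_padding_tokens name is illustrative, and the call-site read assumes an omegaconf-style cfg.get:

```python
from composer.core import DataSpec

def attach_token_counter(dataloader, count_padding_tokens: bool = True):
    # count_padding_tokens=True keeps the previous behavior (padding
    # tokens included in the count); False switches to the non-padding
    # counter sketched earlier in the thread.
    if count_padding_tokens:
        return dataloader
    return DataSpec(
        dataloader,
        get_num_tokens_in_batch=get_num_tokens_in_batch_unpadded,
    )

# Hypothetical call site: the flag is read once from the top level of
# the config, defaulting to True when unspecified.
# train_loader = attach_token_counter(
#     train_loader,
#     count_padding_tokens=cfg.get("count_padding_tokens", True),
# )
```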