Closed: sfluegel05 closed this issue 2 weeks ago
This came up during #48, but needs to be handled separately since it is not a unit test.

The scenario we should avoid is that someone changes the tokens.txt files that determine the mapping between input tokens (e.g. [H]) and their numbers (e.g. 149). If tokens.txt is changed, models trained on data from before the change will not work with data from after the change.

Task
Given the scenario outlined, I was wondering if we should also check the embedding offset and the other offsets defined in the reader file for consistency. Since these offsets influence the mapping between input tokens and their corresponding numbers, any change to them could likewise break the compatibility of models trained on data from before the change with data processed afterwards.
chebai/preprocessing/reader.py
EMBEDDING_OFFSET = 10
PADDING_TOKEN_INDEX = 0
MASK_TOKEN_INDEX = 1
CLS_TOKEN = 2
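To make the concern concrete, here is a minimal sketch of how such constants interact with tokens.txt. This is an assumption about the mechanism for illustration only, not the actual reader implementation: it assumes a token's id is its line number in tokens.txt plus EMBEDDING_OFFSET, with the first ids reserved for special tokens.

```python
# Minimal sketch of the assumed token-to-id mapping; the real reader in
# chebai/preprocessing/reader.py may differ in the details.
EMBEDDING_OFFSET = 10  # ids 0-9 reserved for special tokens (padding, mask, CLS, ...)


def load_token_index(path: str = "tokens.txt") -> dict[str, int]:
    """Map each token to an id derived from its line number plus the offset."""
    with open(path, encoding="utf-8") as f:
        tokens = [line.strip() for line in f if line.strip()]
    return {token: i + EMBEDDING_OFFSET for i, token in enumerate(tokens)}


# If a line is inserted, removed or reordered in tokens.txt, or if EMBEDDING_OFFSET
# changes, a token such as [H] no longer maps to the id (e.g. 149) an existing model
# was trained with, so old checkpoints silently look up the wrong embeddings.
```

Under that assumption, changing any of the offsets above has the same effect as editing tokens.txt, which is why they should be covered by the same consistency check.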
It makes sense to test if these parameters change. However, Martin and I are not sure if this is possible with GitHub actions. @aditya0by0 could you give that a try?
Yes, you are right, there is no direct way to achieve this, but there is a workaround we can try:
- We can create a Python script (export_constants.py) that reads the constants from the relevant Python files and exports them into a JSON file (a sketch of such a script is shown below).
- In the GitHub Actions workflow, we can run this script, then read the JSON file and load these values as environment variables.
- This allows us to verify the constants against expected values directly within the workflow.
This method not only lets us check the current constants in the readers but also provides flexibility to incorporate additional constants for verification in the future if needed.
Note: this JSON file will be temporary; it only exists during the workflow execution and is not added to the repository.

I have implemented this approach in the Dummy PR, please check.
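For reference, a rough sketch of what such an export script could look like. The actual script is the one committed in PR #63; the output filename reader_constants.json and the exact structure are illustrative assumptions, while the module path and constant names are taken from chebai/preprocessing/reader.py above.

```python
# export_constants.py -- illustrative sketch, not the version committed in PR #63.
import json

from chebai.preprocessing import reader

# Constants whose values must stay stable for previously trained models to remain usable.
constants = {
    "EMBEDDING_OFFSET": reader.EMBEDDING_OFFSET,
    "PADDING_TOKEN_INDEX": reader.PADDING_TOKEN_INDEX,
    "MASK_TOKEN_INDEX": reader.MASK_TOKEN_INDEX,
    "CLS_TOKEN": reader.CLS_TOKEN,
}

if __name__ == "__main__":
    # Write the current values to a temporary JSON file for the workflow to read;
    # the file is generated during the job and never committed to the repository.
    with open("reader_constants.json", "w") as f:
        json.dump(constants, f, indent=2)
```

A later workflow step can then load this JSON, expose the values as environment variables, and compare them against the expected values, failing the job if any constant has changed.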
To me, this looks like a sensible solution. You can go ahead with that.
The changes are already committed in PR #63. Please review and merge.