Closed: sfluegel05 closed this issue 2 weeks ago
This came up during #48, but needs to be handled separately since it is not a unit test.

The scenario we should avoid is that someone changes the tokens.txt files that determine the mapping between input tokens (e.g. [H]) and their numbers (e.g. 149). If tokens.txt is changed, models trained on data from before the change will not work with data from after the change.

Task
Given the scenario outlined, I was wondering if we should also check the embedding offset and the other offsets defined in the reader file for consistency. Since these offsets influence the mapping between input tokens and their corresponding numbers, any change to them could likewise break the compatibility of models trained on data from before the change with data processed afterwards.
chebai/preprocessing/reader.py
EMBEDDING_OFFSET = 10
PADDING_TOKEN_INDEX = 0
MASK_TOKEN_INDEX = 1
CLS_TOKEN = 2
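To make the concern concrete, here is a minimal sketch of how such constants interact with tokens.txt. This is an assumption about the mechanism for illustration only, not the actual reader implementation: it assumes a token's id is its line number in tokens.txt plus EMBEDDING_OFFSET, with the first ids reserved for special tokens.

```python
# Minimal sketch of the assumed token-to-id mapping; the real reader in
# chebai/preprocessing/reader.py may differ in the details.
EMBEDDING_OFFSET = 10  # ids 0-9 reserved for special tokens (padding, mask, CLS, ...)


def load_token_index(path: str = "tokens.txt") -> dict[str, int]:
    """Map each token to an id derived from its line number plus the offset."""
    with open(path, encoding="utf-8") as f:
        tokens = [line.strip() for line in f if line.strip()]
    return {token: i + EMBEDDING_OFFSET for i, token in enumerate(tokens)}


# If a line is inserted, removed or reordered in tokens.txt, or if EMBEDDING_OFFSET
# changes, a token such as [H] no longer maps to the id (e.g. 149) an existing model
# was trained with, so old checkpoints silently look up the wrong embeddings.
```

Under that assumption, changing any of the offsets above has the same effect as editing tokens.txt, which is why they should be covered by the same consistency check.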
It makes sense to test if these parameters change. However, Martin and I are not sure if this is possible with GitHub actions. @aditya0by0 could you give that a try?
Yes, you are right, there is no direct way to achieve this, but there is a workaround we can try:
- We can create a Python script (export_constants.py) that reads the constants from the relevant Python files and exports them into a JSON file (a sketch of such a script is shown below).
- In the GitHub Actions workflow, we can run this script, then read the JSON file and load these values as environment variables.
- This allows us to verify the constants against expected values directly within the workflow.
This method not only lets us check the current constants in the readers but also provides flexibility to incorporate additional constants for verification in the future if needed.
Note: this JSON file will be temporary; it only exists during the workflow execution and is not added to the repository.

I have implemented this approach in the Dummy PR, please check.
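For reference, a rough sketch of what such an export script could look like. The actual script is the one committed in PR #63; the output filename reader_constants.json and the exact structure are illustrative assumptions, while the module path and constant names are taken from chebai/preprocessing/reader.py above.

```python
# export_constants.py -- illustrative sketch, not the version committed in PR #63.
import json

from chebai.preprocessing import reader

# Constants whose values must stay stable for previously trained models to remain usable.
constants = {
    "EMBEDDING_OFFSET": reader.EMBEDDING_OFFSET,
    "PADDING_TOKEN_INDEX": reader.PADDING_TOKEN_INDEX,
    "MASK_TOKEN_INDEX": reader.MASK_TOKEN_INDEX,
    "CLS_TOKEN": reader.CLS_TOKEN,
}

if __name__ == "__main__":
    # Write the current values to a temporary JSON file for the workflow to read;
    # the file is generated during the job and never committed to the repository.
    with open("reader_constants.json", "w") as f:
        json.dump(constants, f, indent=2)
```

A later workflow step can then load this JSON, expose the values as environment variables, and compare them against the expected values, failing the job if any constant has changed.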
To me, this looks like a sensible solution. You can go ahead with that.
The changes are already committed in PR #63. Please review and merge.