Training data issues

Hello, I ran into some file format issues while training the model. I have a batch of my own clue and answer data that I want to train on, but I don't know what format the training scripts expect.

What format does the dataset need to be in for the following command? `bash train_scripts/biencoder/tfidf.sh path/to/dataset`

What are the specific formats of answers.jsonl and docs.jsonl?

python3 train_scripts/biencoder/get_tfidf_negatives.py \
--model path/to/dataset/tfidf/ \
--fills path/to/dataset/answers.jsonl \
--clues path/to/dataset/docs.jsonl \
--out path/to/dataset/ \
--no-len-filter

What data do train.json and validation.json contain? Are they built from the dataset posted on Hugging Face? The Hugging Face release is a CSV, whereas the command below requires JSON.
```
CUDA_VISIBLE_DEVICES=0 bash train_scripts/biencoder/train_bert.sh \
path/to/dataset/train.json \
path/to/validation/validation.json \
checkpoints/biencoder/
```
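For context, the CSV-to-JSON conversion itself is easy once the target fields are known. Here is a minimal sketch, assuming the CSV has a header row and that each row should become one JSON object with the same keys. The `clue` and `answer` column names and the `csv_to_jsonl` helper are my own placeholders, and I don't know whether this layout matches what the training scripts actually expect:

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text with a header row into JSON Lines, one object per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keys are taken from the CSV header unchanged; no schema is assumed here.
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in reader)


if __name__ == "__main__":
    # Hypothetical column names; the real ones depend on the dataset.
    sample = "clue,answer\nCapital of France,PARIS\nOpposite of hot,COLD\n"
    print(csv_to_jsonl(sample))
```

What I'm missing is which keys and nesting the scripts read, so that I can rename or restructure the fields accordingly.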

In summary, could you provide an example of the training file required at each step of the training process, so that we can convert our own data to the required format?

Thank you very much indeed.

albertkx / Berkeley-Crossword-Solver

Training data issues #9