-
Hi,
Thanks for providing the pre-training database with Foldseek tokens! I am having difficulty downloading the dataset and using it with Hugging Face functions. Trying
```
from datasets import load_da…
-
I notice that the percentages in the split dataset do not match the expected proportions. It seems the split is made before the patches are filtered by `min_annot_perc`.
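To illustrate what I mean, here is a minimal sketch in plain Python. It is not the repository's code: the function names, the `annot_perc` field, and the patch dictionaries are assumptions made up for the example; only `min_annot_perc` is a name taken from the project. It contrasts splitting before filtering with filtering before splitting.

```python
import random


def split_then_filter(patches, train_frac, min_annot_perc, seed=0):
    """Split first, filter afterwards (the order I suspect is used now).

    Each side loses a different number of patches to the filter, so the
    final train/test ratio can drift away from train_frac.
    """
    rng = random.Random(seed)
    shuffled = list(patches)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]
    # "annot_perc" is a hypothetical field holding the annotated percentage.
    train = [p for p in train if p["annot_perc"] >= min_annot_perc]
    test = [p for p in test if p["annot_perc"] >= min_annot_perc]
    return train, test


def filter_then_split(patches, train_frac, min_annot_perc, seed=0):
    """Filter first, split afterwards.

    The ratio is computed over the patches that survive the filter, so it
    matches train_frac up to rounding.
    """
    rng = random.Random(seed)
    kept = [p for p in patches if p["annot_perc"] >= min_annot_perc]
    rng.shuffle(kept)
    cut = int(len(kept) * train_frac)
    return kept[:cut], kept[cut:]
```

With the filter applied first, the train share is fixed by `train_frac` over the surviving patches; with the split applied first, it depends on how many patches each side happens to lose.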
-
Hi, I have a few questions about the workflow combining [llm-recipes](https://github.com/Nicolas-BZRD/llm-recipes) with [llm-distillation](https://github.com/Nicolas-BZRD/llm-distillation) to c…
-
I am training a BPE tokenizer from scratch and using the `Split` pre-tokenizer. In the example below, I split on each digit so that numbers are represented by the sequence of digits they are made of.
```…
-
### Reproducing the behavior
Running this script with `bend run-c`, I can parse a file with up to 2\*\*15 elements using the following code, but I get the beginning of the list to print and then a se…
-
Hi, I want to know the train/test split used for the TikTok and UBCFashion datasets. Could you please describe how you split these public academic datasets into training and testing sets?
According to https://githu…
-
Good job! I want to follow your work and compare against it. Could you please provide the three datasets as you split them?
-
Hi, thank you for your interesting work.
I have a question: when trying your code, I could not find a description of the training, evaluation, and testing data split. I would like to ask your fa…
-
Hi, could you share the file list of the training set (9100 pairs) and the testing set (4900 pairs)?
-
Hello, I would like to ask some questions about dataset partitioning. What are the simple, medium, and complex datasets?