Open stamm1989 opened 1 year ago
With today's release of SentenceTransformers V3, is this issue fixed? I was looking into using an iterable dataset as well, with Ray Train, and was wondering whether the SentenceTransformerTrainer will work seamlessly.
Thank you for the update. I have not been able to verify whether it works now; I moved on to using other packages.
I'm currently trying to finetune the "bertje" model. I expect to have a large dataset that exceeds the working memory of the machine I'm using. After some reading I found that torch.utils.data.IterableDataset would be the solution for this, potentially in combination with the webdataset format.
However, the SentenceTransformer.fit function retrieves the length of the dataset a couple of times via len(dataloader), which by design is unavailable, since we do not know the length in advance:
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L629
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/model_card_templates.py#L162
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/SentenceTransformer.py#L656
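To illustrate the conflict (a plain-Python sketch, no torch required): anything that supports iteration but not `__len__` raises a TypeError under `len()`, and that is exactly the contract of a DataLoader wrapped around an IterableDataset:

```python
# A generator-backed loader, like a DataLoader over an IterableDataset,
# can be iterated but has no length until it is consumed:
def stream_batches():
    for i in range(3):
        yield [f"example {i}"]  # produced lazily; total count unknown

# Counting requires consuming the stream:
print(sum(1 for _ in stream_batches()))  # prints 3

# What fit() effectively does up front fails on such a loader:
try:
    len(stream_batches())
except TypeError as err:
    print("len() is unsupported:", err)
```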
I've also noticed that there is a custom dataloader implementation, but it does not seem to be a solution either, since it also requires loading the entire dataset into memory:
https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/datasets/NoDuplicatesDataLoader.py
Could this be supported? Or is it already possible, provided I write my own custom dataloader class?
Small code snippet:
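(The original snippet was not captured in this thread. As a stand-in, here is a minimal sketch of the custom-streaming-loader route, with illustrative names, not part of sentence-transformers: it batches examples lazily so the full dataset never sits in memory, but like a real DataLoader over an IterableDataset it deliberately has no `__len__`.)

```python
class StreamingDataset:
    """Stand-in for torch.utils.data.IterableDataset: yields examples
    one at a time, e.g. read from webdataset shards on disk."""
    def __iter__(self):
        for i in range(5):  # in practice: stream from files
            yield {"text": f"voorbeeldzin {i}"}

class StreamingLoader:
    """Batches a streaming dataset without materializing it; like a
    DataLoader over an IterableDataset, it defines no __len__."""
    def __init__(self, dataset, batch_size=2):
        self.dataset = dataset
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        for example in self.dataset:
            batch.append(example)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # flush the final partial batch
            yield batch

loader = StreamingLoader(StreamingDataset(), batch_size=2)
for batch in loader:
    print(len(batch))  # prints 2, 2, 1
```

Passing such a loader to fit() still fails at the internal len(dataloader) calls linked above, which is why upstream support (or a known-steps workaround) is needed.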