2dot71mily / youtube_captions_corrections


Model checkpoints available and suggestions on language model architecture? #2

Open pablogranolabar opened 3 years ago

pablogranolabar commented 3 years ago

Hi! I am exploring sentence transformers for a visual scene detection application, to correct automated closed captioning based on what is found in the analyzed video frame. For example, if the video frame depicts a man moving his head but the automated video caption states "man moving hand", the idea is to use computer vision-based methods to provide context for a language model, which then corrects the caption to "man moving head".
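To sketch the vision side of this, something like the following could collect per-frame labels to use as context. The model name, threshold, and file path are just illustrative placeholders, not anything from this repo:

```python
from PIL import Image
from transformers import pipeline

# Illustrative model choice; any object-detection / scene-classification model
# that emits text labels for a frame would work the same way here.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def frame_context_labels(frame_path: str, min_score: float = 0.8) -> list[str]:
    """Return detected object labels for a frame extracted at a caption timestamp."""
    detections = detector(Image.open(frame_path))
    return sorted({d["label"] for d in detections if d["score"] >= min_score})

# e.g. frame_context_labels("frame_at_00_01_23.jpg") -> ["person", ...]
# These labels would seed the replacement-word dictionary described below.
```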

So the thought was to train a language model on your dataset, then to somehow tokenize or provide as context the labels from a vision transformer or object detection pipeline, which analyzes the video frame at each caption's timestamp and performs scene identification / object detection on that frame. Some of the sentence correction models out there use token masking to determine the best "fit" from a dictionary of proposed replacement words; the idea would be to populate that dictionary with context retrieved from the vision models.
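For the masking step, a minimal sketch of that candidate-restricted fill-in might look like this (the BERT checkpoint and the candidate words are illustrative assumptions, not part of this repo):

```python
from transformers import pipeline

# Illustrative masked LM; the candidate list plays the role of the
# "dictionary" populated from the vision pipeline for this frame.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

caption_with_mask = "man moving [MASK]"     # suspect word replaced by the mask token
vision_candidates = ["head", "face", "arm"]  # hypothetical labels from the frame

# `targets` restricts scoring to the supplied candidate words
ranked = fill_mask(caption_with_mask, targets=vision_candidates)
best = max(ranked, key=lambda r: r["score"])
print(best["token_str"], best["score"])  # the vision-supported word the LM finds most plausible
```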

Any ideas on what the ideal language model architecture would be for this? And have you made any language model checkpoints available from this dataset?

Thanks in advance!

2dot71mily commented 2 years ago

Sorry that I’m just seeing this.

No model checkpoints yet. I just updated the README with a sample model for a sample task, as well as a link to a version of the dataset added to the Hugging Face Datasets library for easy access.
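If it helps, pulling it through the Datasets library is roughly the following; the identifier below is assumed from the repo name, so double-check it against the README link:

```python
from datasets import load_dataset

# Identifier assumed from the repo name; see the README for the exact
# Hub name and any config/split options.
ds = load_dataset("youtube_caption_corrections", split="train")
print(ds[0])  # inspect one record's fields
```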

I don’t have a ton of experience training LLMs, and I’m not sure I understand your use case, but here are a few ideas that might be helpful…

Sorry, that probably wasn’t super helpful, and it’s probably too late. It’d be great to hear what directions you took / are taking.