2dot71mily / youtube_captions_corrections


Model checkpoints available and suggestions on language model architecture? #2

Open pablogranolabar opened 3 years ago

pablogranolabar commented 3 years ago

Hi! I am exploring sentence transformers for a visual scene detection application, to correct automated closed captioning based on what is found in the analyzed video frame. For example, if the video frame depicts a man moving his head but the automated video caption states "man moving hand", the idea is to use computer vision-based methods to provide context for a language model, which then corrects the caption to "man moving head".
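To sketch the vision side of this, something like the following could collect per-frame labels to use as context. The model name, threshold, and file path are just illustrative placeholders, not anything from this repo:

```python
from PIL import Image
from transformers import pipeline

# Illustrative model choice; any object-detection / scene-classification model
# that emits text labels for a frame would work the same way here.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def frame_context_labels(frame_path: str, min_score: float = 0.8) -> list[str]:
    """Return detected object labels for a frame extracted at a caption timestamp."""
    detections = detector(Image.open(frame_path))
    return sorted({d["label"] for d in detections if d["score"] >= min_score})

# e.g. frame_context_labels("frame_at_00_01_23.jpg") -> ["person", ...]
# These labels would seed the replacement-word dictionary described below.
```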

So the thought was to train a language model on your dataset, then to somehow tokenize or provide as context the labels from a vision transformer or object detection pipeline, which analyzes the video frame at each caption's timestamp and performs scene identification / object detection on that frame. Some of the sentence correction models out there use token masking to determine the best "fit" from a dictionary of proposed replacement words; the idea would be to populate that dictionary with context retrieved from the vision models.
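For the masking step, a minimal sketch of that candidate-restricted fill-in might look like this (the BERT checkpoint and the candidate words are illustrative assumptions, not part of this repo):

```python
from transformers import pipeline

# Illustrative masked LM; the candidate list plays the role of the
# "dictionary" populated from the vision pipeline for this frame.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

caption_with_mask = "man moving [MASK]"     # suspect word replaced by the mask token
vision_candidates = ["head", "face", "arm"]  # hypothetical labels from the frame

# `targets` restricts scoring to the supplied candidate words
ranked = fill_mask(caption_with_mask, targets=vision_candidates)
best = max(ranked, key=lambda r: r["score"])
print(best["token_str"], best["score"])  # the vision-supported word the LM finds most plausible
```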

Any ideas on what the ideal language model architecture would be for this? And have you made any language model checkpoints available from this dataset?

Thanks in advance!

2dot71mily commented 2 years ago

Sorry that I’m just seeing this.

No model checkpoints yet. I just updated the README with a sample model for a sample task, as well as a link to a version of the dataset added to the Hugging Face Datasets library for easy access.
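If it helps, pulling it through the Datasets library is roughly the following; the identifier below is assumed from the repo name, so double-check it against the README link:

```python
from datasets import load_dataset

# Identifier assumed from the repo name; see the README for the exact
# Hub name and any config/split options.
ds = load_dataset("youtube_caption_corrections", split="train")
print(ds[0])  # inspect one record's fields
```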

I don’t have a ton of experience training LLMs, and I’m not sure I understand your use case, but here are a few ideas that might be helpful…

Sorry, that probably wasn’t super helpful, and it’s probably too late. It’d be great to hear what directions you took / are taking.