VanyaBK / visual_ASR_EC

Dataset for the task of visual ASR Error Correction

list of search keywords #2

Closed jasonppy closed 1 year ago

jasonppy commented 1 year ago

Could you also release the list of search keywords that you used for constructing the dataset?

Also, I'd appreciate it if you could release the final dataset (e.g. in the form of vid and segment timestamps) that you used for the error correction experiments

Thanks

VanyaBK commented 1 year ago

We did not use any keywords to construct the dataset. We filtered the original datasets using a similarity threshold between the reference transcripts and the image captions, as mentioned in the paper. We downloaded each of the YouTube links using the youtube-dl toolkit, which provides the segment timestamps. We encourage you to read the dataset retrieval section of the paper for further details.
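For readers following along, the threshold filtering described above can be sketched as follows. This is only an illustration: the thread does not state which similarity measure or threshold the paper uses, so the token-level Jaccard overlap, the `threshold=0.2` default, and the `transcript`/`caption` field names are all assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two strings (assumed stand-in metric)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_by_similarity(samples, threshold=0.2):
    """Keep samples whose transcript and image caption overlap enough.

    `samples` is a list of dicts with hypothetical keys "transcript" and
    "caption"; `threshold` is an illustrative value, not the paper's.
    """
    return [
        s for s in samples
        if jaccard(s["transcript"], s["caption"]) >= threshold
    ]


samples = [
    {"transcript": "a man is cutting an onion",
     "caption": "a man cutting an onion on a board"},
    {"transcript": "welcome back to my channel",
     "caption": "a dog running on grass"},
]
kept = filter_by_similarity(samples)
print(len(kept))
```

With these toy samples only the first pair survives, since the second transcript shares no tokens with its caption.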

jasonppy commented 1 year ago

Thanks for your reply!

Regarding the keywords for constructing the dataset: in your paper, you describe the dataset collection as

The datasets were obtained from mainly two sources, the how2 dataset and the youtube videos. The how2 dataset consists of 300h of videos with annotated transcripts in English which resulted in 220,000 samples. The youtube videos were collected using the Youtube-DL toolkit, where each of these videos had annotated transcripts and audio in English. The youtube videos accounted for 2.5 million samples of data with annotated transcripts

The YouTube video part (How2 is also from YouTube; I point this out only to avoid confusing other readers) must first have been selected by some method, right? You can't just collect all the videos on the web. In a previous commit, you documented searching YouTube videos with keywords https://github.com/VanyaBK/visual_ASR_EC/tree/5a86c37e6cac0339dbfc9a2a073c06c654dc416b/youtube-downloader#scripts, which is why I inferred that you used this approach for constructing the dataset.

VanyaBK commented 1 year ago

That's right, the keywords were used only to obtain the links, which are now listed in the youtube_links file. The youtube_links file does not contain the how2 data, which was obtained by contacting the authors of the how2 paper.

jasonppy commented 1 year ago

How did you obtain the keywords?

VanyaBK commented 1 year ago

The keywords used are the categories of the HowTo100M dataset.
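As a rough sketch of how category names could be turned into link searches, the snippet below builds youtube-dl command lines using its `ytsearchN:` search prefix and the `--get-id` flag. It only constructs the arguments and does not run them. The helper name `build_search_command`, the result count, and the example keywords are hypothetical placeholders, not the authors' actual script or category list.

```python
from typing import List


def build_search_command(keyword: str, max_results: int = 5) -> List[str]:
    """Build a youtube-dl command that prints the video IDs for a keyword search.

    Uses youtube-dl's "ytsearchN:<query>" prefix; nothing is executed here.
    """
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    return ["youtube-dl", "--get-id", f"ytsearch{max_results}:{keyword}"]


# Placeholder keywords for illustration only.
keywords = ["cooking", "gardening"]
commands = [build_search_command(k, max_results=3) for k in keywords]
for cmd in commands:
    print(" ".join(cmd))
```

Each resulting list can be passed to `subprocess.run` on a machine where youtube-dl is installed.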

jasonppy commented 1 year ago

got it, thanks for your patience!