dartmouth-cs98-23f / project-short-learning

project-short-learning created by GitHub Classroom

Webscraper #161

Closed: linkevin281 closed this 6 months ago

linkevin281 commented 6 months ago

Webscraping

Background

This webscraper will select videos to scrape either by { list of topics } or by { url }. The script will run directly on our stream server. Please build a simple single-endpoint API around it that accepts a URL and scrapes that URL.
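One possible shape for the single endpoint, as a minimal sketch using only the Python standard library. The `/scrape` path, the `{"url": ...}` JSON body, and the `scrape` callable are all assumptions for illustration; the issue does not specify a framework or request format.

```python
import json
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlparse


def handle_scrape_request(body: bytes, scrape) -> tuple[int, dict]:
    """Validate a JSON body like {"url": "..."} and run the scraper on it.

    `scrape` is a caller-supplied function (hypothetical) that takes a URL
    and returns the name of the folder it wrote to.
    """
    try:
        payload = json.loads(body or b"{}")
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body must be JSON"}
    url = payload.get("url") if isinstance(payload, dict) else None
    if not isinstance(url, str) or urlparse(url).scheme not in ("http", "https"):
        return 400, {"error": "expected an http(s) 'url' field"}
    return 200, {"status": "scraped", "folder": scrape(url)}


def make_handler(scrape):
    """Build a request handler class that exposes POST /scrape (assumed path)."""

    class ScrapeHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/scrape":
                status, response = 404, {"error": "not found"}
            else:
                length = int(self.headers.get("Content-Length", 0))
                status, response = handle_scrape_request(self.rfile.read(length), scrape)
            data = json.dumps(response).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return ScrapeHandler
```

To serve it, something like `HTTPServer(("127.0.0.1", 8000), make_handler(my_scraper)).serve_forever()` (with `HTTPServer` imported from `http.server`) should be enough. Keeping the validation in a plain function makes it easy to test without opening a socket.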

TL;DR: As long as it can scrape videos for those specific topics AND get the transcript (even if one doesn't already exist), we're good. It should also store metadata.

Requirements

Create a webscraper that can take a few topics, go to YouTube, and identify a few candidate videos for those topics. Those videos should be downloaded to our stream server. Additionally, it should be able to take a URL as a parameter and process that video as well (it will not have topic information in this case, which is OK).
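The two entry points (topics or a direct URL) could feed one list of jobs, roughly as sketched below. `ScrapeJob`, `plan_jobs`, and the `search` callable are hypothetical names; `search(topic, limit)` stands in for whatever YouTube search the scraper ends up using and is assumed to return a list of video URLs.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class ScrapeJob:
    url: str
    topics: list[str] = field(default_factory=list)  # stays empty for direct-URL jobs


def plan_jobs(
    topics: Iterable[str] = (),
    url: Optional[str] = None,
    search: Optional[Callable[[str, int], list[str]]] = None,
    per_topic: int = 3,
) -> list[ScrapeJob]:
    """Turn topics and/or a direct URL into a de-duplicated list of jobs."""
    jobs: dict[str, ScrapeJob] = {}  # keyed by URL, insertion order preserved
    if url:
        jobs[url] = ScrapeJob(url)  # no topic information in this case
    for topic in topics:
        if search is None:
            raise ValueError("topic scraping needs a search function")
        for found in search(topic, per_topic)[:per_topic]:
            job = jobs.setdefault(found, ScrapeJob(found))
            if topic not in job.topics:
                job.topics.append(topic)  # one video can match several topics
    return list(jobs.values())
```

A video that shows up under several topics is downloaded once and carries all of its topics, which later feeds the Tags/topics field of the metadata.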

There should be two folders on the stream server for storing videos: /NonProcessed and /Processed. Each downloaded YouTube video should get its own folder inside /NonProcessed (ex. /NonProcessed/1).

Inside /NonProcessed/1 there should be metadata.json, video.mp4, and transcript.txt. If no transcript exists for the video, please figure out some way to extract one.
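The folder layout could be set up along these lines. This is a sketch under a few assumptions: `allocate_video_folder` is a hypothetical helper, folders are numbered sequentially, and numbers already used under /Processed are skipped so a name is never reused after a video moves.

```python
from pathlib import Path


def allocate_video_folder(root: Path) -> dict[str, Path]:
    """Create the next numbered folder under NonProcessed and return its paths."""
    non_processed = root / "NonProcessed"
    processed = root / "Processed"
    non_processed.mkdir(parents=True, exist_ok=True)
    processed.mkdir(parents=True, exist_ok=True)

    # Look at both folders so a number is not handed out twice.
    used = [
        int(p.name)
        for parent in (non_processed, processed)
        for p in parent.iterdir()
        if p.is_dir() and p.name.isdecimal()
    ]
    folder = non_processed / str(max(used, default=0) + 1)
    folder.mkdir()
    return {
        "folder": folder,
        "metadata": folder / "metadata.json",
        "video": folder / "video.mp4",
        "transcript": folder / "transcript.txt",
    }
```

The function only creates the directory and reports where the three files belong; downloading the video and producing the transcript are left to the scraper itself.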

This metadata.json file should tell us:

  1. Tags/topics
  2. Author
  3. Title
  4. Whatever other metadata that the scraper allows you to extract (more is better for now)
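A metadata.json with those fields might be written like this. The key names (`topics`, `author`, `title`, `extra`) and the shape of the `info` dict are assumptions; the real keys depend on what the chosen scraper returns.

```python
import json
from pathlib import Path


def write_metadata(folder: Path, info: dict, topics: list[str]) -> Path:
    """Write metadata.json into a video's folder and return its path.

    `info` is whatever dict the scraper produced (hypothetical shape);
    everything besides author and title is kept under "extra".
    """
    metadata = {
        "topics": topics,  # empty list when the video came from a direct URL
        "author": info.get("author"),
        "title": info.get("title"),
        "extra": {k: v for k, v in info.items() if k not in ("author", "title")},
    }
    path = folder / "metadata.json"
    # default=str keeps the dump from failing on dates or other non-JSON values.
    path.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False, default=str),
        encoding="utf-8",
    )
    return path
```

Keeping the unrecognized fields under one key follows the "more is better for now" spirit without committing to a schema yet.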

Finally, there should be a tobeprocessed.txt file in the /NonProcessed folder. Please append the name of each new folder (ex. 1) to this file. The format doesn't really matter; it's more to have this scaffolding in place so we can change it to a pipeline queue later.
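Since the format is open, one folder name per line is probably the simplest thing that works. A sketch, with `enqueue_folder` and `pending_folders` as hypothetical helper names:

```python
from pathlib import Path


def enqueue_folder(non_processed: Path, folder_name: str) -> None:
    """Append one folder name (e.g. "1") to tobeprocessed.txt, one per line."""
    queue = non_processed / "tobeprocessed.txt"
    with queue.open("a", encoding="utf-8") as f:  # "a" creates the file if missing
        f.write(folder_name + "\n")


def pending_folders(non_processed: Path) -> list[str]:
    """Read the queued folder names back in the order they were appended."""
    queue = non_processed / "tobeprocessed.txt"
    if not queue.exists():
        return []
    lines = queue.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Going through two small functions means the file can later be swapped for a real pipeline queue without touching the callers.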

Use #133 and #145 as a baseline.

Cross-Team

Colton developed the stream server infra.