benkio / sBots

Collection of Scala bots based on different characters

Show AutoImport #461

Open benkio opened 5 months ago

benkio commented 5 months ago

Context

At the moment of writing, when a new show has to be added to the project, an INSERT command needs to be added to the related bot migration. This is tedious and not scalable, especially for bots with active characters such as Alessandro Barbero or Xah Lee.

Goal

We want a way to automatically import a list of YouTube videos, starting from a channel or a playlist.

Approach

Since we ultimately want the data in the DB, we can extend the botDB code to insert the shows from a JSON file, as we already do for files. That JSON should be autogenerated by a Scala script using tools such as yt-dlp to download all the necessary video metadata and pack it into a friendly JSON schema.
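As a sketch, the generated JSON could be a flat array of show records. The field names below are an assumption, mirroring the metadata yt-dlp exposes for a video:

```json
[
  {
    "show_url": "https://www.youtube.com/watch?v=wHbjzOZApGo",
    "show_title": "Il mio video in Autostrada",
    "show_upload_date": "20190205",
    "show_duration": 1161,
    "show_description": "…",
    "show_is_live": false
  }
]
```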

benkio commented 2 months ago

Investigation

Let's recap the investigation after experimenting with the available tools.

Goal

Automatically fill the show table per bot. The idea is to have a configuration listing the bots and the corresponding YouTube channels/playlists where the videos are published. The code will then:

Get Video Metadata

yt-dlp exposes the -j option to get the JSON of a particular video, showing all its metadata. Example: yt-dlp -j https://www.youtube.com/watch?v=wHbjzOZApGo

It contains the necessary information: url, title, description, duration, upload_date. Extraction example using jq:

bash-5.2$ yt-dlp -j https://www.youtube.com/watch?v=wHbjzOZApGo | jq '. | {show_url: .webpage_url, show_title: .title, show_upload_date: .upload_date, show_duration: .duration, show_description: .description }'

{
  "show_url": "https://www.youtube.com/watch?v=wHbjzOZApGo",
  "show_title": "Il mio video in Autostrada ‘’ha fatto la storia di YouTube’’",
  "show_upload_date": "20190205",
  "show_duration": 1161,
  "show_description": "Salve cari amici followers , vi saluto con la mia solita passione e vi affido questo mio ultimo video , dove leggo quattro messaggi che mi sono arrivati: dei quali, uno in particolare merita molto per il suo equilibrio e per la sua forza . Buona visione 😀"
}
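The same mapping the jq filter performs can also be sketched in plain code. Here is a minimal Python version (the eventual implementation would be a Scala script; the show_* keys mirror the jq output above and are only an assumed schema):

```python
import json

def to_show(meta: dict) -> dict:
    """Map one yt-dlp '-j' metadata object onto the show fields.

    The keys read from `meta` are yt-dlp's field names; the show_*
    keys are an assumption about the eventual DB schema.
    """
    return {
        "show_url": meta["webpage_url"],
        "show_title": meta["title"],
        "show_upload_date": meta["upload_date"],
        "show_duration": meta["duration"],
        "show_description": meta.get("description", ""),
    }

# Abbreviated sample of yt-dlp's single-video JSON output:
meta = json.loads(
    '{"webpage_url": "https://www.youtube.com/watch?v=wHbjzOZApGo",'
    ' "title": "Il mio video", "upload_date": "20190205",'
    ' "duration": 1161, "description": "Salve"}'
)
print(to_show(meta)["show_title"])  # → Il mio video
```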

Get Channel/Playlist Metadata

This can be extended to a whole channel or playlist by using the -J option with the channel URL. The output will be huge, but the per-video result from the single-file command is embedded inside a collection in the JSON, so all the information is there to be properly parsed and extracted.

Here's an example command that works for Barbero's playlist. The JSON was first obtained through yt-dlp using:

yt-dlp -J 'https://youtube.com/playlist?list=PL7lQFvEjqu8OBiulbaSNnlCtlfI8Zd7zS&si=5yXXFQk025DZctuE'

and then parsed using jq this way:

cat barbero.json | jq 'del(..|nulls) | [.entries[] | {show_url: .webpage_url, show_title: .title, show_upload_date: .upload_date, show_duration: .duration, show_description: .description, show_is_live: .is_live, show_origin_automatic_caption: .automatic_captions | with_entries(if (.key|test(".*orig")) then ( {key: .key, value: .value } ) else empty end)[][] | select(.ext|contains("json")) | .url }]'

For YouTube channels, instead, you need an extra filter at the beginning, as the videos are wrapped in an additional entries level. The JSON is obtained using the following command: yt-dlp -J https://www.youtube.com/@youtuboancheio1365

and then the result is parsed using jq this way:

cat youtubo.json | jq 'del(..|nulls) | .entries[] | select(.title|contains("Videos")) | [.entries[] | {show_url: .webpage_url, show_title: .title, show_upload_date: .upload_date, show_duration: .duration, show_description: .description, show_is_live: .is_live, show_origin_automatic_caption: .automatic_captions | with_entries(if (.key|test(".*orig")) then ( {key: .key, value: .value } ) else empty end)[][] | select(.ext|contains("json")) | .url }]'

To also extract shorts and live streams from the YouTube channel, repeat the above command with a different value in the select(... | contains(...)) filter.
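Putting the above together, the per-bot configuration could simply map each bot to its sources. A minimal sketch (the bot keys are hypothetical; the URLs are the ones from the examples above):

```json
{
  "aBarberoBot": {
    "playlists": [
      "https://youtube.com/playlist?list=PL7lQFvEjqu8OBiulbaSNnlCtlfI8Zd7zS"
    ]
  },
  "youTuboAncheIoBot": {
    "channels": [
      "https://www.youtube.com/@youtuboancheio1365"
    ]
  }
}
```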

Extra

Automatic Captions

The JSON for a single video also contains an automatic_captions section where another JSON is downloadable. It contains the full auto-generated transcript of the video! So another column could be added to the table with the captions in plain text, after being properly extracted from the second JSON. Then, the command that searches the database for shows could be extended to search the captions too!

Command to extract the auto captions from the JSON downloaded in the previous step: cat f.txt | jq '[.events[] | select(.segs != null) | .segs[] | .utf8]'
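Flattening those fragments into one plain-text string (for the hypothetical captions column) can be sketched as follows. The "events"/"segs"/"utf8" field names come from the caption JSON shape used in the jq command above; everything else is an assumption:

```python
import json

def captions_to_text(caption_json: str) -> str:
    """Flatten an auto-caption payload into a single plain-text string.

    Walks the "events" array, keeps only events that carry "segs",
    and concatenates every "utf8" fragment.
    """
    data = json.loads(caption_json)
    fragments = [
        seg["utf8"]
        for event in data.get("events", [])
        if event.get("segs")
        for seg in event["segs"]
    ]
    return "".join(fragments).strip()

# Tiny sample payload in the same shape:
sample = '{"events":[{"tStartMs":0},{"segs":[{"utf8":"Salve "},{"utf8":"amici"}]}]}'
print(captions_to_text(sample))  # → Salve amici
```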

Is Live

Many videos can be live recordings; we may be interested in saving that information.
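If both extras are kept, the bot migrations could grow two columns. A hedged sketch (the actual table and column names in the project may differ):

```sql
-- Hypothetical migration: persist the live flag and the flattened auto captions.
ALTER TABLE show ADD COLUMN show_is_live BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE show ADD COLUMN show_captions TEXT;
```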