benkio opened this issue 5 months ago
Let's recap the investigation after playing around with some of the available tools.

The goal is to automatically fill the show table per bot. The idea is to have a configuration listing the bots and the corresponding YouTube channels/playlists where their videos are published. The code will then use the tools below to collect and parse the video metadata.
`yt-dlp` exposes the `-j` option to print the JSON of a particular video with all its metadata. Example:

```bash
yt-dlp -j https://www.youtube.com/watch?v=wHbjzOZApGo
```

It contains the necessary information: `url`, `title`, `description`, `duration`, `upload_date`.

Extraction example using `jq`:
```bash
bash-5.2$ yt-dlp -j https://www.youtube.com/watch?v=wHbjzOZApGo | jq '. | {show_url: .webpage_url, show_title: .title, show_upload_date: .upload_date, show_duration: .duration, show_description: .description }'
{
  "show_url": "https://www.youtube.com/watch?v=wHbjzOZApGo",
  "show_title": "Il mio video in Autostrada ‘’ha fatto la storia di YouTube’’",
  "show_upload_date": "20190205",
  "show_duration": 1161,
  "show_description": "Salve cari amici followers , vi saluto con la mia solita passione e vi affido questo mio ultimo video , dove leggo quattro messaggi che mi sono arrivati: dei quali, uno in particolare merita molto per il suo equilibrio e per la sua forza . Buona visione 😀"
}
```
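The same field mapping can be sketched in Python, assuming the `yt-dlp -j` output has already been parsed; `extract_show` and `sample` are illustrative names, not existing code:

```python
import json

# Sketch of the mapping from the yt-dlp single-video JSON to the show
# columns above. The keys used here are the ones visible in the
# `yt-dlp -j` output.
def extract_show(meta: dict) -> dict:
    return {
        "show_url": meta["webpage_url"],
        "show_title": meta["title"],
        "show_upload_date": meta["upload_date"],
        "show_duration": meta["duration"],
        "show_description": meta["description"],
    }

# Trimmed example of the JSON printed by `yt-dlp -j`:
sample = {
    "webpage_url": "https://www.youtube.com/watch?v=wHbjzOZApGo",
    "title": "Il mio video in Autostrada",
    "upload_date": "20190205",
    "duration": 1161,
    "description": "Salve cari amici followers ...",
}

print(json.dumps(extract_show(sample), indent=2))
```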
This can be extended to a whole channel or playlist by using the `-J` option with the channel URL. The output will be huge, but the per-video JSON from the single-file command is embedded in a collection inside it, so all the information is there to be properly parsed and extracted.
Here's an example that works for Barbero's playlist. The JSON was first obtained through `yt-dlp` (note the quotes around the URL, needed because it contains `&`):

```bash
yt-dlp -J 'https://youtube.com/playlist?list=PL7lQFvEjqu8OBiulbaSNnlCtlfI8Zd7zS&si=5yXXFQk025DZctuE'
```

and then parsed using `jq` this way:

```bash
cat barbero.json | jq 'del(..|nulls) | [.entries[] | {show_url: .webpage_url, show_title: .title, show_upload_date: .upload_date, show_duration: .duration, show_description: .description, show_is_live: .is_live, show_origin_automatic_caption: .automatic_captions | with_entries(if (.key|test(".*orig")) then ( {key: .key, value: .value } ) else empty end)[][] | select(.ext|contains("json")) | .url }]'
```
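For clarity, here is a Python sketch of what that `jq` playlist filter does, assuming the `-J` dump has already been parsed into a dict; `parse_playlist` is a hypothetical helper name and the sample data is fabricated with the shape `yt-dlp` produces:

```python
# Mirrors the jq filter: one row per playlist entry, keeping only the
# "*orig" automatic-caption track in a json format.
def parse_playlist(dump: dict) -> list[dict]:
    shows = []
    for entry in dump.get("entries", []):
        captions = entry.get("automatic_captions") or {}
        # Equivalent of test(".*orig") and select(.ext|contains("json"))
        orig_urls = [
            fmt["url"]
            for lang, fmts in captions.items() if lang.endswith("orig")
            for fmt in fmts if "json" in fmt.get("ext", "")
        ]
        shows.append({
            "show_url": entry.get("webpage_url"),
            "show_title": entry.get("title"),
            "show_upload_date": entry.get("upload_date"),
            "show_duration": entry.get("duration"),
            "show_description": entry.get("description"),
            "show_is_live": entry.get("is_live"),
            "show_origin_automatic_caption": orig_urls[0] if orig_urls else None,
        })
    return shows

# Minimal fabricated dump:
dump = {
    "entries": [{
        "webpage_url": "https://www.youtube.com/watch?v=abc",
        "title": "Example",
        "upload_date": "20200101",
        "duration": 60,
        "description": "desc",
        "is_live": False,
        "automatic_captions": {
            "it-orig": [{"ext": "json3", "url": "https://example.com/caption.json3"}],
            "en": [{"ext": "vtt", "url": "https://example.com/caption.vtt"}],
        },
    }]
}

print(parse_playlist(dump))
```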
For YouTube channels instead, an extra filter is needed at the beginning, because the videos are wrapped in an additional level of entries. The JSON is obtained using the following command:

```bash
yt-dlp -J https://www.youtube.com/@youtuboancheio1365
```

and then the result is parsed using `jq` this way:

```bash
cat youtubo.json | jq 'del(..|nulls) | .entries[] | select(.title|contains("Videos")) | [.entries[] | {show_url: .webpage_url, show_title: .title, show_upload_date: .upload_date, show_duration: .duration, show_description: .description, show_is_live: .is_live, show_origin_automatic_caption: .automatic_captions | with_entries(if (.key|test(".*orig")) then ( {key: .key, value: .value } ) else empty end)[][] | select(.ext|contains("json")) | .url }]'
```
To also extract shorts and live streams from a YouTube channel, repeat the above command with a different value in the `select(... | contains(...))` filter.
The JSON for a single video also contains an `automatic_captions` section pointing to another downloadable JSON. That file contains the full automatic transcript of the video! So another column could be added to the table containing the captions in plain text, after being properly extracted from that second JSON. Then, the command that searches the database for shows could be extended to search the captions too!
Command to extract the auto-caption text from the JSON downloaded in the previous steps:

```bash
cat f.txt | jq '[.events[] | select(.segs != null) | .segs[] | .utf8]'
```
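Joining those fragments into the plain-text column can be sketched in Python, mirroring the jq pipeline above (`events` → `segs` → `utf8`); `caption_text` is an illustrative name and the caption fragment is fabricated with the shape of YouTube's json3 captions:

```python
# Flatten the caption JSON into plain text, skipping events that carry
# no text segments (the select(.segs != null) step in jq).
def caption_text(caption_json: dict) -> str:
    return "".join(
        seg["utf8"]
        for event in caption_json.get("events", [])
        if event.get("segs")
        for seg in event["segs"]
    )

# Fabricated fragment:
caption = {
    "events": [
        {"tStartMs": 0},  # no "segs": skipped
        {"segs": [{"utf8": "ciao "}, {"utf8": "amici"}]},
    ]
}

print(caption_text(caption))  # → ciao amici
```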
Many videos may be live recordings; we may be interested in saving that information as well.
### Context
At the moment of writing, when a new show has to be added to the project, an `INSERT` statement needs to be added to the related bot migration. This is tedious and doesn't scale, especially for bots based on active personalities such as Alessandro Barbero or Xah Lee.

### Goal
We want a way to automatically import a list of YouTube videos, starting from a channel or a playlist.
### Approach
Since we ultimately want the data in the DB, we can extend the `botDB` code to insert the shows from a JSON, as we already do for files. That JSON should be autogenerated by a Scala script that uses tools such as `yt-dlp` to download all the necessary video metadata and pack it into a friendly JSON schema.
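The packing step could look something like the sketch below (in Python for brevity; per the approach above, the real script would be Scala). The schema, the `pack_shows` name, and the `"barberoBot"` bot name are all illustrative; the actual shape must match whatever the extended `botDB` loader expects:

```python
import json

# Wrap the extracted show rows in a top-level object keyed by bot, so a
# single file can be fed to the (hypothetical) botDB JSON importer.
def pack_shows(bot_name: str, shows: list[dict]) -> str:
    return json.dumps(
        {"botName": bot_name, "shows": shows},
        ensure_ascii=False,
        indent=2,
    )

shows = [{
    "show_url": "https://www.youtube.com/watch?v=abc",
    "show_title": "Example",
    "show_upload_date": "20200101",
    "show_duration": 60,
    "show_description": "desc",
}]

print(pack_shows("barberoBot", shows))
```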