This repo holds the files we use for video processing. Large files are managed with git-annex; everything else is committed to git as usual. It provides public, non-YouTube access to our videos and is also our working area for releasing them, so many of the instructions below are for people who help process the videos.
We also wrote a description of git-annex for data management, aimed at scientists and researchers, if you want to know what's going on behind the scenes.
Browse the repo - course links are below. More can be added later depending on demand.
Raw video data is stored with git-annex and synced between several places (our HPC cluster, the computers that process the videos, and Allas, the object store provided by CSC). You can download the videos you want from Allas:
$ git clone https://github.com/coderefinery/video-processing/
$ cd video-processing
$ git annex get python-for-scicomp-2023/out/day1.1-icebreaker.mkv
get python-for-scicomp-2023/out/day1.1-icebreaker.mkv (from allas...)
Only processed videos are available to the general public (the private raw ones are tracked by git-annex in this repo but are not available for download). Also, this is a test setup and everything may be subject to change or deprecation.
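To fetch all processed videos of one course rather than a single file, something like the following sketch might work. It assumes processed files live under COURSE/out/, as in the example above; the function name fetch_course_videos and the DRY_RUN variable are hypothetical, introduced only for illustration.

```shell
# Hypothetical helper: download every processed video of one course.
# Assumes processed files live under COURSE/out/, as in the example above.
fetch_course_videos() {
    course=$1
    # With DRY_RUN set to a non-empty value, print the command instead of running it.
    ${DRY_RUN:+echo} git annex get "$course/out/"
}

# Example (run inside the cloned repository):
#   fetch_course_videos python-for-scicomp-2023
```

Setting DRY_RUN=1 first is a cheap way to check the path before starting a large download.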
(How was this set up? Get the environment variables needed for the git-annex S3 special remote - I did this by running allas_conf on one of the CSC computers. Then run
git annex initremote allas type=S3 encryption=none chunk=50MiB embedcreds=no host=a3s.fi protocol=https bucket=aaltoscicomp-video publicurl=https://aaltoscicomp-video.a3s.fi/ fileprefix=1- public=yes autoenable=true
- it caches the authentication locally on that computer only, it doesn't spread to anywhere else.)
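git-annex's S3 special remote reads its credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. A pre-flight check before running initremote could be sketched as follows; the function name check_s3_credentials is hypothetical and not part of this repo.

```shell
# Hypothetical pre-flight check: succeed only if both variables that
# git-annex's S3 special remote reads are set and non-empty.
check_s3_credentials() {
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "S3 credentials missing: set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY" >&2
        return 1
    fi
}

# Example: only continue with the remote setup when the credentials are present.
#   check_s3_credentials && echo "ready to run git annex initremote"
```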
This repository stores the stuff used to process videos for CodeRefinery / Aalto Scientific Computing / etc(?). Here's how it works in general:
COURSE/raw/*.mkv
.srt
.srt files into subtitles of each sub-part. This allows us to parallelize the subtitle fixing and the video slicing.
git annex sync --content moves all content around as desired, making sure that the cluster has a full copy and other remotes have only what they have requested.
If you are helping with subtitle editing:
Find the COURSE/raw/*.srt subtitle file and edit it as follows:
I don't watch the video; I browse the text very quickly and assume the transcript is correct except where it is clearly written wrong. Think 5 minutes (or less) of skimming per hour of video if there are no changes. Focus only on the important parts that can affect understanding, not on making a perfect, presentation-quality transcript.
Remove all names and replace them with [name]. Find and replace can be useful here, but note that there may be misspellings too, so you may have to try several times as you come across other spellings.
Fix up any command names (for example, dash dash argument becomes --argument), capitalization, and other things that affect understanding.
If you can't understand what someone is trying to say, replace with [???] or similar.
But it doesn't have to be perfect. Getting it done fast is the most important thing. Normal speech doesn't have to be polished; do what makes sense (what is worth your time).
Various subtitle editor programs can make this easier, but it's also just a text file. I've used subtitleditor on Linux, which can play back the video right at each subtitle if you need to hear the original.
If you notice something very wrong (Whisper has broken, it's not adding punctuation, etc.), then don't try to fix it up; just leave it and make it minimally usable.
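The name replacement and the command-name fixes above can be partly scripted. The following is a minimal sketch, assuming GNU sed and grep; the helper names redact_names and find_spoken_dashes are hypothetical and not part of this repo.

```shell
# Hypothetical helper: replace every listed spelling of a name with [name].
# Assumes GNU sed (for \b word boundaries and the case-insensitive I flag).
# Usage: redact_names FILE.srt 'Alice|Alise|Alyce'
redact_names() {
    srt_file=$1
    spellings=$2   # spelling variants separated by |
    # Keeps the original as FILE.srt.bak in case a match was too broad.
    sed -E -i.bak "s/\b(${spellings})\b/[name]/gI" "$srt_file"
}

# Hypothetical helper: list lines where an option was transcribed as
# spoken words ("dash dash argument"), so they can be fixed by hand.
find_spoken_dashes() {
    grep -n -i 'dash dash' "$1"
}
```

Rerun redact_names with more variants as you come across other spellings. The spoken dashes are only listed, not rewritten, because the right replacement depends on the command being described.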
If you are volunteering to help generate the edit list:
Raw video files are private and only synced via our cluster.
Only do the setup below if you are pulling the private (raw) big video files to your own computer to view them. Otherwise, you can use git normally and the video files appear as broken symbolic links. For the final videos, you can get them using the public copy above.
Privacy notice: the git-annex information about which computers have which files is distributed publicly through the repository (including through GitHub). The information about your computer is its UUID and the MY-COMPUTER-NAME, both of which are in the repo.
To set up this repo to connect to the Triton cluster:
# first clone the repo from GitHub and cd into it
git remote add triton triton.aalto.fi:/scratch/scicomp/video-processing/
git config remote.triton.annex-shell /share/apps/git-annex/10.20230228.path/git-annex-shell
git annex init MY-COMPUTER-NAME # set up git-annex
git annex wanted . present # don't download everything, but keep what is here
git annex sync
git annex get python-for-scicomp-2023/raw/FILE.mkv