galv closed 3 years ago
Another commit. Much more stable now:
Initial debug of DSAlign.
I've been running
docker run -p 0.0.0.0:8080:8080 -it -v ~/lingvo-copy/:/development/lingvo-source galvasr2:latest /bin/bash
Followed by ./galvasr2/align/align.sh
Note that I had to download the data with this:
gsutil cp gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA/Highway_and_Hedges_Outreach_MinistriesShow-_Show_49/Highway_and_Hedges_Outreach_MinistriesShow-_Show_49.mp3
gsutil cp gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA/Highway_and_Hedges_Outreach_MinistriesShow-_Show_49/Highway_and_Hedges_Outreach_MinistriesShow-_Show_49.asr.srt
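The two downloads above share a bucket prefix and item name, so they can be generated from one small helper. This is only a sketch: `fetch_commands` and `DEST_DIR` are hypothetical names, and the `./data` destination is an assumption, since the commands above do not say where the files should land. The script prints the `gsutil cp` commands rather than running them, so it can be checked without cloud credentials.

```shell
#!/bin/sh
set -eu

# Bucket prefix and item name are copied from the commands above.
BUCKET_PREFIX="gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA"
ITEM="Highway_and_Hedges_Outreach_MinistriesShow-_Show_49"

# ASSUMPTION: the destination directory is hypothetical; point it at
# wherever align.sh expects its input.
DEST_DIR="${DEST_DIR:-./data}"

# Print one "gsutil cp" command per file (audio and transcript) for an item.
fetch_commands() {
  item="$1"
  dest="$2"
  for ext in mp3 asr.srt; do
    printf 'gsutil cp %s/%s/%s.%s %s/\n' \
      "$BUCKET_PREFIX" "$item" "$item" "$ext" "$dest"
  done
}

fetch_commands "$ITEM" "$DEST_DIR"
```

Once the printed commands look right, create the destination with `mkdir -p` and pipe the output to `sh` to actually run the copies.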
Use Max's method of creating per-document language models. Note that it doesn't work yet due to missing pip dependencies. Working on it.
Delete accidentally committed ~ files.
More summary: https://100k-hours.slack.com/archives/C01FGD1CCRZ/p1606809076013400 (It's corecursive!)
The motivation is that we are accumulating a lot of difficult-to-understand dependencies: the Apache Spark plugins for interacting with Google Cloud buckets and TFRecords, plus DSAlign, which is its own mess.
DSAlign isn't 100% there yet. It's complicated by the fact that we have to do an "editable" install within a dockerfile.
@greg1232 @agnusmaximus