galv / lingvo-copy

Apache License 2.0

Daniel/dockerfile setup #12

Closed: galv closed this 3 years ago

galv commented 3 years ago

The motivation for this is that we are accumulating a lot of hard-to-understand dependencies: the Apache Spark plugins for interacting with Google Cloud Storage buckets and TFRecords, as well as DSAlign, which is its own mess.

DSAlign isn't 100% there yet. It's complicated by the fact that we have to do an "editable" install within a Dockerfile.

@greg1232 @agnusmaximus

galv commented 3 years ago

Another commit. Much more stable now:

Initial debug of DSAlign.

I've been running

docker run -p 0.0.0.0:8080:8080 -it -v ~/lingvo-copy/:/development/lingvo-source galvasr2:latest /bin/bash

Followed by ./galvasr2/align/align.sh

Note that I had to download the data with this:

gsutil cp gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA/Highway_and_Hedges_Outreach_MinistriesShow-_Show_49/Highway_and_Hedges_Outreach_MinistriesShow-_Show_49.mp3

gsutil cp gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA/Highway_and_Hedges_Outreach_MinistriesShow-_Show_49/Highway_and_Hedges_Outreach_MinistriesShow-_Show_49.asr.srt
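As written, the two `gsutil cp` commands above have no destination argument, which `gsutil cp` requires. A minimal sketch of the download step follows, assuming a local destination directory `./data` (a hypothetical choice, not taken from the original) and a `DRY_RUN` switch (also hypothetical) that only prints the commands by default. The bucket path and document name are copied from the commands above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Bucket prefix and document name, copied from the gsutil commands above.
BUCKET_PREFIX="gs://the-peoples-speech-west-europe/archive_org/Nov_6_2020/ALL_CAPTIONED_DATA"
DOC="Highway_and_Hedges_Outreach_MinistriesShow-_Show_49"

# Assumption: local destination directory. The original does not say where
# the files were copied to.
DEST_DIR="${DEST_DIR:-./data}"

# Assumption: DRY_RUN=1 (the default here) prints each command instead of
# running it, so the script can be inspected without gsutil installed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run mkdir -p "$DEST_DIR"

# Fetch the audio file and its caption file for the one test document.
for ext in mp3 asr.srt; do
  run gsutil cp "$BUCKET_PREFIX/$DOC/$DOC.$ext" "$DEST_DIR/"
done
```

Running the script with `DRY_RUN=0` performs the copy. Where `align.sh` expects to find the files is not stated in this thread, so `DEST_DIR` may need to point somewhere under the mounted source directory.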

Use Max's method of creating per-document language models. Note that it doesn't work yet due to missing pip dependencies. Working on it.

Delete accidentally committed ~ files.

galv commented 3 years ago

More summary: https://100k-hours.slack.com/archives/C01FGD1CCRZ/p1606809076013400 (It's corecursive!)