AlexAltea / curator

Automated normalization and curating of media collections
Apache License 2.0

Wrong language detected in `curator tag -s audio` #4

Closed alanorth closed 1 year ago

alanorth commented 1 year ago

I'm not sure how much of this is in your control versus langid's, but I just tried curator tag -s audio on a media file that has no spoken dialog and it wanted to tag it as English. :)

$ curator tag -s audio -t language --only-macrolanguages The\ Red\ Turtle\ \(2016\).mkv
┌───┬───────────────────────────┬────────┬─────────┬───┬─────────┐
│ # │ Name                      │ Stream │ Old tag │ → │ New tag │
├───┼───────────────────────────┼────────┼─────────┼───┼─────────┤
│ 1 │ The Red Turtle (2016).mkv │ 1      │ fre     │ → │ eng     │
└───┴───────────────────────────┴────────┴─────────┴───┴─────────┘

I'm curious what the score returned by langid was, and why curator decided it was a match for English. Perhaps you could add a debug flag that printed the score and the threshold. Or, how do you think we can make this more accurate?


P.S. It's actually very strange that the original tag is French since there is no dialog (there is a soundtrack, but no talking).

alanorth commented 1 year ago

Another one:

$ curator tag -s audio -t language --only-macrolanguages The\ Dark\ Valley\ \(2014\).mkv 
┌───┬────────────────────────────┬────────┬─────────┬───┬─────────┐
│ # │ Name                       │ Stream │ Old tag │ → │ New tag │
├───┼────────────────────────────┼────────┼─────────┼───┼─────────┤
│ 1 │ The Dark Valley (2014).mkv │ 1      │ ger     │ → │ eng     │
└───┴────────────────────────────┴────────┴─────────┴───┴─────────┘
Continue? [y/N]

This one is definitely German audio, and lots of it.

AlexAltea commented 1 year ago

I'll add the usual verbosity CLI flags: -v, --verbose, to display the probabilities.

And specifically for tagging, I think there should be a clear cut-off, e.g. below X probability do not suggest updating tags.

AlexAltea commented 1 year ago

Additionally, I think that we should let users customize the number of samples taken for the analysis.

Right now the hardcoded number is 10: https://github.com/AlexAltea/curator/blob/master/curator/stream.py#L72

But some users might prefer higher accuracy in exchange for slower processing.

AlexAltea commented 1 year ago

The first part is done. If you enable debug logging with `--log=DEBUG`, you should now see something like:

curator tag --log=DEBUG -s audio -t language '.\temp\Airplane! (1980) [English].mp4'
2023-01-25 14:06:23,666 | INFO | Processing 1 input media files
2023-01-25 14:06:23,724 | DEBUG | Detecting audio language in stream #1 of media: "Airplane! (1980) [English].mp4"
2023-01-25 14:06:26,258 | DEBUG | Sample #00: {'en': '0.2963', 'la': '0.2408', 'nn': '0.1822', 'cy': '0.0706', 'zh': '0.0384'}        
2023-01-25 14:06:26,983 | DEBUG | Sample #01: {'en': '0.9895', 'nn': '0.0015', 'fr': '0.0010', 'la': '0.0009', 'ja': '0.0007'}        
2023-01-25 14:06:27,760 | DEBUG | Sample #02: {'en': '0.9461', 'pt': '0.0067', 'ru': '0.0054', 'ko': '0.0054', 'de': '0.0053'}        
2023-01-25 14:06:28,518 | DEBUG | Sample #03: {'en': '0.9845', 'pt': '0.0017', 'fr': '0.0015', 'nn': '0.0012', 'ja': '0.0011'}        
2023-01-25 14:06:29,274 | DEBUG | Sample #04: {'en': '0.9880', 'fr': '0.0014', 'pt': '0.0010', 'zh': '0.0010', 'de': '0.0008'}        
2023-01-25 14:06:30,095 | DEBUG | Sample #05: {'en': '0.9767', 'nn': '0.0024', 'la': '0.0024', 'fr': '0.0022', 'de': '0.0017'}        
2023-01-25 14:06:30,908 | DEBUG | Sample #06: {'en': '0.9290', 'nn': '0.0090', 'de': '0.0085', 'la': '0.0064', 'ja': '0.0058'}        
2023-01-25 14:06:31,770 | DEBUG | Sample #07: {'en': '0.9892', 'nn': '0.0032', 'ja': '0.0008', 'fr': '0.0006', 'pt': '0.0006'}        
2023-01-25 14:06:32,632 | DEBUG | Sample #08: {'en': '0.9903', 'la': '0.0018', 'es': '0.0007', 'ja': '0.0007', 'fr': '0.0007'}        
2023-01-25 14:06:33,511 | DEBUG | Sample #09: {'en': '0.9184', 'ja': '0.0101', 'nn': '0.0082', 'ru': '0.0065', 'zh': '0.0061'}
[...]

Can you share the results for the movie whose language gets misrecognized?

AlexAltea commented 1 year ago

Now you can also customize the number of samples via --max-audio-samples.

I've gotten fairly good results even with --max-audio-samples=5, so I'm quite interested in your case with "The Dark Valley (2014).mkv".

alanorth commented 1 year ago

Debug mode is sweet! Here is the first one, where the soundtrack has no dialog:

$ curator tag -s audio -t language --only-macrolanguages --log=DEBUG The\ Red\ Turtle\ \(2016\).mkv
2023-01-25 20:40:34,188 | INFO | Processing 1 input media files
2023-01-25 20:40:34,308 | DEBUG | Detecting audio language in stream #1 of media: "The Red Turtle (2016).mkv"
2023-01-25 20:40:36,596 | DEBUG | Sample #00: {'cy': '0.4168', 'en': '0.2776', 'nn': '0.0851', 'zh': '0.0820', 'ja': '0.0171'}
2023-01-25 20:40:38,896 | DEBUG | Sample #01: {'en': '0.6997', 'nn': '0.0995', 'zh': '0.0549', 'ko': '0.0228', 'ru': '0.0158'}
2023-01-25 20:40:41,112 | DEBUG | Sample #02: {'en': '0.5794', 'nn': '0.1794', 'zh': '0.0433', 'ru': '0.0392', 'ko': '0.0220'}
2023-01-25 20:40:43,579 | DEBUG | Sample #03: {'en': '0.7473', 'zh': '0.0670', 'nn': '0.0469', 'ru': '0.0309', 'ko': '0.0192'}
2023-01-25 20:40:46,095 | DEBUG | Sample #04: {'nn': '0.4392', 'en': '0.2651', 'zh': '0.0941', 'ko': '0.0788', 'jw': '0.0201'}
2023-01-25 20:40:50,791 | DEBUG | Sample #05: {'en': '0.6740', 'nn': '0.1084', 'zh': '0.0544', 'ko': '0.0244', 'la': '0.0214'}
2023-01-25 20:40:56,974 | DEBUG | Sample #06: {'en': '0.5547', 'zh': '0.1205', 'la': '0.1107', 'nn': '0.0309', 'ru': '0.0227'}
2023-01-25 20:41:07,375 | DEBUG | Sample #07: {'en': '0.5111', 'ru': '0.1270', 'nn': '0.1121', 'zh': '0.0442', 'ja': '0.0314'}
2023-01-25 20:41:19,530 | DEBUG | Sample #08: {'nn': '0.5731', 'en': '0.2331', 'ko': '0.0357', 'ja': '0.0274', 'zh': '0.0259'}
2023-01-25 20:41:33,377 | DEBUG | Sample #09: {'en': '0.3591', 'zh': '0.2241', 'la': '0.0982', 'nn': '0.0679', 'jw': '0.0383'}
┌───┬───────────────────────────┬────────┬─────────┬───┬─────────┐
│ # │ Name                      │ Stream │ Old tag │ → │ New tag │
├───┼───────────────────────────┼────────┼─────────┼───┼─────────┤
│ 1 │ The Red Turtle (2016).mkv │ 1      │ fre     │ → │ eng     │
└───┴───────────────────────────┴────────┴─────────┴───┴─────────┘
Continue? [y/N]

And the second one, where the language is definitely German:

$ curator tag -s audio -t language --only-macrolanguages --log=DEBUG The\ Dark\ Valley\ \(2014\).mkv
2023-01-25 20:43:13,410 | INFO | Processing 1 input media files
2023-01-25 20:43:13,524 | DEBUG | Detecting audio language in stream #1 of media: "The Dark Valley (2014).mkv"
2023-01-25 20:43:15,869 | DEBUG | Sample #00: {'en': '0.5387', 'la': '0.1408', 'nn': '0.0857', 'zh': '0.0532', 'ja': '0.0281'}
2023-01-25 20:43:18,767 | DEBUG | Sample #01: {'zh': '0.4106', 'de': '0.3250', 'ko': '0.0568', 'ja': '0.0509', 'ru': '0.0332'}
2023-01-25 20:43:21,614 | DEBUG | Sample #02: {'de': '0.9772', 'nn': '0.0048', 'en': '0.0026', 'nl': '0.0025', 'fr': '0.0023'}
2023-01-25 20:43:25,029 | DEBUG | Sample #03: {'nn': '0.6693', 'en': '0.1933', 'ko': '0.0244', 'haw': '0.0233', 'ja': '0.0154'}
2023-01-25 20:43:33,731 | DEBUG | Sample #04: {'en': '0.3350', 'nn': '0.3216', 'haw': '0.1072', 'zh': '0.0508', 'ko': '0.0368'}
2023-01-25 20:43:45,177 | DEBUG | Sample #05: {'nn': '0.3802', 'en': '0.3499', 'zh': '0.0458', 'ko': '0.0448', 'ru': '0.0397'}
2023-01-25 20:43:59,697 | DEBUG | Sample #06: {'en': '0.4349', 'la': '0.4012', 'zh': '0.0590', 'nn': '0.0138', 'ru': '0.0134'}
2023-01-25 20:44:15,148 | DEBUG | Sample #07: {'en': '0.4768', 'nn': '0.1963', 'ko': '0.0678', 'zh': '0.0619', 'ru': '0.0513'}
2023-01-25 20:44:33,486 | DEBUG | Sample #08: {'en': '0.8013', 'nn': '0.0372', 'zh': '0.0243', 'ko': '0.0188', 'ru': '0.0170'}
2023-01-25 20:44:53,860 | DEBUG | Sample #09: {'en': '0.5733', 'nn': '0.1582', 'ko': '0.0505', 'ru': '0.0487', 'zh': '0.0473'}
┌───┬────────────────────────────┬────────┬─────────┬───┬─────────┐
│ # │ Name                       │ Stream │ Old tag │ → │ New tag │
├───┼────────────────────────────┼────────┼─────────┼───┼─────────┤
│ 1 │ The Dark Valley (2014).mkv │ 1      │ ger     │ → │ eng     │
└───┴────────────────────────────┴────────┴─────────┴───┴─────────┘
Continue? [y/N]

AlexAltea commented 1 year ago

In both cases the probabilities are (mostly!) very low at around 0.3~0.7.

This is unsurprising for the movie without dialog (first one), but the second one is interesting... Note how at some point it's very confident it's German ('de': '0.9772').

I think selecting the final language should consider the probability, instead of doing a naive majority vote across all samples.

I'll push some test code later to address this!
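
To make the problem concrete, here is a minimal sketch of a naive majority vote applied to the top prediction of each sample from the "The Dark Valley (2014).mkv" log above. This is not curator's actual code: the function name `naive_majority_vote` and the list `dark_valley_top` are hypothetical names used only for illustration.

```python
from collections import Counter

# Top-scoring (language, probability) pair of each sample, copied from the
# debug log for "The Dark Valley (2014).mkv" earlier in this thread.
dark_valley_top = [
    ("en", 0.5387), ("zh", 0.4106), ("de", 0.9772), ("nn", 0.6693),
    ("en", 0.3350), ("nn", 0.3802), ("en", 0.4349), ("en", 0.4768),
    ("en", 0.8013), ("en", 0.5733),
]

def naive_majority_vote(top_predictions):
    """Return the language that wins the most samples, ignoring probabilities.

    Hypothetical helper for illustration; not taken from curator.
    """
    votes = Counter(lang for lang, _prob in top_predictions)
    return votes.most_common(1)[0][0]

print(naive_majority_vote(dark_valley_top))  # en
```

Six low-confidence 'en' samples outvote the single high-confidence 'de' sample, which matches the wrong `ger → eng` suggestion reported above.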

AlexAltea commented 1 year ago

@alanorth Try the latest version!

Now it discards low-probability samples (the threshold is 0.8, which is still fairly tolerant), and additionally it computes the final score as an average plus a majority vote to deal with ties.

The algorithm is still quite simple (https://github.com/AlexAltea/curator/commit/5b8c150a4da012d4dcfb32a7f664aba425e00a74); the relevant part is barely 5 lines, but I believe it should fix both issues you encountered!
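
As a rough illustration of the approach described above, here is a minimal sketch that discards samples below the 0.8 threshold, ranks the remaining languages by number of votes, and uses the mean probability to break ties. It is one possible reading of "average + majority vote", not the code from the linked commit. The names `select_language`, `red_turtle`, and `dark_valley` are hypothetical, and the data are the top predictions from the logs earlier in this thread.

```python
from collections import defaultdict

THRESHOLD = 0.8  # cut-off mentioned in the comment above

def select_language(top_predictions, threshold=THRESHOLD):
    """Pick a language from per-sample (language, probability) pairs.

    Hypothetical helper, not curator's implementation. Samples below the
    threshold are discarded. Remaining languages are ranked by vote count,
    with the mean probability breaking ties (an assumed interpretation).
    Returns None when no sample is confident enough.
    """
    kept = defaultdict(list)
    for lang, prob in top_predictions:
        if prob >= threshold:
            kept[lang].append(prob)
    if not kept:
        return None
    return max(
        kept,
        key=lambda lang: (len(kept[lang]), sum(kept[lang]) / len(kept[lang])),
    )

# Top prediction per sample, copied from the debug logs in this thread.
red_turtle = [
    ("cy", 0.4168), ("en", 0.6997), ("en", 0.5794), ("en", 0.7473),
    ("nn", 0.4392), ("en", 0.6740), ("en", 0.5547), ("en", 0.5111),
    ("nn", 0.5731), ("en", 0.3591),
]
dark_valley = [
    ("en", 0.5387), ("zh", 0.4106), ("de", 0.9772), ("nn", 0.6693),
    ("en", 0.3350), ("nn", 0.3802), ("en", 0.4349), ("en", 0.4768),
    ("en", 0.8013), ("en", 0.5733),
]

print(select_language(red_turtle))   # None
print(select_language(dark_valley))  # de
```

With this reading, no sample of the first file reaches 0.8, so no tag change is suggested. In the second file, 'de' (0.9772) and 'en' (0.8013) each keep one sample, and the higher mean probability decides the tie in favor of 'de'.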

alanorth commented 1 year ago

Ah, that's clever! Now curator does the correct thing in both of these cases. First, the soundtrack with no dialog:

$ curator tag -s audio -t language --only-macrolanguages --log=DEBUG The\ Red\ Turtle\ \(2016\).mkv
2023-01-25 23:01:20,124 | INFO | Processing 1 input media files
2023-01-25 23:01:20,266 | DEBUG | Detecting audio language in stream #1 of media: "The Red Turtle (2016).mkv"
2023-01-25 23:01:22,230 | DEBUG | Sample #00: {'cy': '0.4168', 'en': '0.2776', 'nn': '0.0851', 'zh': '0.0820', 'ja': '0.0171'}
2023-01-25 23:01:24,467 | DEBUG | Sample #01: {'en': '0.6997', 'nn': '0.0995', 'zh': '0.0549', 'ko': '0.0228', 'ru': '0.0158'}
2023-01-25 23:01:26,636 | DEBUG | Sample #02: {'en': '0.5794', 'nn': '0.1794', 'zh': '0.0433', 'ru': '0.0392', 'ko': '0.0220'}
2023-01-25 23:01:29,006 | DEBUG | Sample #03: {'en': '0.7473', 'zh': '0.0670', 'nn': '0.0469', 'ru': '0.0309', 'ko': '0.0192'}
2023-01-25 23:01:31,789 | DEBUG | Sample #04: {'nn': '0.4392', 'en': '0.2651', 'zh': '0.0941', 'ko': '0.0788', 'jw': '0.0201'}
2023-01-25 23:01:34,209 | DEBUG | Sample #05: {'en': '0.6740', 'nn': '0.1084', 'zh': '0.0544', 'ko': '0.0244', 'la': '0.0214'}
2023-01-25 23:01:36,526 | DEBUG | Sample #06: {'en': '0.5547', 'zh': '0.1205', 'la': '0.1107', 'nn': '0.0309', 'ru': '0.0227'}
2023-01-25 23:01:39,387 | DEBUG | Sample #07: {'en': '0.5111', 'ru': '0.1270', 'nn': '0.1121', 'zh': '0.0442', 'ja': '0.0314'}
2023-01-25 23:01:42,051 | DEBUG | Sample #08: {'nn': '0.5731', 'en': '0.2331', 'ko': '0.0357', 'ja': '0.0274', 'zh': '0.0259'}
2023-01-25 23:01:44,769 | DEBUG | Sample #09: {'en': '0.3591', 'zh': '0.2241', 'la': '0.0982', 'nn': '0.0679', 'jw': '0.0383'}
Current plan requires no tasks. There is nothing to be done.

Second, the German one:

$ curator tag -s audio -t language --only-macrolanguages --log=DEBUG The\ Dark\ Valley\ \(2014\).mkv 
2023-01-25 22:56:38,434 | INFO | Processing 1 input media files
2023-01-25 22:56:38,605 | DEBUG | Detecting audio language in stream #1 of media: "The Dark Valley (2014).mkv"
2023-01-25 22:56:40,683 | DEBUG | Sample #00: {'en': '0.5387', 'la': '0.1408', 'nn': '0.0857', 'zh': '0.0532', 'ja': '0.0281'}
2023-01-25 22:56:43,441 | DEBUG | Sample #01: {'zh': '0.4106', 'de': '0.3250', 'ko': '0.0568', 'ja': '0.0509', 'ru': '0.0332'}
2023-01-25 22:56:46,331 | DEBUG | Sample #02: {'de': '0.9772', 'nn': '0.0048', 'en': '0.0026', 'nl': '0.0025', 'fr': '0.0023'}
2023-01-25 22:56:49,568 | DEBUG | Sample #03: {'nn': '0.6693', 'en': '0.1933', 'ko': '0.0244', 'haw': '0.0233', 'ja': '0.0154'}
2023-01-25 22:56:53,169 | DEBUG | Sample #04: {'en': '0.3350', 'nn': '0.3216', 'haw': '0.1072', 'zh': '0.0508', 'ko': '0.0368'}
2023-01-25 22:56:56,594 | DEBUG | Sample #05: {'nn': '0.3802', 'en': '0.3499', 'zh': '0.0458', 'ko': '0.0448', 'ru': '0.0397'}
2023-01-25 22:57:00,895 | DEBUG | Sample #06: {'en': '0.4349', 'la': '0.4012', 'zh': '0.0590', 'nn': '0.0138', 'ru': '0.0134'}
2023-01-25 22:57:04,759 | DEBUG | Sample #07: {'en': '0.4768', 'nn': '0.1963', 'ko': '0.0678', 'zh': '0.0619', 'ru': '0.0513'}
2023-01-25 22:57:09,003 | DEBUG | Sample #08: {'en': '0.8013', 'nn': '0.0372', 'zh': '0.0243', 'ko': '0.0188', 'ru': '0.0170'}
2023-01-25 22:57:13,439 | DEBUG | Sample #09: {'en': '0.5733', 'nn': '0.1582', 'ko': '0.0505', 'ru': '0.0487', 'zh': '0.0473'}
┌───┬────────────────────────────┬────────┬─────────┬───┬─────────┐
│ # │ Name                       │ Stream │ Old tag │ → │ New tag │
├───┼────────────────────────────┼────────┼─────────┼───┼─────────┤
│ 1 │ The Dark Valley (2014).mkv │ 1      │ ger     │ → │ deu     │
└───┴────────────────────────────┴────────┴─────────┴───┴─────────┘
Continue? [y/N]