State of content language as of June 2024

ikprk commented 1 week ago

Curated videos

The problem is pretty simple: we have only 4 videos hand-picked by the CWG (as of 19th June).

Rest of the feed

Our language detection approach wasn't good enough for the home feed. The overall accuracy was pretty good for the initial sample, but the problem was that sometimes it was filtering out the videos from high-value channels.

The thing is, we cannot rely on the language property: too many non-English videos are marked as English, so trying to use it as a filter is pointless.
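To make the gap between "good overall accuracy" and "filters out high-value channels" concrete, here is a minimal sketch of how a labelled sample could be scored on both at once. It is not Orion code: `Sample`, `summarize`, and the field names are hypothetical, and it assumes someone has hand-labelled the true language of each sampled video.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One hand-labelled video from an evaluation sample (hypothetical shape)."""
    predicted_english: bool   # what the detector said
    actually_english: bool    # what a human reviewer said
    high_value_channel: bool  # whether the video belongs to a high-value channel


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Report overall accuracy next to the count of English videos from
    high-value channels that the detector would wrongly filter out."""
    if not samples:
        raise ValueError("need at least one labelled sample")
    correct = sum(s.predicted_english == s.actually_english for s in samples)
    wrongly_filtered = sum(
        1
        for s in samples
        if s.high_value_channel and s.actually_english and not s.predicted_english
    )
    return {
        "accuracy": correct / len(samples),
        "wrongly_filtered_high_value": wrongly_filtered,
    }
```

A detector can score well on `accuracy` and still be unacceptable if `wrongly_filtered_high_value` is non-zero, which is the situation described above.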

Solutions

I know of three possible solutions:

1. We should push interactions tracking ASAP to start gathering the data for an ML solution, and after gathering a decent amount of it we should release Gleev with the ML system, using interactions geolocation as a way of bypassing poor language property accuracy.
2. Find some services that offer video-based language detection, integrate one into Orion, and use it for high-value channels.
3. Try text language detection with a more reliable approach. We could use some paid solution to detect language based on the video title, this should be fairly cheap.
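As a rough illustration of the title-based idea, here is a minimal sketch that uses only the Python standard library to reject titles written mostly in a non-Latin script. It is a much weaker check than real language identification: it cannot tell English from other Latin-script languages, and it does nothing for videos whose metadata is in English but whose audio is not. The names `latin_ratio` and `may_be_english` and the `0.8` threshold are assumptions made for this example.

```python
import unicodedata


def latin_ratio(title: str) -> float:
    """Fraction of the alphabetic characters in `title` that are Latin letters."""
    letters = [ch for ch in title if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode names of Latin letters start with "LATIN", e.g. "LATIN SMALL LETTER A".
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)


def may_be_english(title: str, threshold: float = 0.8) -> bool:
    """Cheap pre-filter: False means the title is mostly non-Latin script,
    True only means English has not been ruled out."""
    return latin_ratio(title) >= threshold
```

A check like this could run before any heavier detector, so that only titles that pass it need a proper language-identification step.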

dmtrjsg commented 1 week ago

Thank you for the update sir.

I think 3 would not work as we have English metadata for some Tamil and Bangla content. For 2, I already have a provider researched, so I can enquire about the pricing. 1 seems to have synergies with the overall Recombee deployment efforts, so should we go in the exact order you suggested? 1 -> 2 -> 3 (the last one only if 1 and 2 do not work).

bedeho commented 1 week ago

Using interactions geolocation as a way of bypassing poor language property accuracy.

I don't think I understand how this is supposed to work.

Try text language detection with a more reliable approach. We could use some paid solution to detect language based on the video title, this should be fairly cheap.

I think this should be able to get to basically 100% accuracy on the text itself, and I don't think we need any sort of service really; this is an ancient problem, fully solved.

I think 3 would not work as we have English metadata for some Tamil and Bangla content.

Yeah, there will be videos like this where the metadata is perfectly English but the content, for some reason, is not. But let's not let the perfect be the enemy of the good: this approach is by far the simplest solution, with the fewest dependencies. Once you start getting into videos which have a mix of languages in the audio track, it really starts to become a semantic question what it means for a video to be in a given language. It's not objective, and so perfection is not really a standard we can achieve.

dmtrjsg commented 2 days ago

@zeeshanakram3 we are awaiting deployment of the corresponding PR so we can test it.
