Tyrrrz / YoutubeExplode

Abstraction layer over YouTube's internal API
MIT License
2.94k stars 491 forks

Some spaces are missing in closed captions #204

foadabdollahi closed this issue 5 years ago

foadabdollahi commented 5 years ago

Some words in Arabic subtitles are missing the space between them. The cause is that some tags are removed when the captions are converted to text.

SlowLogicBoy commented 5 years ago

Can you provide:

  1. YouTube video example
  2. Expected result
  3. Actual result

Because I'm not familiar with Arabic languages, and how they should be spaced/formatted.

foadabdollahi commented 5 years ago

thanks,

var client = new YoutubeClient();
var trackInfos = await client.GetVideoClosedCaptionTrackInfosAsync("_QdPW8JrYzQ");
var trackInfo = trackInfos.First(t => t.Language.Code == "en");
trackInfo.Url+="&tlang=fa"; // request the auto-translated Persian (fa) track
var track = await client.GetClosedCaptionTrackAsync(trackInfo);

var caption = track.GetByTime(TimeSpan.FromSeconds(61));
var text = caption.Text; // "And the game was afoot."

This is the video: https://www.youtube.com/watch?v=C6jS7rhMm5Y

For this link: link

After getting the data in the new format, some words in var text = caption.Text; have no space between them.

The original on YouTube is:

<p t="7649" d="3481">
<s> می خواهم به شما در مورد </s>
<s p="2">یک</s>
<s> </s>
<s p="1">هوشمند جدید</s>
<s> به </s>
</p>

The YoutubeExplode result is:

4
00:00:07.649 --> 00:00:11.130
 می خواهم به شما در مورد یکهوشمند جدید به 

Correct is: یک هوشمند and incorrect is: یکهوشمند

I hope you can understand the problem.

Tyrrrz commented 5 years ago

trackInfo.Url+="&tlang=fa"; this isn't valid because it has no public setter.

Your link gives 404 for me.

The video you linked does not have Arabic subtitles anyway.

foadabdollahi commented 5 years ago

Yes. If you turn on auto-translate and set any language, such as Persian or Arabic, you can see in the browser inspector that one new file is loaded and added to the page. This file is the translation of the subtitles. YoutubeExplode removes something from it that is needed between Arabic words. The 404 is because the link carries a signature with an expiration, so it stops working after a few minutes.

If you open the Network tab in your inspector and turn on auto-translate, you can see the link yourself.

foadabdollahi commented 5 years ago

trackInfo.Url+="&tlang=fa";

I created a new object and set this value on it, and it worked correctly. :D

Tyrrrz commented 5 years ago

It seems to be caused by this issue: https://stackoverflow.com/questions/2737636/c-sharp-linq-to-xml-missing-space-character

foadabdollahi commented 5 years ago

Maybe. I think a regex could be helpful for removing the other tags inside the p tag.

foadabdollahi commented 5 years ago

var tagsExpression = new Regex(@"</?.+?>");
return tagsExpression.Replace(input, " ");

Something like this.
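For readers following along, here is a minimal, self-contained sketch of the regex idea above. The class and method names (TagStripper, StripTags) are hypothetical and not part of YoutubeExplode, and the whitespace-collapsing step is an added assumption, since replacing every tag with a space leaves runs of spaces behind.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of YoutubeExplode.
public static class TagStripper
{
    // Same pattern as in the comment above: any opening or closing tag.
    private static readonly Regex Tags = new Regex(@"</?.+?>");

    // Assumption: collapse the extra spaces left behind by the replacement.
    private static readonly Regex Spaces = new Regex(@"\s+");

    public static string StripTags(string input)
    {
        // Replace every tag with a single space so neighbouring segments never fuse.
        var withoutTags = Tags.Replace(input, " ");

        // Collapse whitespace runs to one space and trim both ends.
        return Spaces.Replace(withoutTags, " ").Trim();
    }
}
```

One caveat of this approach: it also inserts a space between adjacent segments that had no whitespace between them, so a word split across two s tags would come out as two words.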

Tyrrrz commented 5 years ago

There's apparently an option to preserve whitespace, I fixed it.
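For reference, the whitespace behaviour discussed in this thread can be reproduced with a small sketch. It assumes the captions are parsed with LINQ to XML, as the linked Stack Overflow question suggests. It is not the actual YoutubeExplode code. The class name WhitespaceDemo and the Latin placeholder text are made up for illustration; LoadOptions.PreserveWhitespace is the standard System.Xml.Linq option.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical demo class, not part of YoutubeExplode.
public static class WhitespaceDemo
{
    // Shaped like the caption sample earlier in the thread; "A" and "B"
    // stand in for the two words, separated by a whitespace-only <s> element.
    private const string Xml =
        "<p t=\"7649\" d=\"3481\"><s>A</s><s> </s><s>B</s></p>";

    // Default parsing drops whitespace-only text nodes, so the words fuse.
    public static string ParseDefault() =>
        XElement.Parse(Xml).Value;

    // PreserveWhitespace keeps the whitespace-only node between the words.
    public static string ParsePreserving() =>
        XElement.Parse(Xml, LoadOptions.PreserveWhitespace).Value;

    public static void Main()
    {
        Console.WriteLine(ParseDefault());    // AB
        Console.WriteLine(ParsePreserving()); // A B
    }
}
```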