import re

# findall returns one tuple per match; index 0 is the outermost group, i.e. the full URL.
[u[0] for u in
re.findall(r"(https?://(www\.)?(youtube\.com/(watch\?[a-zA-Z0-9=&]*v=|embed/)|youtu\.be/)[a-zA-Z0-9]{11})",
'''bit of trash text youtube.com/feed/subscriptions
https://www.youtube.com/watch?v=ch69W2l1Mak <iframe width="1869"
height="763" src="https://www.youtube.com/embed/ch69W2l1Mak"
title="YouTube video player" frameborder="0" allow="accelerometer;
autoplay; clipboard-write; encrypted-media; gyroscope;
picture-in-picture" allowfullscreen></iframe>
https://youtu.be/ch69W2l1Mak?t=10 https://www.youtube.com/watch?v=ch69W
http://youtube.com/watch?v=ch69W2l1Mak''')]
- trash text is ignored
- non-video pages such as the subscriptions feed are ignored
- a proper https desktop link matches
- the surrounding embed HTML is ignored
- but the embed link inside it is extracted
- the youtu.be link is extracted, ignoring additional arguments
- a broken link (ID too short) is ignored
- a link matches even with plain http and without www
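To make the cases above easy to re-check, here is a minimal sketch that runs each one separately. The pattern is the one from the snippet, written as a raw string with the dot in `youtu.be` escaped; `find_youtube_urls` and `CASES` are hypothetical names used only for illustration and are not part of this PR.

```python
import re

# Pattern from the snippet above (raw string, dot in youtu.be escaped).
YOUTUBE_URL = re.compile(
    r"(https?://(www\.)?(youtube\.com/(watch\?[a-zA-Z0-9=&]*v=|embed/)|youtu\.be/)[a-zA-Z0-9]{11})"
)


def find_youtube_urls(text):
    """Return the full URL of every match (element 0 of each findall tuple)."""
    return [groups[0] for groups in YOUTUBE_URL.findall(text)]


# One (input, expected) pair per bullet; the inputs are taken from the test string.
CASES = [
    ("bit of trash text", []),
    ("youtube.com/feed/subscriptions", []),
    ("https://www.youtube.com/watch?v=ch69W2l1Mak",
     ["https://www.youtube.com/watch?v=ch69W2l1Mak"]),
    ('<iframe width="1869" src="https://www.youtube.com/embed/ch69W2l1Mak" frameborder="0"></iframe>',
     ["https://www.youtube.com/embed/ch69W2l1Mak"]),
    ("https://youtu.be/ch69W2l1Mak?t=10", ["https://youtu.be/ch69W2l1Mak"]),
    ("https://www.youtube.com/watch?v=ch69W", []),
    ("http://youtube.com/watch?v=ch69W2l1Mak",
     ["http://youtube.com/watch?v=ch69W2l1Mak"]),
]

for text, expected in CASES:
    assert find_youtube_urls(text) == expected, text
```

Keeping one input per case should make it clearer which bullet breaks if the regex is changed later.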
I also updated the other embed regex so that it handles broken IDs more accurately.
Feel free to edit if you disagree with any of these cases. For example, the if/else could be removed altogether, but I suppose using only the embed is more accurate. Or perhaps keep only the second, if only embeds are needed?
Fixes #47
I think this should work. I tested it, and the snippet returns:
['https://www.youtube.com/watch?v=ch69W2l1Mak', 'https://www.youtube.com/embed/ch69W2l1Mak', 'https://youtu.be/ch69W2l1Mak', 'http://youtube.com/watch?v=ch69W2l1Mak']
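If only the video ID is needed rather than the full URL, a variant along these lines might be simpler. This is a sketch under assumptions, not what this PR implements: `extract_video_ids` is a hypothetical helper, the groups are made non-capturing so no `u[0]` indexing is required, and the ID class is widened to allow `-` and `_`, which the pattern in this PR does not accept.

```python
import re

# Hypothetical variant: non-capturing groups plus one named group for the ID.
# Assumption: IDs may contain '-' and '_'; the PR's pattern allows only alphanumerics.
VIDEO_ID = re.compile(
    r"https?://(?:www\.)?"
    r"(?:youtube\.com/(?:watch\?[a-zA-Z0-9=&]*v=|embed/)|youtu\.be/)"
    r"(?P<id>[A-Za-z0-9_-]{11})"
)


def extract_video_ids(text):
    """Return the 11-character ID from every YouTube link found in text."""
    return [match.group("id") for match in VIDEO_ID.finditer(text)]


sample = (
    "https://youtu.be/ch69W2l1Mak?t=10 "
    "https://www.youtube.com/watch?v=ch69W "
    "http://youtube.com/watch?v=ch69W2l1Mak"
)
ids = extract_video_ids(sample)
```

With the sample above, the too-short ID is still skipped and the two valid links both yield `ch69W2l1Mak`.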