Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

[BUG]: Youtube data collector "Failed to locate a transcript for this video!" #2597

Closed stdestro closed 1 week ago

stdestro commented 1 week ago

How are you running AnythingLLM?

Docker (remote machine)

What happened?

I am trying to collect a transcript with the YouTube transcript data connector. My local installation on macOS collects the transcript, while the Docker instance on Ubuntu gives this error:

Failed to locate a transcript for this video!

The video link is the same, so the video is not the problem (I tried different videos, with the same result). I have the same LLM model and the same LLM agent in the desktop app and in the Docker instance. The Docker instance is started with --cap-add SYS_ADMIN. I can scrape websites through the data collector smoothly in the Docker instance. The only problem is collecting transcripts from YouTube.

Are there known steps to reproduce?

No response

timothycarambat commented 1 week ago

The IP your Ubuntu instance is on is probably being blocked by Google from reaching https://www.youtube.com/watch URLs. That is the only thing that would prevent this, since it works on other platforms and you can scrape sites in general.

It is also possible that the video is blocked in the geography associated with the Ubuntu instance's IP.

stdestro commented 1 week ago

It's not a geo restriction; I tried different videos. I can reach the video from the terminal using curl and lynx, so it seems YouTube is not blocking my IP:

ubuntu@instance-2024xxx-xxx:~$ curl -I https://www.youtube.com/watch?v=ugpFyDQexlA
HTTP/2 200 
content-type: text/html; charset=utf-8
x-content-type-options: nosniff
cache-control: no-cache, no-store, max-age=0, must-revalidate
pragma: no-cache
expires: Mon, 01 Jan 1990 00:00:00 GMT
date: Fri, 08 Nov 2024 07:13:56 GMT
content-length: 920191
x-frame-options: SAMEORIGIN
strict-transport-security: max-age=31536000
origin-trial: AmhMBR6zCLzDDxpW+HfpP67BqwIknWnyMOXOQGfzYswFmJe+fgaI6XZgAzcxOrzNtP7hEDsOo1jdjFnVr2IdxQ4AAAB4eyJvcmlnaW4iOiJodHRwczovL3lvdXR1YmUuY29tOjQ0MyIsImZlYXR1cmUiOiJXZWJWaWV3WFJlcXVlc3RlZFdpdGhEZXByZWNhdGlvbiIsImV4cGlkjlkjòlkjòlkjkjkzE5OSwiaXNTdWJkb21haW4iOnRydWV9
report-to: {"group":"youtube_main","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/youtube_main"}]}
content-security-policy: require-trusted-types-for 'script'
cross-origin-opener-policy: same-origin-allow-popups; report-to="youtube_main"
permissions-policy: ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-wow64=*, ch-ua-form-factors=*, ch-ua-platform=*, ch-ua-platform-version=*
p3p: CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=it for more info."
server: ESF
x-xss-protection: 0
set-cookie: YSC=-SoXXU7Xipk; Domain=.youtube.com; Path=/; Secure; HttpOnly; SameSite=none
set-cookie: __Secure-YEC=CgtZdkZXSlFtNUjfsjdksdjhsjkdhfkjhsjdhjfdjhwYGRobHB0eHw4PIBAREiEgEg%3D%3D; Domain=.youtube.com; Expires=Mon, 08-Dec-2025 07:13:55 GMT; Path=/; Secure; HttpOnly; SameSite=lax
set-cookie: VISITOR_PRIVACY_METADATA=CgJJVBIcEhgSFhMLhjjhgljhbvjhvjhHB0eHw4PIBAREiEgEg%3D%3D; Domain=.youtube.com; Expires=Mon, 08-Dec-2025 07:13:57 GMT; Path=/; Secure; HttpOnly; SameSite=none
set-cookie: VISITOR_INFO1_LIVE=; Domain=.youtube.com; Expires=Sat, 12-Feb-2022 07:13:57 GMT; Path=/; Secure; HttpOnly; SameSite=none
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
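
Note that `curl -I` sends a HEAD request, so the `200` above shows only that the server answered, not that the returned page carries caption data. A minimal JavaScript sketch for inspecting the HTML body is shown below. The function name `classifyWatchPage` and the marker strings (`g-recaptcha`, `"captionTracks"`) are assumptions made for illustration; they are not taken from the AnythingLLM collector and may not match the current page format.

```javascript
// Hypothetical helper: classify the HTML body of a YouTube watch page.
// The marker strings are assumptions about the page format, not a documented API.
function classifyWatchPage(html) {
  if (typeof html !== "string" || html.length === 0) return "empty";
  // A bot check can still be served with an HTTP 200 status.
  if (html.includes("g-recaptcha")) return "captcha";
  // Caption metadata is assumed to appear under a "captionTracks" key.
  if (html.includes('"captionTracks"')) return "has-captions";
  return "no-captions";
}

// Usage (Node 18+ provides a global fetch; run it where the collector runs):
// fetch("https://www.youtube.com/watch?v=ugpFyDQexlA")
//   .then((res) => res.text())
//   .then((html) => console.log(classifyWatchPage(html)));
```

Comparing the result from the macOS machine with the result from inside the Docker container would show whether the two hosts receive different page bodies for the same URL.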

timothycarambat commented 1 week ago

So no YouTube video works on this instance?

stdestro commented 1 week ago

Exactly, I tried 10 different videos from different regions. It works on the desktop app, but not on the Docker instance on the Oracle VM.

timothycarambat commented 1 week ago

When viewing the Docker logs and attempting a collection, do we see a [collector] line item that shows any more information about that error besides the user-facing one?

Hoping this error fires: https://github.com/Mintplex-Labs/anything-llm/blob/890fb29464d4d571f714855ef1d9725a5b2011fc/collector/utils/extensions/YoutubeTranscript/YoutubeLoader/youtube-transcript.js#L91

stdestro commented 1 week ago

This is the Docker log after 3 tries:

[backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
[collector] info: -- Working YouTube https://www.youtube.com/watch?v=L1RMd96eHgo --
[backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
[collector] info: -- Working YouTube https://www.youtube.com/watch?v=eyVDMJN0sa8 --
[backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.
[collector] info: -- Working YouTube https://www.youtube.com/watch?v=eyVDMJN0sa8 --
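
The excerpt above contains only `info` lines, so no collector error was logged for these attempts. For longer logs, a small filter can surface collector lines at any other level. This is a sketch only: the `[service] level: message` shape is inferred from the excerpt, and `collectorProblems` is a hypothetical name.

```javascript
// Hypothetical helper: keep only [collector] lines whose level is not "info".
// The "[service] level: message" shape is inferred from the log excerpt.
function collectorProblems(logText) {
  const linePattern = /^\[(\w+)\]\s+(\w+):\s*(.*)$/;
  return logText
    .split(/\r?\n/)
    .map((line) => linePattern.exec(line.trim()))
    .filter((m) => m !== null && m[1] === "collector" && m[2] !== "info")
    .map((m) => ({ level: m[2], message: m[3] }));
}

// Two lines copied from the log in this thread.
const sample = [
  "[backend] info: [EncryptionManager] Loaded existing key & salt for encrypting arbitrary data.",
  "[collector] info: -- Working YouTube https://www.youtube.com/watch?v=L1RMd96eHgo --",
].join("\n");

console.log(collectorProblems(sample)); // prints []
```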

timothycarambat commented 1 week ago

@stdestro Does this thread apply?

The script we are using is a fork of that repo. We broke from it a long time ago to force-patch something in that data connector. Thinking about the network difference, I wonder if this is the issue: are the IPv4 and IPv6 responses from youtube.com different?
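
One way to probe the IPv4/IPv6 hypothesis from Node is to pin a request to IPv4 with the `family` option of `https.request`. The sketch below is not how the collector issues its requests; `ipv4Options` and `fetchStatusOverIPv4` are hypothetical helpers for a manual test.

```javascript
const https = require("node:https");

// Hypothetical helper: build https.request options restricted to IPv4.
function ipv4Options(videoUrl) {
  const url = new URL(videoUrl);
  return {
    hostname: url.hostname,
    path: url.pathname + url.search,
    method: "GET",
    family: 4, // used when resolving the hostname, so only IPv4 addresses are tried
  };
}

// Hypothetical helper: resolve with the HTTP status code of an IPv4-only request.
function fetchStatusOverIPv4(videoUrl) {
  return new Promise((resolve, reject) => {
    const req = https.request(ipv4Options(videoUrl), (res) => {
      res.resume(); // discard the body; only the status matters here
      resolve(res.statusCode);
    });
    req.on("error", reject);
    req.end();
  });
}

// Usage:
// fetchStatusOverIPv4("https://www.youtube.com/watch?v=eyVDMJN0sa8").then(console.log);
```

Changing `family` to `6` gives the IPv6 counterpart, which makes it possible to compare both paths from the same host.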

stdestro commented 1 week ago

So, for me curl -L -4 works fine (returns the HTML), and curl -L -6 returns curl: (7) Couldn't connect to server

I cannot ping over IPv6:

ubuntu@instance-20241107-1421:~$ ping6 youtube.com
ping6: connect: Network is unreachable

while IPv4 works:

ubuntu@instance-20241107-1421:~$ ping youtube.com
PING youtube.com (216.58.205.46) 56(84) bytes of data.
64 bytes from mil04s24-in-f14.1e100.net (216.58.205.46): icmp_seq=1 ttl=117 time=8.07 ms
64 bytes from mil04s24-in-f14.1e100.net (216.58.205.46): icmp_seq=2 ttl=117 time=8.03 ms
64 bytes from mil04s24-in-f46.1e100.net (216.58.205.46): icmp_seq=3 ttl=117 time=8.02 ms
64 bytes from lhr48s23-in-f14.1e100.net (216.58.205.46): icmp_seq=4 ttl=117 time=8.06 ms

while from the desktop I get this:

s@MacBookAir ~ % ping youtube.com       
PING youtube.com (142.251.209.14): 56 data bytes
64 bytes from 142.251.209.14: icmp_seq=0 ttl=118 time=12.378 ms
64 bytes from 142.251.209.14: icmp_seq=1 ttl=118 time=8.915 ms
64 bytes from 142.251.209.14: icmp_seq=2 ttl=118 time=9.180 ms
64 bytes from 142.251.209.14: icmp_seq=3 ttl=118 time=16.002 ms
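
The `Network is unreachable` result from `ping6` usually means the host has no usable IPv6 route. A rough check from Node is sketched below. It is only a heuristic (an address can exist without a working route), and `hasNonLinkLocalIPv6` is a hypothetical helper name.

```javascript
const os = require("node:os");

// Hypothetical helper: true if any non-internal interface carries an IPv6
// address that is not link-local (fe80::/10 addresses start with "fe80:" here).
// Accepts the object returned by os.networkInterfaces().
function hasNonLinkLocalIPv6(interfaces) {
  return Object.values(interfaces)
    .flat()
    .filter(Boolean)
    .some(
      (addr) =>
        // Some Node 18 releases report the family as the number 6.
        (addr.family === "IPv6" || addr.family === 6) &&
        !addr.internal &&
        !addr.address.toLowerCase().startsWith("fe80:")
    );
}

console.log(hasNonLinkLocalIPv6(os.networkInterfaces()));
```

Running it inside the container rather than on the VM host matters, because the collector sees the container's interfaces.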