blackjack4494 / yt-dlc

media downloader and library for various sites.

Parallel processing #176

Open SKnight79 opened 3 years ago

SKnight79 commented 3 years ago


Description

youtube-dl doesn't seem to support parallel downloading. Until it does, I use GNU parallel to run it, but I notice it launches a new Python interpreter each time it runs. Several Python instances grab all the available CPU to run the tasks. Any ideas on how to run a Python daemon that forks and runs each job under its own parent group?

jbruchon commented 3 years ago

It probably won't help, especially given the way the program writes fragments to the disk. Without a good SSD, running parallel downloads will cause severe disk thrashing or flash write pauses and possibly some on-disk fragmentation as well. If you run too many downloads in parallel, YouTube will lock you out with the dreaded 429 error. I run 2-3 separate downloads at once maximum, each with their own instance of youtube-dlc.

SKnight79 commented 3 years ago

Only matters for the time crunchers. I’ve grown past using a playlist.

I have a script that preflight-checks a master list of YouTube playlists, pulls a video ID list from each playlist by extracting it from the JSON output, compares that with what it has seen previously, then batch-downloads the remaining unseen videos. I use GNU parallel to help with the batch job, passing each video ID as an input argument.
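Roughly, the preflight step does something like this (a sketch in Python rather than my actual shell-plus-JSON script; the file names are illustrative, and it assumes this fork installs as the youtube_dlc module, which has the same interface as youtube_dl):

```python
import youtube_dlc  # assumed module name for this fork; youtube_dl exposes the same YoutubeDL class

def playlist_video_ids(playlist_url):
    # extract_flat skips resolving each video, so this only lists the playlist entries
    opts = {"extract_flat": True, "quiet": True}
    with youtube_dlc.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(playlist_url, download=False)
    return [e.get("id") or e.get("url") for e in info.get("entries", [])]

# illustrative files: one playlist URL per line, one already-seen video ID per line
seen = set(open("seen_ids.txt").read().split())
unseen = []
for line in open("playlists.txt"):
    url = line.strip()
    if not url:
        continue
    unseen += [vid for vid in playlist_video_ids(url) if vid not in seen]
# `unseen` is what gets handed to the batch-download step
```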

I’ve been doing up to 30 at a time (3 parallel instances running up to 10 tasks each) using GNU parallel.

No 429 here so far.

The issue doesn’t seem to be the disk load (I’m using HDDs) or network (1 Gb internet). The issue is the processor load.

So far, parallel is just xargs on steroids that is aware of processor cores, but each process runs in its own environment (setup, runtime, teardown).

When I switch the script to parallel mode, it sucks up every slice of processor time to run the batch and makes everything else running on the machine unstable. I think this has more to do with the Python runtime loading and setting up youtube-dl each time.

I wish there were a way to put youtube-dl into a daemon mode where it could run jobs concurrently and asynchronously, but using just one main process that runs multithreaded.

If youtube-dl (via Python) ran as a daemon, I could call it via an API or another process and send it tasks to run. It could queue them up or run them in parallel. I think that would be less overhead than each youtube-dl instance running on its own.
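Something like this is what I have in mind — a minimal sketch of the one-process idea, with a small thread pool inside a single Python interpreter (it assumes the fork imports as youtube_dlc, and it gives each worker its own YoutubeDL object since the library isn't documented as thread-safe):

```python
from concurrent.futures import ThreadPoolExecutor
import youtube_dlc  # assumed module name for this fork

def fetch(video_id):
    # one YoutubeDL object per worker thread; downloads are I/O-bound, so
    # threads overlap fine even with the GIL
    opts = {"outtmpl": "%(id)s.%(ext)s", "quiet": True, "ignoreerrors": True}
    with youtube_dlc.YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=" + video_id])

unseen_ids = open("unseen_ids.txt").read().split()  # illustrative: output of the preflight step
with ThreadPoolExecutor(max_workers=3) as pool:      # 3 concurrent downloads, one interpreter
    list(pool.map(fetch, unseen_ids))
```

That keeps one interpreter loaded for the whole batch instead of one per video, which is where the per-task CPU overhead seems to come from.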

Another tool I use is Flexget, and they seem to have figured this out.

-Hector


jbruchon commented 3 years ago

Have you tried the options -a BATCH_FILE and --no-part to make things faster? -a in particular lets you feed your list of URLs to the program directly rather than restarting it once per URL. Apparently you can also feed it URLs via stdin this way, so if you open a terminal and run youtube-dlc -a -, I assume you can paste URLs all day long and it'll all run in the same instance; or you can set up batch text files and use parallel to do a one-shot run of all the batches.
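If you'd rather stay inside one Python process than drive the CLI, the embedding interface does the same one-instance-per-batch thing — an untested sketch, assuming the fork imports as youtube_dlc and that the nopart option matches --no-part:

```python
import youtube_dlc  # assumed module name; youtube_dl exposes the same YoutubeDL class

# rough in-process counterpart of `youtube-dlc -a urls.txt --no-part`
with open("urls.txt") as fh:  # illustrative batch file, one URL per line
    urls = [ln.strip() for ln in fh if ln.strip() and not ln.startswith("#")]

opts = {
    "nopart": True,        # assumed equivalent of the --no-part flag
    "ignoreerrors": True,  # don't abort the whole batch on one bad URL
    "outtmpl": "%(id)s.%(ext)s",
}
with youtube_dlc.YoutubeDL(opts) as ydl:
    ydl.download(urls)     # one process works through the entire list
```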

SKnight79 commented 3 years ago

Yeah I started with that but then I wanted to do search queries, playlists, title matches, and date ranges, and it started getting more complicated.

I could do a hybrid: split the batch list into, say, 10 chunks, then launch parallel and point youtube-dlc at each list. That might ease the processor load. Might try that.
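If I try it, the chunking side is simple enough — a sketch (the chunk count, file names, and the cap of 3 instances are just what I've been running, nothing youtube-dlc prescribes):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def write_chunks(ids, n_chunks=10, prefix="chunk_"):
    # split the unseen-ID list into batch files that `youtube-dlc -a` can consume
    size = max(1, -(-len(ids) // n_chunks))  # ceiling division
    paths = []
    for i in range(0, len(ids), size):
        path = f"{prefix}{i // size:02d}.txt"
        with open(path, "w") as fh:
            fh.writelines(f"https://www.youtube.com/watch?v={v}\n" for v in ids[i:i + size])
        paths.append(path)
    return paths

def run_batch(path):
    # one youtube-dlc process (and one Python runtime) per chunk instead of per video
    subprocess.run(["youtube-dlc", "-a", path, "--no-part"], check=False)

ids = open("unseen_ids.txt").read().split()      # illustrative: output of the preflight step
with ThreadPoolExecutor(max_workers=3) as pool:  # never more than 3 instances at once
    list(pool.map(run_batch, write_chunks(ids)))
```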

-Hector
