alanmcgovern / monotorrent

The official repository for MonoTorrent, a bittorrent library for .NET
https://github.com/alanmcgovern/monotorrent
MIT License

Passing DhtEngine into ClientEngine? #677

Closed · mattheys closed this 3 weeks ago

mattheys commented 4 weeks ago

Hi,

I've hit a bit of a bottleneck that I can't seem to get past.

I've got several thousand InfoHashes and I am writing something to download the metadata and put it into a database.

However I can't work out how to get my ClientEngine to use my DhtEngine.

The Dht object inside my ClientEngine never connects; I think this is because some of the bootstrap servers are unavailable. I've created my own DhtEngine and bootstrapped some more servers into it, and that connects fine.

However I have no idea how I can get the ClientEngine to use the working DhtEngine.

Do I need to use the Factories class?

Thanks

alanmcgovern commented 3 weeks ago

Option 1

            // create this once
            var dhtEngine = new MonoTorrent.Dht.DhtEngine ();

            // Configure the factory to return the same instance every time
            var factories = Factories.Default.WithDhtCreator (() => {
                return dhtEngine;
            });
            using var engine = new ClientEngine (settingBuilder.ToSettings (), factories);

Option 2

            // Configure the factory to return a new instance every time. This means one will be created per ClientEngine.
            var factories = Factories.Default.WithDhtCreator (() => {
                var dhtEngine = new DhtEngine ();
                // do something?
                return dhtEngine;
            });
            using var engine = new ClientEngine (settingBuilder.ToSettings (), factories);

Both of those options should work, depending on what you want to accomplish. If you need complex async initialisation you'll need option 1: fully pre-construct the dht engine (and initialise it) before configuring the factory and passing the configured factory to the ClientEngine.
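As a rough sketch of that pre-construction flow (not from the thread): BootstrapExtraNodesAsync below is a hypothetical placeholder for whatever bootstrap logic you already use to seed extra nodes, not a MonoTorrent API.

            // Sketch of option 1: construct and asynchronously initialise the DhtEngine
            // up front, then let the factory capture that single instance.
            var dhtEngine = new MonoTorrent.Dht.DhtEngine ();

            // Hypothetical helper: seed the engine with your extra bootstrap nodes here.
            await BootstrapExtraNodesAsync (dhtEngine);

            var settingBuilder = new EngineSettingsBuilder ();
            var factories = Factories.Default.WithDhtCreator (() => dhtEngine);
            using var engine = new ClientEngine (settingBuilder.ToSettings (), factories);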

alanmcgovern commented 3 weeks ago

I've got several thousand InfoHashes and I am writing something to download the metadata and put it into a database.

I assume you're implementing this using ClientEngine.DownloadMetadataAsync so you can fetch the metadata for an infohash without downloading the whole torrent. If so, I'd love to know how it performs. I recently reimplemented this support so it fetches metadata the same way a regular torrent downloads the final few pieces, so it should be reasonably OK performance-wise. However, I've no idea where the bottlenecks would be if someone ran it hundreds of times :)
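For reference, a minimal call might look roughly like the sketch below (not from the thread; the exact return type of DownloadMetadataAsync differs between MonoTorrent versions, so treat this as a sketch rather than a definitive usage).

            // Minimal sketch: fetch the metadata for a single infohash via a magnet link.
            var magnetLink = MagnetLink.Parse ("magnet:?xt=urn:btih:<infohash>");
            var metadata = await engine.DownloadMetadataAsync (magnetLink, CancellationToken.None);
            // 'metadata' is the raw bencoded info dictionary; persist it to the database here.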

mattheys commented 3 weeks ago

That's brilliant, thank you. I thought it would have something to do with the factories but couldn't work out exactly what to do.

Yes, I'm using ClientEngine.DownloadMetadataAsync. I'm using an in-memory MassTransit hub with a queue depth of 128 to queue up the requests against a singleton ClientEngine. For each job I'm also concurrently waiting on a 60-second Task.Delay and sending the cancellation signal if it times out, so I don't clog up the queue with dead torrents.
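A minimal sketch of that timeout behaviour (illustrative, not from the thread): a CancellationTokenSource constructed with a 60-second timeout has the same effect as racing a Task.Delay and cancelling manually.

            // Sketch: time out a single metadata download after 60 seconds so a dead
            // torrent doesn't hold a queue slot indefinitely.
            using var cts = new CancellationTokenSource (TimeSpan.FromSeconds (60));
            try {
                var metadata = await engine.DownloadMetadataAsync (magnetLink, cts.Token);
                // persist metadata to the database
            } catch (OperationCanceledException) {
                // timed out; skip this infohash (or re-queue it for a later retry)
            }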

I'm still getting quite a lot of timeouts, but I think I'm getting more successes now than without Dht working. I'll have to play with the queue depth and timeout a bit to find an optimal balance; I'm not sure if I'm producing more requests than the ClientEngine can handle within a specific window.

With the current settings I've got 150 successes in 10 minutes. I'll start by lowering the queue depth to 64 and see if that changes anything, then try increasing the timeout to 120 seconds.

alanmcgovern commented 3 weeks ago

One issue with DownloadMetadataAsync is that it's not trivial to get observability on what's happening under the hood:

Was the infohash successfully found in the DHT table? If so, did it find peers? If so, were they connectable? Did they support fast peer extensions or not? Etc.

This would be relevant information to understand if this is as expected or if there's a bug.

That said, could you confirm what you've configured as the max connections for the engine, and whether the engine is saturating that value? If so, you might see fewer timeouts if you increased the total number of connections or decreased concurrency.
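For reference, a hedged sketch of where that limit lives, assuming MonoTorrent's EngineSettingsBuilder (the value is illustrative, not a recommendation):

            // Sketch: raise the engine-wide connection limit before building the settings.
            var settingBuilder = new EngineSettingsBuilder {
                MaximumConnections = 600,   // example value; tune against your concurrency
            };
            using var engine = new ClientEngine (settingBuilder.ToSettings (), factories);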

mattheys commented 2 weeks ago

I had MaximumConnections set to 300. I also switched to RabbitMQ and split this out into a separate console app so I could run multiple instances at once; I just spun up new instances until the CPU was nearly 100% utilised. I've managed to process 50,000 so far with 35,000 to go; however, the last 35k might be unseeded for all I know, and I think they have already been skipped once. I'm happy with how this is working anyway.