AlphaReign / scraper

AlphaReigns DHT Scraper, includes peer updater and categorizer
MIT License

Rewrite #55

Open Raxvis opened 5 years ago

Raxvis commented 5 years ago

This issue thread will be used to keep everyone apprised of the rewrite taking place.

Raxvis commented 5 years ago

@ash121121 @milezz

I have run into trouble with MySQL, as the database isn't fast enough. On top of that, the torrent scraper (used to fetch metadata) appears to be broken in the new rewrite.

I have been trying to find a way around both problems and have worked through a couple of iterations with no success. I am now on a third iteration that I hope will go better.

The new rewrite should fix a lot of the issues you guys are seeing with the tracker and scraper not keeping up.

ghost commented 5 years ago

I agree MySQL isn't that great for what we're running here, although I was able to get some queries down to milliseconds using indexes. Bear in mind that's only with a database of 3 million. Query times would greatly increase as we reach 20+ million. Could you tell us what the two iterations you have tried were, and what your third is? I'm interested to see if we can provide any ideas.

Kind regards

Raxvis commented 5 years ago

Both iterations were based on separating out the tracker, scraper, and torrent lookup (the first one was with MySQL and the second one was with Redis). This next iteration is going to isolate the individual actions but run them in a single process that has access to the DHT server and the DHT nodes (for scraping metadata).

ghost commented 5 years ago

What data store do you plan to use now, still Redis? Have you looked at MongoDB? I have seen another DHT scraper on GitHub using it.

Raxvis commented 5 years ago

Redis and ElasticSearch are the two that I will probably be using.

Redis for the peer / node information and ElasticSearch for the torrent information.
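
As a rough illustration of that split, the sketch below routes short-lived peer records to a key-value store and torrent metadata to a searchable document store. Both stores are in-memory stand-ins built on plain `Map`s, not real Redis or ElasticSearch clients, and every name here (`PeerStore`, `TorrentIndex`, the `peers:<infohash>` key shape, the 30-minute TTL, the placeholder infohash and torrent name) is an assumption made for the example rather than something taken from the rewrite.

```javascript
// Stand-in for Redis: peers keyed by infohash, each with an expiry time.
class PeerStore {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.sets = new Map(); // key -> Map(peer -> expiresAt)
  }

  add(infohash, peer, now = Date.now()) {
    const key = `peers:${infohash}`;
    if (!this.sets.has(key)) this.sets.set(key, new Map());
    this.sets.get(key).set(peer, now + this.ttlMs);
  }

  // Return only the peers whose TTL has not yet elapsed.
  list(infohash, now = Date.now()) {
    const entries = this.sets.get(`peers:${infohash}`) || new Map();
    return [...entries].filter(([, expiresAt]) => expiresAt > now).map(([peer]) => peer);
  }
}

// Stand-in for ElasticSearch: one document per torrent, searchable by name.
class TorrentIndex {
  constructor() {
    this.docs = new Map();
  }

  index(infohash, doc) {
    this.docs.set(infohash, { infohash, ...doc });
  }

  search(term) {
    const needle = term.toLowerCase();
    return [...this.docs.values()].filter((doc) => doc.name.toLowerCase().includes(needle));
  }
}

const peers = new PeerStore(30 * 60 * 1000); // hypothetical 30-minute TTL
const torrents = new TorrentIndex();
const hash = '0123456789abcdef0123456789abcdef01234567'; // placeholder infohash

peers.add(hash, '203.0.113.7:6881', 0);
torrents.index(hash, { name: 'Example Linux ISO', length: 1024 });
```

The point of the separation is that peer data is written constantly and expires quickly, while torrent documents are written once and then searched, so the two workloads can be tuned independently.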

Raxvis commented 5 years ago

Just an update: I have completely rewritten the DHT Server portion and put it into its own package here: https://github.com/AlphaReign/dht-server

This is a standalone DHT Server that will work as the backbone of our scraper, but will also allow us to query the DHT network for peer information so that we can download. With this being done, I can set up the initial code to just keep looking for peers and getting torrent announcements without having it tied directly into the scraper.
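
For context, asking the DHT network for peers comes down to sending a bencoded `get_peers` KRPC query over UDP, as described in BEP 5. The sketch below only builds that message. It is not taken from dht-server: the small `bencode` helper and the `getPeersQuery` function are written just for this example, and the node ID and infohash are the placeholder values from the BEP 5 example.

```javascript
// Minimal bencode encoder: byte strings, integers, lists, and dictionaries.
function bencode(value) {
  if (Buffer.isBuffer(value)) {
    return Buffer.concat([Buffer.from(`${value.length}:`), value]);
  }
  if (typeof value === 'string') return bencode(Buffer.from(value));
  if (Number.isInteger(value)) return Buffer.from(`i${value}e`);
  if (Array.isArray(value)) {
    const items = value.map((item) => bencode(item));
    return Buffer.concat([Buffer.from('l'), ...items, Buffer.from('e')]);
  }
  // Dictionaries: keys are emitted in sorted order (ASCII keys assumed here).
  const parts = Object.keys(value)
    .sort()
    .map((key) => Buffer.concat([bencode(key), bencode(value[key])]));
  return Buffer.concat([Buffer.from('d'), ...parts, Buffer.from('e')]);
}

// Build a KRPC get_peers query: t = transaction ID, y = 'q' (query),
// q = method name, a = arguments (our node ID and the wanted infohash).
function getPeersQuery(transactionId, nodeId, infoHash) {
  return bencode({
    t: transactionId,
    y: 'q',
    q: 'get_peers',
    a: { id: nodeId, info_hash: infoHash },
  });
}

const message = getPeersQuery(
  'aa',
  Buffer.from('abcdefghij0123456789'), // placeholder 20-byte node ID
  Buffer.from('mnopqrstuvwxyz123456'), // placeholder 20-byte infohash
);

console.log(message.toString('latin1'));
// d1:ad2:id20:abcdefghij01234567899:info_hash20:mnopqrstuvwxyz123456e1:q9:get_peers1:t2:aa1:y1:qe
```

In a real crawler this buffer would be sent over UDP (for example with Node's `dgram` module) to a known node, and the response would contain either peers for the infohash or closer nodes to ask next.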

Raxvis commented 5 years ago

You can check out this branch here: https://github.com/AlphaReign/scraper/tree/split-fix and run:

to watch it find torrents.

ghost commented 5 years ago

Thanks will check this out today :)

milezzz commented 5 years ago

Awesome work!

ghost commented 5 years ago

node ./src/index.js
module.js:550
    throw err;
    ^

Error: Cannot find module 'dht-server'
    at Function.Module._resolveFilename (module.js:548:15)
    at Function.Module._load (module.js:475:25)
    at Module.require (module.js:597:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/root/newscraper/src/index.js:1:75)
    at Module._compile (module.js:653:30)
    at Object.Module._extensions..js (module.js:664:10)
    at Module.load (module.js:566:32)
    at tryModuleLoad (module.js:506:12)
    at Function.Module._load (module.js:498:3)

@Prefinem

ghost commented 5 years ago

Never mind, I installed dht-server and bencode.

milezzz commented 5 years ago

seems to be working:


onGetPeersQuery - new torrent: 8eff86639946d68f2cea7485c59a3790794f78b9
onGetPeersQuery - new torrent: ef719bfbe716bd970afb4e269eab5ccb8fc1b3f2
total nodes 2000
onGetPeersQuery - new torrent: fc9b2d35164542b5704cef777b3b2560fe485cf9
onGetPeersQuery - new torrent: ad4f9ce5aa00943c01da3fd551250bd367729a7a
onGetPeersQuery - new torrent: 1224b03c763dafedae76d1a2dfb16a0396c90e72

jangrewe commented 5 years ago

If one were running the current scraper, is the dht-server a fully working replacement (feature wise, at least), or just a PoC for now?

Raxvis commented 5 years ago

Not currently. The end goal of this project is to have a working dht-server in its own package. There are a few currently out there on NPM, but I have found most of them aren't suitable for a scraper, so I had planned on taking the pieces I have right now and finishing up with a full-fledged one.

The majority of the DHT server is here: https://github.com/AlphaReign/scraper/blob/master/src/crawler.js

What it mainly lacks is hooks for each method, and public methods for the external hooks. A good data backend is also required for performance. I had tested MongoDB but it couldn't perform under the load. Same with SQLite. My next stop will be Redis, or another in-memory cache. This is actually a large reason the other dht-servers don't work. Most of them a) don't maintain enough nodes and b) are slow to respond. This scraper works by being on the peer lists of thousands, if not tens of thousands, of nodes so that it receives their announcements.
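
To make the "hooks for each method" idea a bit more concrete, here is one possible shape for it: each incoming KRPC query type (`ping`, `find_node`, `get_peers`, `announce_peer`) gets a list of callbacks that external code can register through a public `on` method. The `HookedServer` class and its `on` / `handleQuery` methods are hypothetical and are not the dht-server API. The `get_peers` hook simply mirrors the "new torrent" log lines posted earlier in this thread, and the remote addresses are made up.

```javascript
// Hypothetical sketch: per-method hooks on a DHT server.
const QUERY_METHODS = ['ping', 'find_node', 'get_peers', 'announce_peer'];

class HookedServer {
  constructor() {
    // One callback list per KRPC query method.
    this.hooks = new Map(QUERY_METHODS.map((method) => [method, []]));
  }

  // Public method that external code uses to attach a hook.
  on(method, callback) {
    if (!this.hooks.has(method)) throw new Error(`Unknown query method: ${method}`);
    this.hooks.get(method).push(callback);
    return this;
  }

  // Would be called by the (omitted) UDP layer for every decoded incoming query.
  // Returns how many hooks were run.
  handleQuery(method, args, remote) {
    const callbacks = this.hooks.get(method) || [];
    callbacks.forEach((callback) => callback(args, remote));
    return callbacks.length;
  }
}

// Example hook: log each infohash the first time another node asks about it.
const seen = new Set();
const server = new HookedServer();

server.on('get_peers', (args) => {
  const infohash = args.info_hash.toString('hex');
  if (seen.has(infohash)) return;
  seen.add(infohash);
  console.log(`onGetPeersQuery - new torrent: ${infohash}`);
});

// Simulate the same query arriving from two different nodes.
const infoHash = Buffer.from('8eff86639946d68f2cea7485c59a3790794f78b9', 'hex');
server.handleQuery('get_peers', { info_hash: infoHash }, { address: '203.0.113.7', port: 6881 });
server.handleQuery('get_peers', { info_hash: infoHash }, { address: '203.0.113.8', port: 6881 });
```

Keeping the hooks outside the server is what would let the scraper, tracker, and storage backend subscribe to DHT traffic without being tied into the server itself.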

Ideally the project would be broken down into a) a dht-server able to support 100K+ nodes, b) a tracker (such as opentracker) that also helps maintain a list of torrents, and c) an API for torrent information / searching.

That, or another idea I have had in mind is to set up AlphaReign nodes that are dht-servers but support a second protocol to share torrent information between each of the AlphaReign nodes, so that everyone using the AlphaReign scraper can help share the torrent information.