AlphaReign / scraper

AlphaReigns DHT Scraper, includes peer updater and categorizer
MIT License

Add ElasticSearch datastore #23

Closed Raxvis closed 5 years ago

ghost commented 6 years ago

Are you doing a big update, mate? I see Elastic disappeared. Can you tell me what you're up to? Kind regards

Raxvis commented 6 years ago

Yup. Did a huge rewrite for easier maintenance in the future along with not having to statically build out the system for updates.

On top of that, AlphaReign is going to be pivoting into a standalone system in the future hopefully.

New Goal (still fleshing it out)

The goal of AlphaReign is to create a piece of software that builds a network on top of the BitTorrent DHT Network. This extra network layer is meant to negotiate the list of torrents that exist in the DHT Network. On top of that, it will provide a way to search through torrents.

ghost commented 6 years ago

Great job, I'm looking forward to it :) Will you also be updating your PHP front end to work with the new changes? I'm only asking because I've built on top of your PHP front end. Kind regards

Raxvis commented 6 years ago

I will actually be writing a stand-alone front end for it in Electron, which will be written in JavaScript.

ghost commented 6 years ago

Great looking forward to it :)

ghost commented 6 years ago

May I ask, do you have a rough idea when Elasticsearch will be added to the scraper? Kind regards

Raxvis commented 6 years ago

Not sure yet. I am writing the scraper from the ground up and it's a little tedious

ghost commented 5 years ago

Yeah, I bet it is. Any update on Elastic, mate?

Kind regards

Raxvis commented 5 years ago

Sorry for taking so long to get you feedback.

I had some other work come up that I have been taking care of. I hope to get back and finish this up soon.

Raxvis commented 5 years ago

@ash121121

Sorry for taking so long on the updates. Right now I am going through some final updates and testing on the crawler to ensure that the code maintains the crawling capabilities.

The good news is that torrents will be stored in a SQLite database for persistence, so if you have to clear out Elasticsearch, you don't lose all your torrents. This will also make it easier to force updates, query information, add additional metadata (votes, etc.) and in general make it more usable in the long term.

Are you using the php site or did you write your own site? Because if you are using the www, that will be the next thing to be re-written and I would like to get feedback and input for your use case.

WarezAddict-com commented 5 years ago

I was using the old crawler (sending data to elastic) and the slim php frontend on warezaddict.com for a little while. I don't have anything running now tho!

I tried the new rewrite crawler a few times. I only got it working once and always received errors every time after that. It was nothing major... I just never had time to use my 1337 Google-Fu Skillz and fix it. I really liked using the Slim PHP (micro) framework. It seemed fast and responsive to me, but I'm not up with the bleeding edge tech and full stack etc. stuff. I am no coder/dev and I don't really know any programming languages. I can read code (PHP/JS) and tell what is going on, but I can't write a program myself. I do well to fix small things here and there or hack around on something simple like this PHP front end. I know PHP is "out" nowadays and people are going to JavaScript (async, non-blocking? I guess?).

But anyways, hopefully someone else will be more helpful and give better feedback. I wish I could help... I'd love to code a new Vue + cool datatables frontend for the AlphaReign project.



ghost commented 5 years ago

Hey mate, no worries. I'm using your PHP code for Skytorrents.lol and so far no issues. It would be nice if the new crawler could still be used with your old PHP code. As I believe you're coding it in JS now?

Raxvis commented 5 years ago

@ash121121 The crawler should still be compatible with the old PHP code since the data will still go to Elasticsearch.

@WarezAddict-com I don't know if I will do a dynamic front end with an API or just a static front end with a traditional web server. I guess an API backend would make it easier for others to consume.

ghost commented 5 years ago

@Prefinem thanks for the info. So in terms of performance, will that be degraded by using SQLite and Elastic together? Would this mean double the disk space usage?

Raxvis commented 5 years ago

SQLite is very performant, so we shouldn't see any decrease in performance. In fact, by using SQLite, I will be able to make better updates to Elasticsearch, which will mean running a smaller Elasticsearch instance for the same performance. On top of that, the ability to rebuild the entire Elasticsearch instance in an hour or two for all torrents will ensure that if you have to upgrade Elasticsearch, it will be much easier.

As for the disk space, there will be an increase in that since we will have a database file to account for. That being said, it shouldn't double. Once I get this testing done, I will run it for a week or so to see how large the file gets so I have a better estimation for you.
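
To illustrate the rebuild idea, here is a minimal sketch of turning rows read from SQLite into a request body for the Elasticsearch bulk API (newline-delimited JSON: one action line, then one document line). The row shape (`infohash`, `name`, `length`), the index name `torrents`, and the helper names are assumptions for illustration, not the rewrite's actual schema or code. Older Elasticsearch versions may also expect a `_type` in the action line.

```javascript
// Hypothetical row shape; the real SQLite schema in the rewrite may differ.
const exampleRows = [
  { infohash: 'aaaa1111', name: 'Example torrent A', length: 1024 },
  { infohash: 'bbbb2222', name: 'Example torrent B', length: 2048 },
];

// Builds a bulk API body: an "index" action line followed by the document,
// one JSON object per line, with a trailing newline as the bulk API requires.
function toBulkBody(rows, indexName) {
  const lines = [];
  for (const row of rows) {
    lines.push(JSON.stringify({ index: { _index: indexName, _id: row.infohash } }));
    lines.push(JSON.stringify(row));
  }
  return lines.length > 0 ? lines.join('\n') + '\n' : '';
}

// Splits a large row set into batches so each bulk request stays small.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Each batch body would be POSTed to the cluster's _bulk endpoint.
const exampleBodies = chunk(exampleRows, 1000).map((batch) => toBulkBody(batch, 'torrents'));
```

Using the infohash as the document `_id` means re-running the rebuild overwrites existing documents instead of creating duplicates.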

ghost commented 5 years ago

Ah, I see. Thanks for clearing that up, looking forward to a release. Kind regards

ghost commented 5 years ago

Hey mate, do you have any plans for filters for bad torrents, like removing underage material? And what are your thoughts on a way to verify that a torrent isn't fake?

Kind regards

Raxvis commented 5 years ago

Torrents being fake is a hard one. Ideally you would have people report a torrent and then, with enough reports, you would remove it.

As for the bad torrents, I am adding a configurable filter list that will be checked against every file in a torrent, and the torrent will be removed if any file matches the list. I actually pulled your list of bad words into it initially. You can view it here: https://github.com/AlphaReign/scraper/blob/rewrite/config/filters.js

If you have anything else that you would like to see added, just let me know.
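
As a rough sketch of how a filter list like this could be applied, assuming the list is a plain array of lowercase terms and each torrent carries a `files` array of `{ path }` objects (both are assumptions; the real `config/filters.js` and torrent records may be structured differently):

```javascript
// Hypothetical filter terms; the real list lives in config/filters.js.
const exampleFilters = ['blocked-term', 'another-blocked-term'];

// Returns true when any file path in the torrent contains a filtered term.
// The comparison is case-insensitive; torrents without a file list never match.
function matchesFilter(torrent, terms) {
  const files = torrent.files || [];
  return files.some((file) => {
    const path = String(file.path).toLowerCase();
    return terms.some((term) => path.includes(term.toLowerCase()));
  });
}

// Keeps only the torrents that match none of the filtered terms.
function removeFiltered(torrents, terms) {
  return torrents.filter((torrent) => !matchesFilter(torrent, terms));
}
```

Substring matching is simple but can produce false positives on harmless names that happen to contain a term, so a real list would probably need some tuning.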

Also, just an FYI, I let the new system run over the weekend and have 1.3 million torrents in the database already. The current database size is 5.3 GB so far. It should be ready to publish in a few days once I hook Elasticsearch up to the SQLite database.
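
For a rough feel of what those numbers imply, the figures above work out to about 4 KB of database space per torrent. The sketch below does that arithmetic, assuming decimal gigabytes and that the database keeps growing linearly with torrent count (both are assumptions, and the function names are illustrative only):

```javascript
// Assumption: 1 GB = 10^9 bytes.
const BYTES_PER_GB = 1e9;

// Average on-disk bytes per stored torrent.
function bytesPerTorrent(databaseGb, torrentCount) {
  return (databaseGb * BYTES_PER_GB) / torrentCount;
}

// Projected database size in GB for a target torrent count,
// assuming the per-torrent average stays constant.
function projectedSizeGb(databaseGb, torrentCount, targetCount) {
  return (bytesPerTorrent(databaseGb, torrentCount) * targetCount) / BYTES_PER_GB;
}

// Figures from the comment above: 5.3 GB for 1.3 million torrents.
const averageBytes = bytesPerTorrent(5.3, 1.3e6);
console.log(`about ${Math.round(averageBytes)} bytes per torrent`);
console.log(`10 million torrents: about ${projectedSizeGb(5.3, 1.3e6, 10e6).toFixed(1)} GB`);
```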

Raxvis commented 5 years ago

@everyone

The rewrite has been merged into master. I have been running it for a week now without any issues besides some minor bugs that have been fixed. Please give it a test and create an issue for any problems you have with it. You will need a database installed. Please give the README a read through and post any questions as issues.