AlphaReign / scraper

AlphaReigns DHT Scraper, includes peer updater and categorizer
MIT License
127 stars 35 forks

New release #32

Closed — ghost closed this 5 years ago

ghost commented 6 years ago

Hey, just noticed you made the release live. So am I right in thinking I can just migrate from the old scraper to the new one with your PHP front end, and use my existing Elasticsearch DB? Kind regards

Raxvis commented 6 years ago

That should be correct, with the exception of categories. I think I merged categories into tags, so now you only need to filter by tags.

ghost commented 6 years ago

Thanks. I tried to follow your install instructions but failed. "yarn global install pm2" doesn't seem to work, but "yarn global add pm2" does.

And then I got to "yarn migrate" and got this error:

    yarn migrate
    yarn run v1.10.1
    $ ./node_modules/.bin/knex migrate:latest
    /bin/sh: ./node_modules/.bin/knex: No such file or directory
    error Command failed with exit code 127.
    info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Raxvis commented 6 years ago

I updated the docs to use yarn global add pm2.

Did you run just yarn inside the folder first?

ghost commented 6 years ago

I ran yarn migrate inside the scraper folder

Raxvis commented 6 years ago

You will need to run yarn before yarn migrate

ghost commented 6 years ago

I'll give it a go. I've never used yarn before, and I didn't see in the docs that you had to run yarn first.

Raxvis commented 6 years ago

Ah! Good catch. Sorry about that. It's something I do a lot, so I didn't even think about it. I added it to the docs.

ghost commented 6 years ago

No worries :) So what do you recommend for my site using your scraper? Currently I have 17 million torrents, and I use elasticdump to make dumps every now and then. Should I use SQLite or MySQL?

Kind regards

Raxvis commented 6 years ago

You definitely want MySQL. SQLite will be slower than MySQL, plus there are lots of tools for backing up MySQL.

ghost commented 6 years ago

Thanks, going to try it now. Would there be any performance decrease using this new version over the old one? Using MySQL and Elasticsearch, wouldn't there be more overhead?

Raxvis commented 6 years ago

There will be more performance overhead because MySQL will be running, but we don't update Elasticsearch as much, which means Elasticsearch will run better.

Raxvis commented 6 years ago

Just a heads up, I just made a change to ensure that torrents get updated when they get scraped from the tracker. You will want to pull down the latest version

ghost commented 6 years ago

So the scraper feeds the torrents to MySQL, and then MySQL to Elasticsearch?

ghost commented 6 years ago

So it looks like I'm stuck on yarn migrate. Here is the error:

    yarn migrate
    yarn run v1.10.1
    $ ./node_modules/.bin/knex migrate:latest
    /root/scraper/migrations/20180816161002_init.js:1
    (function (exports, require, module, __filename, __dirname) { exports.up = async (knex) => {
                                                                                            ^
    SyntaxError: Unexpected token (
        at createScript (vm.js:56:10)
        at Object.runInThisContext (vm.js:97:10)
        at Module._compile (module.js:549:28)
        at Object.Module._extensions..js (module.js:586:10)
        at Module.load (module.js:494:32)
        at tryModuleLoad (module.js:453:12)
        at Function.Module._load (module.js:445:3)
        at Module.require (module.js:504:17)
        at require (internal/module.js:20:19)
        at /root/scraper/node_modules/knex/lib/migrate/index.js:92:25
        at arrayFilter (/root/scraper/node_modules/lodash/lodash.js:582:11)
        at filter (/root/scraper/node_modules/lodash/lodash.js:9173:14)
        at /root/scraper/node_modules/knex/lib/migrate/index.js:91:108
        at tryCatcher (/root/scraper/node_modules/bluebird/js/release/util.js:16:23)
        at Promise._settlePromiseFromHandler (/root/scraper/node_modules/bluebird/js/release/promise.js:509:35)
        at Promise._settlePromise (/root/scraper/node_modules/bluebird/js/release/promise.js:569:18)
    error Command failed with exit code 1.
    info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

ghost commented 6 years ago

Also, just to make you aware, your docs state to update config.json, but this file doesn't exist. I think you meant config/index.js?
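For anyone following along, the file in question is config/index.js. A hypothetical sketch of its shape is below — the key names here are purely illustrative, not the repo's actual ones, so edit the values already present in the file itself:

```javascript
// Hypothetical sketch only: the real config/index.js in the repo defines
// its own keys, so change that file's existing values rather than pasting this.
module.exports = {
    mysql: {
        host: '127.0.0.1',
        user: 'alphareign',
        password: 'changeme',
        database: 'alphareign',
    },
    elasticsearch: {
        host: 'http://127.0.0.1:9200',
        index: 'torrents',
    },
};
```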

Raxvis commented 6 years ago

> So the scraper feeds the torrents to MySQL, and then MySQL to Elasticsearch?

Yes

> So it looks like I'm stuck on yarn migrate. Here is the error.

I will look into it

> Also, just to make you aware, your docs state to update config.json, but this file doesn't exist. I think you meant config/index.js?

Fixing now

ghost commented 6 years ago

Thanks mate, look forward to a fix

ghost commented 6 years ago

Hey mate, any update?

Raxvis commented 6 years ago

@ash121121 can you give me the output of running node -v in your terminal?

ghost commented 6 years ago

I destroyed the server so will set it up again today and send it to you

Raxvis commented 6 years ago

Okay, thanks.

ghost commented 6 years ago

Just for your information, I used yum install nodejs on CentOS 7.4

Raxvis commented 6 years ago

Okay, good to know.

ghost commented 6 years ago

So I did a fresh install and got the same error. I ran that command, and the version is below.

    [root@scw-966035 scraper]# node -v
    v6.14.3

Raxvis commented 6 years ago

Ah... I have been testing on Node 8. I think that might be the issue. Can you upgrade Node and give it a test?
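For context, the line the parser rejects is an async arrow function assigned to exports, the style knex migrations use, which Node 6 cannot parse. A minimal migration in that style is sketched below — the torrents table and its columns are illustrative assumptions, not the repo's actual schema:

```javascript
// Knex-style migration written with async arrow functions. Node 6 throws
// "SyntaxError: Unexpected token (" on this syntax; Node 8 parses it fine.
// The table name and columns here are assumptions for illustration.
const migration = {
    up: async (knex) => {
        await knex.schema.createTable('torrents', (table) => {
            table.string('infohash').primary();
            table.integer('seeders');
            table.integer('leechers');
            table.timestamp('TrackerUpdated');
        });
    },
    down: async (knex) => knex.schema.dropTable('torrents'),
};
// In a real migration file these would be assigned to exports.up / exports.down.
```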

ghost commented 6 years ago

Upgraded and got past yarn migrate. I'll let it run for a little while and let you know. Could you update the README to specify Node 8, please?

Thanks

ghost commented 6 years ago

So does this mean it's all good?

    scraper > Total Torrents: 32392
    scraper > Torrents without Tracker: 11
    scraper > Torrents not in Search: 496
    scraper > Total Torrents: 32485

Raxvis commented 6 years ago

I will update the README.md.

And yes, that is good. You have 32,392 torrents total. Only 11 are without seeders/leechers information, and 496 are not in Elasticsearch.

ghost commented 6 years ago

Great, thanks for your help :) So the torrents without seeder info — will these get updated? And the torrents not in Elasticsearch — will these be pushed to Elasticsearch?

Thanks

Raxvis commented 6 years ago

Yes. Those are just counts since they are run on different processes. This way you can keep track of stuff that hasn't been scraped or pushed to search.

Just more of a notice than anything. If you see the numbers for either torrents without tracker or torrents not in search keep increasing over a long period of time, just let me know. They should be able to keep up, but some configuration might have to be tweaked.

ghost commented 6 years ago

Thanks for the info :) Just a quick question. I'm planning to replace the old scraper with the new release, using your PHP front end. I know you mentioned that categories are now tags? Do you think the new scraper can still update peer info for the torrents in my DB created by the old scraper?

If there's anything else I should know before I do the switch, please let me know.

Kind regards

ghost commented 6 years ago

I have attached an image showing the side-by-side difference in the mappings for a hash ID. I think it needs a little work to be compatible with the PHP front end. I notice timestamps are no longer in Unix format, Peers_Updated has changed to TrackerUpdated, and the front end shows the file sizes but not the files themselves.

Kind regards

(screenshot: screencapture-diffchecker-diff-2018-10-19-10_28_44)
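On the timestamp point: converting the new ISO-style timestamps back to Unix epoch seconds is a one-liner, shown here in Node for reference (PHP's strtotime does the same job on the front-end side):

```javascript
// Convert an ISO-8601 timestamp (as stored by the new scraper) back to the
// Unix epoch seconds the old front end expects.
const toUnixSeconds = (iso) => Math.floor(new Date(iso).getTime() / 1000);
```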

ghost commented 6 years ago

The timestamps I can convert to Unix in PHP, but could you assist on how to manage my current database with the new scraper, as peers_updated has changed to TrackerUpdated? I'm guessing we could add an OR statement within the if statement in the JS?

Kind regards
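The OR check suggested above could look something like this sketch — getTrackerUpdated is a hypothetical helper name, not anything in the repo:

```javascript
// Prefer the new TrackerUpdated field, falling back to the legacy
// peers_updated field for documents written by the old scraper.
const getTrackerUpdated = (doc) =>
    doc.TrackerUpdated !== undefined ? doc.TrackerUpdated : doc.peers_updated;
```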

Raxvis commented 6 years ago

Thanks for this diff. I didn't have one before and thought I had everything matched up.

I will fix the path issues (strings instead of arrays)

What is peers_updated being used for?

ghost commented 6 years ago

Thanks mate. peers_updated was used by the old scraper to check when the seeder/leecher info was last updated. It was in Unix format, but it looks like in the new scraper it's now called TrackerUpdated.

Raxvis commented 6 years ago

That is correct. Is it being used in the PHP code anywhere?

ghost commented 6 years ago

No, it's not, but as I'm trying to replace the old scraper with the new one, I wasn't sure how I would update seeder/peer info for all 17 million torrents that have the peers_updated field :D

ghost commented 6 years ago

The new scraper seems great if you're starting a new index. I was just thinking of people that have already built an index with your old scraper, and how, when using your new scraper, they could still get their torrents updated.

Raxvis commented 6 years ago

Gotcha. I will put in a ticket to write an elasticsearch to db importer so people can update their database with all their current torrents
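Such an importer could be sketched as a scroll-and-upsert loop. Everything in the sketch below (including the function names) is an assumption, not the ticketed implementation; the search/scroll/upsert functions are injected so the control flow isn't tied to a specific client:

```javascript
// Hypothetical shape of an elasticsearch-to-db importer: scroll every
// document out of the index and upsert each one into the database.
// `search`, `scroll`, and `upsert` are injected; a real version would wrap
// an elasticsearch client and a knex/MySQL connection.
const importAll = async ({ search, scroll, upsert }) => {
    let page = await search({ index: 'torrents', scroll: '1m', size: 1000 });
    let count = 0;
    while (page.hits.length > 0) {
        for (const doc of page.hits) {
            await upsert(doc); // insert-or-update keyed on infohash
            count += 1;
        }
        page = await scroll(page.scrollId);
    }
    return count;
};
```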

ghost commented 6 years ago

Thanks mate. Could you also include votes and flags in the importer, as votes are stored in Elasticsearch too? Is there anything you need me to test? Kind regards

Raxvis commented 6 years ago

Will do!

And I think that is all right now.

Actually, is there a way I could get a backup of your elasticsearch cluster?

ghost commented 6 years ago

Yes, I'll make a dump now and gzip it up for you

ghost commented 6 years ago

Are you familiar with exporting and importing with Elasticsearch? Below are the commands if not:

    npm install elasticdump -g

    elasticdump \
        --input=http://127.0.0.1:9200/torrents \
        --output=/home/torrents.json \
        --type=data
    elasticdump \
        --input=/home/torrents.json \
        --output=http://127.0.0.1:9200/torrents \
        --type=data

The first is for export, the second for import.

ghost commented 6 years ago

Do you have an email where I could post the dump?

Raxvis commented 6 years ago

You can send it to prefinem@gmail.com

ghost commented 6 years ago

All sent, mate. It's 4GB packed and 20GB unpacked. Let me know when you've downloaded it so I can remove the link.

Raxvis commented 6 years ago

Downloading now. I will let you know when it's done. Thanks!

ghost commented 6 years ago

If you want to download it via terminal on a server, let me know and I'll remove the Cloudflare protection for that.

Raxvis commented 6 years ago

I am just downloading it to my development laptop so no worries.

ghost commented 6 years ago

Hey @Prefinem, how are we looking on the progress? :)