Closed extreme4all closed 3 years ago
Hey there - sorry about the data flow mess. I'll work on cleaning that up so it's much more readable.
"OSRS_KNN_V1" - This pickled file is the KNN classifier, you can use:
import pickle
with open("OSRS_KNN_V1", "rb") as f:
    osrsknn = pickle.load(f)
# reshape(1, -1) turns one feature vector into a single-row 2-D array
osrsknn_predict = osrsknn.predict(PLAYER_IND[2].reshape(1, -1))
print(osrsknn_predict)
to use the classifier.
"ykmfile" - These are the labels produced by KMeans (n_clusters = 300). They are in the same order as the input data and correspond to the player names in "pnamefile". Ex. ykmfile[8] is the group label of pnamefile[8].
"pnamefile" - These are the player names.
"PIfile" - This is the raw data. It needs to be reshaped with x = np.reshape(PIfile,(-1,78)) to produce an array of shape [ENTRIES,78], where ENTRIES is the total number of names added to the dataset and 78 is the number of features. This raw data has not been normalized or adjusted in any way.
"traindata" - This is just the section of the raw data used in the KNN classifier, and can largely be ignored.
I will focus on making a much more readable format shortly - which should help answer some of your questions.
the data you have are only player names?
FYI we are doing something very similar to: https://www.youtube.com/watch?v=Dk4Yahv2lek&list=PLX9loFun2zNkqwEk3abeMzZnVlT0YPxkp but in a cleaner way.
the data you have are only player names?
No, the data are the stats from the hiscores, located in "PIfile" :)
So if loaded in,
ykmfile = generated labels
PIfile.reshape(-1,78) = hiscore data
Pnames = names
So that:
ykm[4] is the label for player pname[4], with features PIfile[4]
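To make the index alignment concrete, here is a small dependency-free sketch. The toy values and the helper name to_rows are made up for illustration, and it uses 3 features per player instead of 78 to stay readable. Slicing a flat list into consecutive chunks like this gives the same rows that np.reshape(PIfile,(-1,78)) produces for a flat array.

```python
def to_rows(flat, n_features):
    """Split a flat list into consecutive rows of n_features values."""
    if len(flat) % n_features != 0:
        raise ValueError("flat length is not a multiple of n_features")
    return [flat[i:i + n_features] for i in range(0, len(flat), n_features)]

# Toy stand-ins for the three files (hypothetical values, 3 features not 78)
pnames = ["alice", "bob", "carol"]
ykm = [12, 7, 12]                      # KMeans group label per player
pi_flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # raw values, one player after another

features = to_rows(pi_flat, n_features=3)

# Same index => same player across all three sequences
for name, label, row in zip(pnames, ykm, features):
    print(name, label, row)
```

If the three sequences ever end up with different lengths, the alignment is broken, so that is worth checking right after loading.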
FYI we are doing something very similar to: https://www.youtube.com/watch?v=Dk4Yahv2lek&list=PLX9loFun2zNkqwEk3abeMzZnVlT0YPxkp but in a cleaner way.
Really cool project!
I don't know how effective the raw player stats are for detecting bots. With some data engineering, we could scrape the highscores every hour, every 6 hours, or every day to get the XP gains over time.
One hypothesis is that bots gain XP at a similar rate in one specific skill, whereas normal people gain XP across many skills.
also gathering labeled data will make it way easier :D
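A minimal sketch of the XP-gain idea above, assuming each scrape is stored as a per-skill XP dict. The skill names, the numbers, and the "top skill share" measure are all made up here to illustrate the hypothesis; none of them come from the project.

```python
def xp_gains(earlier, later):
    """Per-skill XP gained between two hiscore snapshots of one player."""
    return {skill: later[skill] - earlier.get(skill, 0) for skill in later}

def top_skill_share(gains):
    """Fraction of all gained XP that came from the single most-trained skill."""
    total = sum(gains.values())
    return max(gains.values()) / total if total > 0 else 0.0

# Hypothetical snapshots taken 6 hours apart
snap_0h = {"woodcutting": 1_000_000, "fishing": 500_000, "attack": 200_000}
snap_6h = {"woodcutting": 1_240_000, "fishing": 500_000, "attack": 200_000}

gains = xp_gains(snap_0h, snap_6h)
print(gains)
print(top_skill_share(gains))  # 1.0 here: all XP came from one skill
```

A share close to 1.0 with a steady gain per interval would match the "one skill, similar rate" pattern, though where to draw the line is something only real data could answer.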
I don't know how effective the raw player stats are for detecting bots. With some data engineering, we could scrape the highscores every hour, every 6 hours, or every day to get the XP gains over time.
One hypothesis is that bots gain XP at a similar rate in one specific skill, whereas normal people gain XP across many skills.
Yeah, I would love to scrape the hiscores every 6 hrs. Unfortunately there is a rate limit of 2-3 seconds per name, so 100K names could take about 69 hrs to scrape.
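For reference, the 69 hrs figure matches roughly 2.5 seconds per name. A quick back-of-the-envelope check, assuming names are looked up one after another with a fixed delay (the function name is just for this example):

```python
def scrape_hours(n_names, seconds_per_name):
    """Hours needed to look up n_names sequentially at a fixed delay."""
    return n_names * seconds_per_name / 3600

for delay in (2.0, 2.5, 3.0):
    print(f"{delay} s/name -> {scrape_hours(100_000, delay):.1f} h")
# 2.0 s/name -> 55.6 h
# 2.5 s/name -> 69.4 h
# 3.0 s/name -> 83.3 h
```

So even at the fast end of the limit, a full pass over 100K names takes more than two days, which rules out a 6-hour cycle for the whole dataset.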
also gathering labeled data will make it way easier :D
It would be, but unfortunately we don't know the labels, since we can't easily trust the accuracy of sent-in labels for individual players. So KMeans groups players on their stats and outputs labels for us. Those labels then go into the KNN classifier, which seems to work well so far at least.
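A rough sketch of the KMeans-labels-into-KNN idea described above, in case it helps to see it end to end. Everything here is a toy: the two-feature stats are made up, k is 2 rather than 300, and kmeans / knn_predict are tiny hand-rolled stand-ins written only for this example, not the project's actual code (which loads a pickled classifier). K-means is seeded with the first k rows to keep the run deterministic, which is a simplification.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points (a simplification)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centers) for p in points]

def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training rows closest to query."""
    ranked = sorted(zip(train_x, train_y), key=lambda xy: math.dist(query, xy[0]))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Made-up stats: two obvious groups of players
stats = [[1, 1], [2, 1], [1, 2], [90, 95], [95, 90], [92, 93]]
labels = kmeans(stats, k=2)                 # step 1: unsupervised group labels
print(labels)                               # [0, 0, 0, 1, 1, 1]
print(knn_predict(stats, labels, [3, 2]))   # step 2: classify a new player -> 0
```

In practice a library implementation would replace both helpers. The point is only that the cluster ids from the first step become the training labels for the second, so the classifier can only be as meaningful as the clusters are.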
It's a very tough situation due to the API ratelimit.
I have some experience with the API limits :) I used to scrape the entire OSRS GE. But first things first: some refactoring and a database would be beneficial. What data are we getting from the plugin? It would be nice if we had the following information from the plugin:
I suggest 2 endpoints: report_player & report_players. Both endpoints do inserts in the database, table:
The difference between report_player & report_players is that we would set a column in player_reports, Nearby_players, to True (1).
In the players table we keep track of when a player is created, banned, banned_date. A ban is detected if a player is removed from the high scores.
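A minimal sketch of the players table and the ban check, using an in-memory SQLite database. Only created, banned and banned_date come from the description above; the name column, the column types, the function name and the sample data are assumptions made for this example. It takes the "missing from the high scores means banned" heuristic at face value.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE players (
           name        TEXT PRIMARY KEY,
           created     TEXT NOT NULL,
           banned      INTEGER NOT NULL DEFAULT 0,
           banned_date TEXT
       )"""
)

def mark_missing_as_banned(conn, names_on_hiscores, today):
    """Flag every known, unbanned player that is absent from the latest scrape."""
    rows = conn.execute("SELECT name FROM players WHERE banned = 0 ORDER BY name")
    missing = [name for (name,) in rows if name not in names_on_hiscores]
    conn.executemany(
        "UPDATE players SET banned = 1, banned_date = ? WHERE name = ?",
        [(today, name) for name in missing],
    )
    conn.commit()
    return missing

# Hypothetical data: two known players, only one still on the high scores
conn.executemany(
    "INSERT INTO players (name, created) VALUES (?, ?)",
    [("alice", "2021-01-01"), ("bob", "2021-01-01")],
)
newly_banned = mark_missing_as_banned(conn, {"alice"}, date(2021, 1, 8).isoformat())
print(newly_banned)  # ['bob']
```

Storing banned_date alongside the flag keeps the history of when each player dropped off, which is what the later analysis would need.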
For highscores we need some tables Table: Highscores Columns:
Table: Highscores_latest (don't know if needed) Columns:
We would need routes to request data from the database. i suggest:
Can you set up the database side? I can set up the Flask API that will run on the server. What I have described should be the basis for a nice website that can display our best bot detector :D. Additionally, it should be the basis for our AI ideas.
AI workflow will be the following:
it might also be useful to have the plugin push a user token, so we can stop abuse?
Excellent suggestions. I will work this week and weekend to make the code much easier to read. We will also try to make it so that the reporting player's info is included, as well as the location of the found players. I can definitely set up the database side and properly reconfigure everything so that it is very clear and manageable. I have also recently set up a Flask app on a Linode server w/ gunicorn and nginx as a test for a switch from Google cloud app ==> Linode. However, my Flask app is very rudimentary, so changes are highly appreciated. I will let you know once the changes have been made, and when a database will become available - this will definitely assist in improving the workflow from this point onward.
As for the data we are getting from the plugin: Simply player names are being given at this time. Those names are then processed on our end to retrieve the OSRS Hiscore data values. Location and the reporting player were planned to be included in later updates, but we can shift the schedule to include these values earlier on.
the data is in json format?
For a minimum viable product, the location and the reporting player would be really good, combined with a website. It gives people something to show; with a bit of luck sir pugger will pick it up :D.
(recently I've got myself a VPS for my tools as well, but I'm not experienced in any of that Linux stuff :p )
maybe send me a message on twitter, so we can share a .env file, @3xtreme4all
Haha no. Embarrassingly, the data is in a text file format. I'm going to try and convert it all into a json format from now on. Also I'd be super excited to have a great looking website where you can look up statistics/etc. That would really be remarkable to add in the future!
Also don't worry - I don't know anything regarding linux. I just followed a tutorial on youtube (As with basically how I've done everything that I've done so far, youtube is the way to go)
Hey,
I fail to understand the dataflow. I see you have many pickle files; how are they generated?