andreiapostoae / dota2-predictor

Tool that predicts the outcome of a Dota 2 game using Machine Learning
MIT License
How to calculate new similarities_all.csv #15

Closed. CharlesLiuyx closed this issue 6 years ago

CharlesLiuyx commented 6 years ago

After processing the data and training on the 7.07 data, I don't know how to recalculate similarities_all.csv

P.S. I plotted the hero map successfully

Thanks a lot!

andreiapostoae commented 6 years ago

Oh, that's awkward. It seems like I forgot to add the similarity recalculation part and only added the results from the old dataset. I used word2vec and tf for that but I cannot find the script anymore so I'll have to rewrite it. Sorry about that!

Anyway, I plan an update soon mainly because the preprocessing is a problem and it's a bit of a headache to train on 7.07 with the current state of the project (made the wrong assumption that Valve releases heroes with consecutive indices). Thanks for the heads-up!

CharlesLiuyx commented 6 years ago

@andreiapostoae Yeah! I wrote a data-preprocessing script that maps 119 to 115 and 120 to 116 to solve this weird ID problem

BTW, there are some 4v5 games which produce bugs. I also solved that during data preprocessing
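A minimal sketch of the two preprocessing fixes described above. The actual script and data layout are not shown in this thread, so the match layout (a dict with `radiant` and `dire` lists of hero IDs) and all function names here are hypothetical:

```python
# Hypothetical match layout: {"radiant": [ids...], "dire": [ids...]}.
# The ID mapping is the one mentioned above: 119 -> 115 and 120 -> 116.
ID_REMAP = {119: 115, 120: 116}
TEAM_SIZE = 5


def remap_hero_id(hero_id):
    """Return the remapped ID, or the ID unchanged if it needs no fix."""
    return ID_REMAP.get(hero_id, hero_id)


def clean_matches(matches):
    """Drop incomplete games (e.g. 4v5) and remap hero IDs."""
    cleaned = []
    for match in matches:
        radiant, dire = match["radiant"], match["dire"]
        if len(radiant) != TEAM_SIZE or len(dire) != TEAM_SIZE:
            continue  # skip games without exactly five heroes per side
        cleaned.append({
            "radiant": [remap_hero_id(h) for h in radiant],
            "dire": [remap_hero_id(h) for h in dire],
        })
    return cleaned


sample = [
    {"radiant": [1, 2, 3, 4, 119], "dire": [5, 6, 7, 8, 120]},
    {"radiant": [1, 2, 3, 4], "dire": [5, 6, 7, 8, 9]},  # 4v5, dropped
]
cleaned_sample = clean_matches(sample)
```

The hard-coded mapping only covers the two IDs named in this comment, so it would need extending whenever new heroes are released.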

andreiapostoae commented 6 years ago

If you are interested in a collaboration, you could create a pull request. It would be very helpful for everybody. Regarding the heroes indices, what you did is a simple fix, but there is also the problem with the 24 index which is missing, so I thought about implementing a more elegant solution that is future-proof (e.g. using the reversed dictionary of heroes from metadata.json to handle indices).
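One way the reversed-dictionary idea could look, as a rough sketch only. The layout of metadata.json is not shown in this thread, so the `heroes` list below is a hypothetical stand-in with an illustrative subset of heroes, and `build_index_maps` is an invented name:

```python
# Hypothetical stand-in for the hero list in metadata.json; the real
# file's layout is not shown in this thread.
heroes = [
    {"id": 22, "name": "Zeus"},
    {"id": 23, "name": "Kunkka"},
    {"id": 25, "name": "Lina"},  # raw ID 24 is missing
    {"id": 119, "name": "Dark Willow"},
    {"id": 120, "name": "Pangolier"},
]


def build_index_maps(hero_list):
    """Map raw hero IDs to dense 0-based indices, and back."""
    raw_ids = sorted(h["id"] for h in hero_list)
    id_to_index = {raw: idx for idx, raw in enumerate(raw_ids)}
    index_to_id = {idx: raw for raw, idx in id_to_index.items()}
    return id_to_index, index_to_id


id_to_index, index_to_id = build_index_maps(heroes)
```

Because the dense indices are derived from whatever IDs are present, gaps such as the missing 24 or the jump to 119 need no special-casing in the rest of the code.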

CharlesLiuyx commented 6 years ago

@andreiapostoae Oh! So that is why there are lots of "23"s in the code. That part really confused me. ^-^

I have already downloaded data from Dec 1st to Jan 16th. If you need it, I could share it via Google Drive; it is already cleaned. But my code organization and structure are both messy compared to your amazing project. Only some of the data-processing gists are useful, so I would feel ashamed to open a pull request.

Looking forward to your update!!! Exciting!!

andreiapostoae commented 6 years ago

I understand and appreciate your insights! I thought that I should host the patches data on a public Google Drive as well instead of here, as there are size limitations and it's also not elegant. If your mined data resembles the one from 706e.zip, then I would appreciate you sharing so you save me a bit of time.

CharlesLiuyx commented 6 years ago

@andreiapostoae Yes, could you leave me the public Google Drive URL on Gitter or right here? I'll put my 707.zip file on it. I have done the 7.07 data processing except for the similarity file problem, and it works properly, but as you said before, the heroes dictionary needs to be refactored.

andreiapostoae commented 6 years ago

Here is the link. I will remove the edit option of the folder after you upload and I will do my best to keep the URL updated. I will also document this in the README. Thanks a lot for the help!

CharlesLiuyx commented 6 years ago

upload done!

CharlesLiuyx commented 6 years ago

Hi! Could you please tell me the update schedule? I am working on a project to build a model that predicts e-sports betting results, and I am going to refer to your results on the draft win rate. The professional matches are all on 7.07d. If you need any help that I can provide, I'm willing to help.

I don't understand your idea, the reversed dictionary, for solving the hero index problem in a more elegant and scalable way.

CharlesLiuyx commented 6 years ago

My plan is to write a data-preprocessing module that handles the raw data from the OpenDota API and ensures that the hero indices run consecutively from 1 to 115. Would this idea work?

Maybe I also need to add a key named "indice" to metadata.json

andreiapostoae commented 6 years ago

Regarding the update schedule, I have exams at this time so I could make an update in about 2-3 weeks.

It's true that I was a bit vague in my idea. I think your idea is simple enough to be extended, so if you would like, go ahead and implement it. The only requirement I have is that you leave the mining script intact (don't change the indices when saving to csv, only do it while preprocessing).

CharlesLiuyx commented 6 years ago

@andreiapostoae Nice! That requirement will absolutely be respected. I'll write a new script to implement it.

BTW, because the script for recalculating the similarity file is missing, I can't fully build the features for the 7.07 data, specifically the _query_missing() part.

andreiapostoae commented 6 years ago

It's ok, I will handle the similarities part when I get the time. Thanks!

CharlesLiuyx commented 6 years ago

No problem! Good luck with the exams. One thing I forgot: the dataset I uploaded has the suffix _re, which means the data was already preprocessed by mapping 119 to 115 and 120 to 116

CharlesLiuyx commented 6 years ago

I added a new pull request. I think that if you could add the script that calculates the 7.06e similarity_all.csv, I could update this project for 7.07 and make sure every part works properly.

CharlesLiuyx commented 6 years ago

Dude! How are you doing? The patch has already been updated to 7.09. I plan to build some models to determine the advantage during different time slots, for example 0-12 min, 12-25 min, 25-40 min, and >40 min. I really need your kind help with the similarities part! Thanks a lot

CharlesLiuyx commented 6 years ago

Hi! Please just tell me how to generate the similarity file, so that I can continue this project as far as I can and follow the latest patch!

Or just upload a sample script that generates the similarities.csv file. It's hard to read the code line by line.

andreiapostoae commented 6 years ago

I'm sorry, but I really do not have the time to do this, even though I tried. As I previously said, I do not have the code anymore, but I can tell you what I did.

For each game in the dataset, I considered each hero to be a word, and each game to be a sentence. Therefore I did not use labels (who won the game). I then trained word2vec (skip-gram flavor) using TensorFlow (word2vec example). When investigating the similarities and clustering them using t-SNE, I discovered that even though the algorithm was given no relation between those heroes, it understood the heroes' roles. The similarity between two heroes is measured by the cosine distance between the two words' (Dota heroes') embeddings, so the shorter the distance, the more similar those heroes are (role-wise). I also used a special delimiter word, 'PAD', to split the radiant and dire teams.
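A rough sketch of the data preparation and similarity measure described above, since the original script is lost. It covers only the sentence construction with the 'PAD' delimiter, skip-gram pair generation, and cosine distance; the word2vec training step is omitted. All function names, the window size, and the toy embedding values are assumptions for illustration, not the original code or trained results:

```python
import math

PAD = "PAD"  # delimiter word between the radiant and dire teams


def game_to_sentence(radiant, dire):
    """Turn one game into a 'sentence' of hero tokens split by PAD."""
    return [str(h) for h in radiant] + [PAD] + [str(h) for h in dire]


def skipgram_pairs(sentence, window=2):
    """Yield (center, context) pairs within `window` positions."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]


def cosine_distance(u, v):
    """1 - cosine similarity; a smaller value means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


sentence = game_to_sentence([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
pairs = list(skipgram_pairs(sentence, window=2))

# Toy embeddings for illustration only, not trained values.
toy = {"1": [1.0, 0.0], "2": [0.9, 0.1], "6": [0.0, 1.0]}
d_close = cosine_distance(toy["1"], toy["2"])
d_far = cosine_distance(toy["1"], toy["6"])
```

In a full pipeline the pairs would feed a skip-gram model, and a similarities file could then be filled with the pairwise cosine distances between the learned hero embeddings.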

When I get the time, I will do this along with more changes in the preprocessing pipeline, a neural network implementation and current patch compatibility update.

CharlesLiuyx commented 6 years ago

@andreiapostoae Thank you, I will try my best to continue your work and update it to 7.11, following SteamDataBase! After you finish your Bachelor's thesis, we could discuss it