Closed CharlesLiuyx closed 6 years ago
Oh, that's awkward. It seems like I forgot to add the similarity recalculation part and only added the results from the old dataset. I used word2vec and tf for that but I cannot find the script anymore so I'll have to rewrite it. Sorry about that!
Anyway, I plan an update soon mainly because the preprocessing is a problem and it's a bit of a headache to train on 7.07 with the current state of the project (made the wrong assumption that Valve releases heroes with consecutive indices). Thanks for the heads-up!
@andreiapostoae Yeah! I wrote a data-preprocessing script that maps 119 to 115 and 120 to 116 to solve this weird ID problem.
BTW, there are some 4v5 games which produce bugs. I also solved that in data preprocessing.
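A minimal sketch of how such 4v5 games could be dropped during preprocessing. The `(radiant, dire)` pair layout and both function names are hypothetical and only for illustration; the project's actual CSV columns may differ.

```python
def is_full_match(radiant, dire, team_size=5):
    """Return True only when both teams have exactly `team_size` distinct heroes."""
    heroes = list(radiant) + list(dire)
    return (
        len(radiant) == team_size
        and len(dire) == team_size
        and len(set(heroes)) == 2 * team_size
    )


def drop_incomplete_matches(matches):
    """Keep only complete 5v5 matches from an iterable of (radiant, dire) pairs."""
    return [(r, d) for r, d in matches if is_full_match(r, d)]


# Assumed example data: the second game is a 4v5 and should be filtered out.
matches = [
    ([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]),
    ([1, 2, 3, 4], [6, 7, 8, 9, 10]),
]
clean = drop_incomplete_matches(matches)
```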
If you are interested in a collaboration, you could create a pull request. It would be very helpful for everybody. Regarding the heroes indices, what you did is a simple fix, but there is also the problem with the 24 index which is missing, so I thought about implementing a more elegant solution that is future-proof (e.g. using the reversed dictionary of heroes from metadata.json to handle indices).
@andreiapostoae Oh! That is why there are lots of "23"s in the code. That part really confused me. ^-^
I have already downloaded data from Dec 1st to Jan 16th. If you need it, I can share the already-cleaned data via Google Drive. But my code organization and structure are both messy compared to your amazing project. Only a few data-processing gists are useful, so a pull request... emmm, I would feel ashamed to open one.
Looking forward to your update!!! Exciting!!
I understand and appreciate your insights! I thought that I should host the patches data on a public Google Drive as well instead of here, as there are size limitations and it's also not elegant. If your mined data resembles the one from 706e.zip, then I would appreciate you sharing so you save me a bit of time.
@andreiapostoae Yes! Could you send me the public Google Drive URL via Gitter or right here? I'll put my 707.zip file there. I have finished the 7.07 data processing except for the similarity_file problem, and it works properly, but as you said before, the heroes dictionary needs to be refactored.
Here is the link. I will remove the edit option of the folder after you upload and I will do my best to keep the URL updated. I will also document this in the README. Thanks a lot for the help!
upload done!
Hi! Could you please tell me the update schedule? I am working on a project to build a model that predicts e-sports betting results, and I am going to refer to your results on draft win rate. The professional matches are all on 7.07d. If there is anything I can help with, I'm willing to help.
I don't quite get your reversed-dictionary idea for solving the hero index problem in a more elegant and scalable way.
My plan is to write a data-preprocessing module that takes the raw data from the OpenDota API and ensures that the hero indices run consecutively from 1 to 115. Would this idea work?
Maybe I also need to add a key named "indice" to metadata.json.
Regarding the update schedule, I have exams at this time so I could make an update in about 2-3 weeks.
It's true that I was a bit vague in my idea. I think your idea is simple enough to be extended, so if you would like, go ahead and implement it. The only requirement I have is that you leave the mining script intact (don't change the indices when saving to csv, only do it while preprocessing).
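A minimal sketch of that idea, assuming the raw hero IDs are available as a plain list (for example, read from metadata.json, whose exact layout is not shown here). The function names are hypothetical. The raw IDs in the mined CSV stay untouched; the mapping is applied only while preprocessing, and the reversed dictionary recovers the original ID when needed.

```python
def build_index_maps(raw_hero_ids):
    """Map sorted raw hero IDs to consecutive indices starting at 1.

    Returns (raw_to_index, index_to_raw); the second dict is the reversed
    dictionary used to translate model indices back to raw IDs.
    """
    raw_to_index = {
        raw: index
        for index, raw in enumerate(sorted(set(raw_hero_ids)), start=1)
    }
    index_to_raw = {index: raw for raw, index in raw_to_index.items()}
    return raw_to_index, index_to_raw


def remap_team(team, raw_to_index):
    """Translate a list of raw hero IDs into consecutive indices."""
    return [raw_to_index[hero_id] for hero_id in team]


# Assumed ID layout, taken from this thread: 24 is missing and the two
# newest heroes use 119 and 120.
raw_ids = list(range(1, 24)) + list(range(25, 115)) + [119, 120]
raw_to_index, index_to_raw = build_index_maps(raw_ids)
```

Because the mapping is derived from whatever IDs are present, a new gap or a new hero in a later patch needs no hard-coded special case.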
@andreiapostoae Nice! That requirement will absolutely be respected. I'll write a new script to implement it.
BTW, because the similarity_file recalculation script is missing, I can't build the complete features for the 7.07 data, specifically the _query_missing() part.
It's ok, I will handle the similarities part when I get the time. Thanks!
No problem! Good luck with the exams. One thing I forgot to mention: the dataset I uploaded has the suffix _re, which means the data was already preprocessed by mapping 119 to 115 and 120 to 116.
I added a new pull request. If you could add the script that calculates the 7.06e similarity_all.csv, I could update this project for 7.07 and make sure every part works properly.
Dude! How are you doing? The patch has already been updated to 7.09. I plan to build some models to determine the advantage during different time slots, for example 0-12 min, 12-25 min, 25-40 min, and >40 min. I really need your kind help with the similarities part! Thanks a lot.
Hi! Please just tell me how to generate the similarity file, and I will continue this project as best I can to follow the latest patch!
Or just upload a sample script that generates the similarities.csv file. It's hard to read the code line by line.
I'm sorry, but I really do not have the time to do this, even though I tried. As I previously said, I do not have the code anymore, but I can tell you what I did.
For each game in the dataset, I considered each hero to be a word and each game to be a sentence, so I did not use labels (who won the game). I then trained word2vec (skip-gram flavor) using the TensorFlow word2vec example. When investigating the similarities and clustering them with t-SNE, I discovered that even though the algorithm was given no relation between those heroes, it understood the heroes' roles. The similarity between two heroes is the cosine distance between the two words' (Dota heroes') embeddings, so the shorter the distance, the more similar those heroes are (role-wise). I also used a special delimiter word, 'PAD', to split the Radiant and Dire teams.
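A rough sketch of the two pieces described above that do not depend on a particular word2vec implementation: turning a game into a "sentence" and computing the cosine distance between embeddings. This is not the lost script. The function names, the string tokens, and the toy embeddings are assumptions for illustration, and the skip-gram training step is left out; any word2vec implementation could supply the embeddings.

```python
import math

PAD = "PAD"  # delimiter word separating the Radiant and Dire teams


def game_to_sentence(radiant, dire):
    """Turn one game into a word2vec 'sentence' of hero tokens."""
    return [str(h) for h in radiant] + [PAD] + [str(h) for h in dire]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def similarity_table(embeddings):
    """Return {(hero_a, hero_b): cosine distance} for every ordered hero pair."""
    heroes = [h for h in embeddings if h != PAD]
    return {
        (a, b): cosine_distance(embeddings[a], embeddings[b])
        for a in heroes
        for b in heroes
        if a != b
    }


# Toy embeddings standing in for the output of a trained skip-gram model.
toy_embeddings = {"1": [1.0, 0.0], "2": [0.0, 1.0], PAD: [1.0, 1.0]}
table = similarity_table(toy_embeddings)
```

The resulting table could then be written out with the `csv` module in whatever column layout the project's similarity file expects.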
When I get the time, I will do this along with more changes in the preprocessing pipeline, a neural network implementation and current patch compatibility update.
@andreiapostoae Thank you, I will try my best to continue your work and update it to 7.11, following SteamDataBase! After you finish your bachelor's thesis, we could discuss it.
After processing the data and training on the 7.07 data, I don't know how to recalculate similarities_all.csv.
P.S. I plotted the hero map successfully
Thanks a lot!