ChaitanyaBaweja closed this issue 4 years ago
To explain further: I have some partially labeled Twitter data in which some users have location info and others don't. I would like to use your model to impute the missing locations.
I can arrange our data in the format that you use for training (user id, lat, long, concatenated tweets), but I need the output as a city label: the model should return a city name for each user.
I would be grateful if you can suggest the best path for achieving this.
Hi Chaitanya,
The locations are originally (lat, lon) pairs. I cluster them into several regions and then classify users into these regions. Once a user is classified into a region, you can set their location to the median training coordinate of that region.
If you use data.py to convert your data into a proper dataset, the code exposes classLatMedian and classLonMedian, which hold the median point of each class. E.g. if a user is classified into class 9, their coordinate is (classLatMedian[9], classLonMedian[9]). Once you have the coordinates, you can find the city with a reverse-geocoding service such as the Google Maps API or the OpenStreetMap API.
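As a rough sketch of that mapping, assuming classLatMedian and classLonMedian can be indexed by the predicted class id (check whether your copy of the code stores the keys as integers or strings). The function names and the `reverse_geocode` callable are hypothetical; in practice the callable would wrap a request to whichever geocoding service you choose, and the toy values below are for illustration only.

```python
from typing import Callable, Dict, List, Mapping, Sequence, Tuple

Coordinate = Tuple[float, float]


def classes_to_coordinates(
    predicted_classes: Sequence[int],
    class_lat_median: Mapping[int, float],
    class_lon_median: Mapping[int, float],
) -> List[Coordinate]:
    """Replace each predicted class with that class's median (lat, lon)."""
    return [(class_lat_median[c], class_lon_median[c]) for c in predicted_classes]


def coordinates_to_cities(
    coordinates: Sequence[Coordinate],
    reverse_geocode: Callable[[float, float], str],
) -> List[str]:
    """Look up a city per coordinate, querying each distinct point only once."""
    cache: Dict[Coordinate, str] = {}
    cities = []
    for lat, lon in coordinates:
        if (lat, lon) not in cache:
            cache[(lat, lon)] = reverse_geocode(lat, lon)
        cities.append(cache[(lat, lon)])
    return cities


# Toy medians for illustration; real values come from the training data.
class_lat_median = {0: 40.71, 1: 34.05}
class_lon_median = {0: -74.01, 1: -118.24}


def toy_reverse_geocode(lat: float, lon: float) -> str:
    # Stand-in for a real geocoding request (hypothetical lookup table).
    table = {(40.71, -74.01): "New York", (34.05, -118.24): "Los Angeles"}
    return table[(lat, lon)]


predicted = [1, 0, 1]  # one predicted class per user
coords = classes_to_coordinates(predicted, class_lat_median, class_lon_median)
cities = coordinates_to_cities(coords, toy_reverse_geocode)
```

Because every user in a class gets the same median point, there are only as many distinct coordinates as there are classes, so caching keeps the number of geocoding requests small. It also gives you a fixed class-to-city table.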
Does that answer the question? If not, I'll be more than happy to help you run this.
Hi Afshin,
Thank you for your prompt response. That answers my question to a certain extent. I will look into classLatMedian and update you once I have implemented this.
Suppose I have a dataset with lat/long info for 10% of the users and need location information for the remaining 90%. What should my training and test sets look like?
Best, Chaitanya
Use the 10% for training and validation, e.g. 9% for training and 1% for validation, and use the other 90% as the test set.
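A minimal sketch of such a split, assuming the users are held in plain Python lists. The function name, the seed, and the toy user ids are hypothetical and not part of the repository.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_labeled_users(
    labeled_users: Sequence[T],
    validation_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Shuffle the labeled users and hold out a fraction for validation."""
    shuffled = list(labeled_users)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy example: 100 labeled users (10%) and 900 unlabeled users (90%).
labeled = [f"user{i}" for i in range(100)]
unlabeled = [f"user{i}" for i in range(100, 1000)]

train, validation = split_labeled_users(labeled)  # 9% and 1% of all users
test = unlabeled  # the users whose locations need to be predicted
```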
Thank you for the comment. I will try this out.
Best, Chaitanya
Do I need to modify the code to fit this setting? I don't have any labels for the 90% of the data that is now being used as the test set, and the code in data.py will still look for lat, long information.
The code needs lat, lon for the test set because it evaluates the final results against them. Give all 90% of the test instances the same placeholder lat, lon (e.g. 10, 10). The reported test errors will understandably not be meaningful, but after training you can still take the predictions.
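For instance, the placeholder rows could be built like this before writing them out in the training format (user id, lat, long, concatenated tweets). This is only a sketch: the function name and the in-memory tuple layout are assumptions, and the on-disk format expected by data.py should be checked against the repository.

```python
from typing import List, Sequence, Tuple

Row = Tuple[str, float, float, str]  # (user id, lat, lon, concatenated tweets)


def with_placeholder_coordinates(
    unlabeled_users: Sequence[Tuple[str, str]],
    lat: float = 10.0,
    lon: float = 10.0,
) -> List[Row]:
    """Attach the same dummy coordinate to every unlabeled user."""
    return [(user_id, lat, lon, text) for user_id, text in unlabeled_users]


# Toy unlabeled users: (user id, concatenated tweets).
unlabeled = [("u1", "first tweet second tweet"), ("u2", "another tweet")]
test_rows = with_placeholder_coordinates(unlabeled)
```

The dummy coordinate exists only so the loader accepts the rows. Ignore any distance or accuracy figures reported for this test set.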
Thank you
I would like to get the label information out. Which label corresponds to which city?