Closed by alinapark 1 year ago
Please correct the description of the task. It's not clear what needs to be predicted: what do the topic and geolocation have to do with it?
@hivaze the model should analyze a piece of text and predict whether it is useful for identifying the user's location. I added a couple of examples - hope it helps
Hi @alinapark, the training dataset does not have a label column from which the model could learn whether a tweet is useful for identifying a user's location. For example: "born and raised here! Yankees forever!" will have a positive label while "my favorite cities are Paris, Milan and London" will have a negative label. Are we expected to label the data?
@AbdullahMakhdoom you're right - it doesn't. The task itself is to create a model that will identify geolocation-related texts based on the data provided. The methodology is up to the candidates - whether that's unsupervised learning, annotations, or any other methods
@alinapark do you have a preference on what level of location we should be predicting? For example, for tweets in the U.S., should we be predicting city, state, closest large metro area, etc.? Any guidance here would be helpful.
@alinapark Should the model work only with texts in English (en), or all those other languages inside the example dataset?
@alinapark do you have a preference on what level of location we should be predicting? For example, for tweets in the U.S., should we be predicting city, state, closest large metro area, etc.? Any guidance here would be helpful.
The model performance will be measured based on the distance between predicted and actual coordinates.
@alinapark Should the model work only with texts in English (en), or all those other languages inside the example dataset?
The model performance will be measured using our test dataset, withheld from the challenge. Just like the training dataset, it contains many non-English tweets.
@evgenydmitriev According to the task description:
"The goal of the challenge is to create a model to flag tweets containing information useful for identifying the user's location. The model should consume texts on the input, and return a confidence score for the tweet to be relevant to identifying the user's location."
it's a binary classification problem. But according to your comment:
"The model performance will be measured based on the distance between predicted and actual coordinates."
the model should predict coordinates.
So should the model output a confidence score for the tweet being relevant to identifying the user's location, or should it predict coordinates?
@lidiaToropova correct, the only way you would be able to judge the relevancy of the tweets is the distance between the predicted and the actual coordinates. The relevancy part is there to highlight the importance of high-confidence decisions and to allow candidates to filter out the noise tweets early on.
@alinapark you might want to include an example of a successful submission and expand the evaluation section to make things more obvious.
I provided a BERT-based solution. The data was pre-processed with NLP and geoprocessing into coordinate-class targets. I ran into an issue with the ground-truth data recoding. The problem, I suspect, is that in some cases the ground-truth latitude and longitude do not correspond to the real coordinates. For example, in some tweets the text reveals the coordinates very precisely, but the coordinates provided as ground truth are different. https://github.com/ScientificCollaboration/Geocodes/blob/main/geo_error.jpg
🎉 @orzhan successfully solved the challenge and was hired by @inca-digital! Congrats 🎉
@evgenydmitriev @alinapark Hi! sent you the solution, could you please check out the inbox? Thanks!
@evgenydmitriev @alinapark How long does it usually take for you to check the provided solution? It's been more than 2 weeks since I sent it:(
The goal of this challenge is to create a model to identify tweets referring to the user's location.
Deliverable:
The goal of the challenge is to create a model to flag tweets containing information useful for identifying the user's location. The model should take text as input and return a confidence score indicating how relevant the tweet is to identifying the user's location. The applicant should also provide the threshold confidence value for interpreting the score.
For example: "born and raised here! Yankees forever!" would have a higher confidence score than "AWS has been down for a while now" or "my favorite cities are Paris, Milan and London".
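As an illustration only, here is a minimal sketch of how a confidence score and a threshold might be combined into a flag. The function name `flag_tweets`, the score values, and the threshold of 0.5 are all hypothetical; the challenge does not prescribe any of them.

```python
from typing import List


def flag_tweets(scores: List[float], threshold: float) -> List[bool]:
    """Return True for each tweet whose confidence score meets the threshold."""
    return [score >= threshold for score in scores]


# Hypothetical scores for the three example tweets, in the order quoted above.
# These numbers are made up for illustration, not produced by a real model.
example_scores = [0.91, 0.08, 0.22]

# The threshold is an assumption; the applicant is expected to supply their own.
flags = flag_tweets(example_scores, threshold=0.5)
print(flags)
```

With these assumed values only the first tweet would be flagged. In practice the threshold would presumably be tuned on held-out data rather than fixed in advance.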
The relevancy of a tweet should be evaluated by predicting the author's location and comparing the prediction with the actual location (as given in the training data).
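One possible way to compare a predicted location with the actual one is the great-circle (haversine) distance between the two coordinate pairs. The sketch below is an assumption about how such a comparison could look, not the organizers' evaluation code. The helper names `haversine_km` and `is_relevant` and the 100 km cutoff are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def is_relevant(predicted, actual, max_error_km: float = 100.0) -> bool:
    """Hypothetical rule: treat a tweet as relevant if the predicted location
    falls within max_error_km of the actual one. The cutoff is an assumption."""
    return haversine_km(*predicted, *actual) <= max_error_km


# Example: a prediction for New York compared with an actual location in London.
new_york = (40.7128, -74.0060)
london = (51.5074, -0.1278)
distance = haversine_km(*new_york, *london)
print(round(distance), is_relevant(new_york, london))
```

A distance-based rule like this is one way to derive relevance labels from the coordinates alone, since the training data has no label column.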
Training Data:
The training data set may be found here. It includes over 8M tweets, along with the user's geographical coordinates (latitude and longitude) and basic metadata, such as language and the time of post.
Results Evaluation:
We'll evaluate the model with a data set not shared in the challenge.
To make the evaluation process easier, please make sure your code includes a function that:
To submit your solution:
Successful submissions
🎉 @orzhan successfully solved the challenge and was hired by @inca-digital