1712n / challenge

Challenge Program

Geotagging - identify relevant tweets #78

Closed (alinapark closed 1 year ago)

alinapark commented 2 years ago

The goal of this challenge is to create a model to identify tweets referring to the user's location.

Deliverable:

The goal of the challenge is to create a model to flag tweets containing information useful for identifying the user's location. The model should take text as input and return a confidence score indicating how relevant the tweet is to identifying the user's location. The applicant should also provide the threshold confidence value needed to interpret the score.

For example: "born and raised here! Yankees forever!" would have a higher confidence score than "AWS has been down for a while now" or "my favorite cities are Paris, Milan and London".

The relevancy of a tweet should be evaluated by predicting the author's location and comparing the prediction with the actual location (as per the training data values).
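A minimal sketch of the score-plus-threshold shape described above, assuming scores lie in [0, 1]. The names `flag_tweets` and `demo_score` are hypothetical, and the score values are made up for illustration; a real submission would plug in a trained model instead.

```python
from typing import Callable, List, Tuple


def flag_tweets(
    texts: List[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[Tuple[str, float, bool]]:
    """Score each text and flag it when the score reaches the threshold."""
    results = []
    for text in texts:
        score = float(score_fn(text))
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside [0, 1]")
        results.append((text, score, score >= threshold))
    return results


# Stand-in for a model: fixed, made-up scores for two of the example tweets.
_DEMO_SCORES = {
    "born and raised here! Yankees forever!": 0.9,
    "AWS has been down for a while now": 0.1,
}


def demo_score(text: str) -> float:
    """Hypothetical scorer; unknown texts get a score of 0.0."""
    return _DEMO_SCORES.get(text, 0.0)


results = flag_tweets(list(_DEMO_SCORES), demo_score, threshold=0.5)
for text, score, flagged in results:
    print(f"{score:.2f} flagged={flagged} {text}")
```

Passing the scorer in as a callable keeps the thresholding logic independent of whichever model produces the scores.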

Training Data:

The training data set may be found here. It includes over 8M tweets, along with the user's geographical coordinates (latitude and longitude) and basic metadata, such as language and the time of post.

Results Evaluation:

We'll evaluate the model on a data set not shared in the challenge.

To make the evaluation process easier, please make sure your code includes a function that:

To submit your solution:


Successful submissions

🎉 @orzhan successfully solved the challenge and was hired by @inca-digital

hivaze commented 2 years ago

Please correct the description of the task. It's not clear what needs to be predicted. What do the topic and geolocation have to do with it?

alinapark commented 2 years ago

@hivaze the model should analyze a piece of text and predict whether it is useful for identifying the user's location. I added a couple of examples - hope it helps

AbdullahMakhdoom commented 2 years ago

Hi @alinapark, the training dataset does not have a label column from which the model could learn whether a tweet is useful for identifying a user's location. For example, "born and raised here! Yankees forever!" would have a positive label, while "my favorite cities are Paris, Milan and London" would have a negative label. Are we expected to label the data ourselves?

alinapark commented 2 years ago

@AbdullahMakhdoom you're right - it doesn't. The task itself is to create a model that will identify geolocation-related texts based on the data provided. The methodology is up to the candidates - whether that's unsupervised learning, annotations, or any other method

joebutcher commented 2 years ago

@alinapark do you have a preference on what level of location we should be predicting? For example, for tweets in the U.S., should we be predicting city, state, closest large metro area, etc.? Any guidance here would be helpful.

delattre1 commented 2 years ago

@alinapark Should the model work only with texts in English (en), or all those other languages inside the example dataset?

evgenydmitriev commented 2 years ago

joebutcher wrote: "@alinapark do you have a preference on what level of location we should be predicting? For example, for tweets in the U.S., should we be predicting city, state, closest large metro area, etc.? Any guidance here would be helpful."

The model performance will be measured based on the distance between predicted and actual coordinates.

delattre1 wrote: "@alinapark Should the model work only with texts in English (en), or all those other languages inside the example dataset?"

The model performance will be measured using our test dataset, withheld from the challenge. Just like the training dataset, it contains many non-English tweets.
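The thread does not say how the distance between predicted and actual coordinates is computed. One common choice is the haversine great-circle distance. The sketch below assumes that metric, a mean Earth radius of 6371 km, and a hypothetical function name `haversine_km`; the organizers' actual metric may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Illustrative coordinates only (New York and Los Angeles), not challenge data.
predicted = (40.7128, -74.0060)
actual = (34.0522, -118.2437)
error_km = haversine_km(*predicted, *actual)
print(f"prediction error: {error_km:.0f} km")
```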

lidiaToropova commented 2 years ago

@evgenydmitriev According to the task description,

"The goal of the challenge is to create a model to flag tweets containing information useful for identifying the user's location. The model should consume texts on the input, and return a confidence score for the tweet to be relevant to identifying the user's location."

it's a binary classification problem. But according to your comment,

"The model performance will be measured based on the distance between predicted and actual coordinates."

the model should predict coordinates.

So should the model output a confidence score for the tweet's relevance to identifying the user's location, or should it predict coordinates?

evgenydmitriev commented 2 years ago

@lidiaToropova correct, the only way you would be able to judge the relevancy of the tweets is the distance between the predicted and the actual coordinates. The relevancy part is there to highlight the importance of high-confidence decisions and to allow candidates to filter out the noise tweets early on.

@alinapark you might want to include an example of a successful submission and expand the evaluation section to make things more obvious.
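The link between confidence and distance error described in this comment can be sketched as follows: keep only tweets scored at or above a threshold, then report how many were kept and how far off their predicted coordinates were. The function name `evaluate_at_threshold`, the use of the median, and all numbers below are assumptions for illustration, not the organizers' evaluation procedure.

```python
from statistics import median
from typing import List, Optional, Tuple


def evaluate_at_threshold(
    scores: List[float],
    errors_km: List[float],
    threshold: float,
) -> Tuple[float, Optional[float]]:
    """Return (coverage, median error in km) for tweets with score >= threshold."""
    if len(scores) != len(errors_km):
        raise ValueError("scores and errors_km must have the same length")
    kept = [err for score, err in zip(scores, errors_km) if score >= threshold]
    coverage = len(kept) / len(scores) if scores else 0.0
    return coverage, (median(kept) if kept else None)


# Made-up confidence scores and distance errors for four tweets.
scores = [0.9, 0.8, 0.3, 0.1]
errors_km = [5.0, 20.0, 900.0, 4000.0]

for threshold in (0.0, 0.5):
    coverage, med = evaluate_at_threshold(scores, errors_km, threshold)
    print(f"threshold={threshold}: coverage={coverage:.2f}, median error={med} km")
```

A useful threshold is one where the median error of the retained tweets drops sharply while coverage stays acceptable.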

ScientificCollaboration commented 2 years ago

I provided a BERT-based solution. The data was pre-processed with NLP and geoprocessing into coordinate class targets. I ran into an issue with the ground-truth data recoding. The problem, I suspect, is that in some cases the ground-truth latitude and longitude do not correspond to the real coordinates. For example, in some tweets the text reveals the coordinates very precisely, but the coordinates provided as ground truth are different. https://github.com/ScientificCollaboration/Geocodes/blob/main/geo_error.jpg

evgenydmitriev commented 1 year ago

🎉 @orzhan successfully solved the challenge and was hired by @inca-digital! Congrats 🎉

sultanovazamat commented 1 year ago

@evgenydmitriev @alinapark Hi! I sent you the solution. Could you please check your inbox? Thanks!

sultanovazamat commented 1 year ago

@evgenydmitriev @alinapark How long does it usually take you to check a submitted solution? It's been more than 2 weeks since I sent it :(