1712n / challenge

Challenge Program

Geographic NER in Twitter Bios #65

Closed evgenydmitriev closed 1 year ago

evgenydmitriev commented 2 years ago

This challenge is about Geographic Named Entity Recognition for our Twitter Bio Location Dataset. We use this dataset to measure the effectiveness of our geotagging tools against the baseline, which can be generated by running the latest version of geopy.

To participate in the challenge, submit a pull request to this repository that replaces values predicted by geopy with your solution's results. Expanding the pull request description with your methodology can help us better understand your reasoning and evaluate your submission faster. To make sure your submission doesn't get lost, you can also email your pull request link along with your resume and the link to this challenge to challenge-submission@blockshop.org. Also, don't hesitate to ask us questions by commenting in this issue.
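Since a submission replaces the geopy-generated values with a new model's output, the mechanics can be sketched as below. This is a toy illustration only: the field names `bio_location` and `predicted_location` are assumptions, not the dataset's actual column names, and the lambda stands in for a real geotagging model.

```python
def replace_geopy_predictions(rows, predict, column="predicted_location"):
    """Overwrite the baseline geopy predictions with a new model's output.

    `rows` are dicts as read from the dataset CSV; `column` is an assumed
    name for the prediction field -- check the actual file for the real one.
    """
    for row in rows:
        row[column] = predict(row["bio_location"])
    return rows

# Toy predictor standing in for a real geotagging model
rows = [{"bio_location": "NYC", "predicted_location": "geopy guess"}]
updated = replace_geopy_predictions(rows, lambda s: s.upper())
print(updated[0]["predicted_location"])  # NYC
```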

nlptechbook commented 2 years ago

Just completed the challenge:

https://colab.research.google.com/drive/1ggTM5wyJG7IYN90ubUeIFC1_Rvz8Susg?usp=sharing

nlptechbook commented 2 years ago

Although the model I mentioned in the previous comment (https://colab.research.google.com/drive/1ggTM5wyJG7IYN90ubUeIFC1_Rvz8Susg?usp=sharing) achieves high accuracy (99.5%+) on the test set, it still uses a single input sequence (a sequence of words) to make a prediction (a sequence of corresponding NER labels). To achieve better accuracy, especially when identifying geo-entities the model has not seen before, you might use additional input sequences (so-called covariates), such as the part-of-speech tags of the words in the input sequence and/or their syntactic heads. I've implemented a model that supports such covariates and can provide the code. Please look at the model in my previous comment first.
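The covariate idea above amounts to feeding the model several token-aligned sequences instead of one. A minimal sketch, with invented annotations (a real pipeline would produce the POS tags and syntactic heads with a parser, and a real model would embed each stream separately and concatenate the vectors):

```python
def build_inputs(tokens, pos_tags, heads):
    """Zip token-aligned covariate sequences into per-token feature tuples.

    Every covariate sequence must have the same length as the token
    sequence, one annotation per token.
    """
    assert len(tokens) == len(pos_tags) == len(heads)
    return list(zip(tokens, pos_tags, heads))

# Hypothetical annotations for "lives in Rongo"
feats = build_inputs(
    ["lives", "in", "Rongo"],
    ["VERB", "ADP", "PROPN"],
    ["lives", "lives", "in"],  # syntactic head of each token
)
print(feats[2])  # ('Rongo', 'PROPN', 'in')
```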

nlptechbook commented 2 years ago

The model I built in this Colab: https://colab.research.google.com/drive/1jieTeM6LPZGDZFZ7TP_d86jxKU_WhyMM?usp=sharing relies on syntactic dependency analysis and uses three input sequences to identify geo-entities. A significant part of the code at the beginning is specific to preprocessing the Kaggle NER dataset, so if you use your own dataset, you won't need that part.

Make sure to play with the sample sentences at the end of the Colab to see how well the model identifies geo-entities depending on context (the words surrounding a potential geo-entity), especially previously unseen ones.

ogunsegun commented 2 years ago

I tried analyzing the dataset and used another model to build it: https://jovian.ai/ogunsegun/geopy-ch. You can check it there.

nlptechbook commented 2 years ago

You can use your own dataset, but the model I provided should show good results. It is genuinely context-dependent and produces adequate results even for previously unseen entities. As you can see in the provided Colab, it interprets the word Rongo differently in different sentences.


ogunsegun commented 2 years ago

I have re-uploaded the file: https://drive.google.com/file/d/1_dQBIm05BP9mvHhImJ3d09WOaopFU6Qd/view?usp=sharing

alinapark commented 2 years ago

Thank you both! Coordinating over email.

avarsenev commented 2 years ago

Hi! Please see my code here: https://github.com/avarsenev/INCA/blob/5277dc9689a96ba40c3b17a001ce0ab6201bff46/NLP_Inca_twilocations_fin.ipynb. I suggest using spaCy's GPE labels to filter out a good chunk of the non-geopolitical locations present in the data.
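The GPE-filtering suggestion above can be sketched without loading a full spaCy pipeline. In real code each pair would come from a spaCy doc's `ent.text` and `ent.label_`; the sample entities below are invented for illustration:

```python
def filter_geopolitical(entities):
    """Keep only entities labeled GPE (countries, cities, states)."""
    return [text for text, label in entities if label == "GPE"]

# (text, label) pairs as they might come out of spaCy's NER
ents = [("Paris", "GPE"), ("the Batcave", "FAC"), ("Kenya", "GPE")]
print(filter_geopolitical(ents))  # ['Paris', 'Kenya']
```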

ogunsegun commented 2 years ago

OK, I will work on it.


shell-escape commented 1 year ago

Could you please clarify the criteria by which we consider an entity a geolocation? Some of the records in the Twitter dataset are quite ambiguous, and it is difficult to tell whether a record contains or is a geolocation without any context. For example, "DC" could mean Washington, D.C., DC Comics, or something else. How should such cases be handled?

evgenydmitriev commented 1 year ago

@shell-escape, it's up to you what probability threshold you want to establish and how to handle outright junk. No hard requirements there.
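One way to implement the probability threshold mentioned above is to accept a candidate only when the model's top guess is confident enough. A minimal sketch; the threshold value and the candidate tuples are illustrative, not part of the challenge spec:

```python
def resolve_location(candidates, threshold=0.8):
    """Return the top (text, probability) candidate if it clears the threshold.

    Ambiguous strings like "DC" yield competing low-probability candidates,
    so they fall below the cutoff and are treated as non-geolocations (None).
    """
    if not candidates:
        return None
    text, prob = max(candidates, key=lambda c: c[1])
    return text if prob >= threshold else None

print(resolve_location([("Washington, D.C.", 0.55), ("DC Comics", 0.45)]))  # None
print(resolve_location([("Nairobi, Kenya", 0.97)]))  # Nairobi, Kenya
```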

nlptechbook commented 1 year ago

I'm working on an approach based on discovering and using trends in BERT embeddings to capture context-dependent semantics. The approach can be applied (or adapted) to a range of NLP classification tasks, including the one discussed here. I described the idea in simple terms on my GitHub at https://github.com/nlptechbook/BERTembeddings. There is also a Colab (https://colab.research.google.com/drive/1k_R1qOS79auwS2JEJ7D1mYMXHXad29fd?usp=sharing) and a summary (in Russian) in this Notion doc: https://www.notion.so/BERT-746b8ac8e4fc47b8bef707eabed14aa3
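The intuition behind using contextual embeddings for this task is that the same surface token gets a different vector in different contexts, and its similarity to known location mentions can be compared. A toy sketch with invented 3-d vectors standing in for real BERT embeddings (which would come from a transformer model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two same-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

# Invented "contextual embeddings" of the token "DC" in two contexts:
dc_in_moved_to = [0.9, 0.1, 0.0]   # "I moved to DC last year"
dc_in_comics   = [0.1, 0.9, 0.2]   # "I read DC comics"
city_centroid  = [1.0, 0.0, 0.0]   # averaged embedding of known city mentions

# The location-context usage sits closer to the city centroid
print(cosine(dc_in_moved_to, city_centroid) > cosine(dc_in_comics, city_centroid))  # True
```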