ahsan0029 / initial_task_text

Tweet Preprocessing #1

Open wendli01 opened 3 years ago

wendli01 commented 3 years ago

Your preprocessing seems to be a bit too aggressive. For example, it turns "WTF, die grüne Grenze zwischen Österreich und Deutschland ist 800 km lang. Kein Problem für #Flüchtlinge. #Grenzkontrollen ist keine Lösung!" into "TF , die grüne Green zwischen Österreich und Deutschland ist 800 km lang Kevin Problem für fl cht line gren control n ist keine Lösung !", which might be because you remove all Unicode (non-ASCII) characters.

Overall, your preprocessing should be general enough to transfer between domains (e.g. different languages or social media platforms) and gentle enough not to remove possibly relevant information. Since most geoparsing frameworks are designed to work directly on natural text rather than on stemmed or lemmatized text, heavy preprocessing should not be necessary anyway. Furthermore, certain preprocessing steps, such as resolving enclitic contractions, would not be expected to affect geoparsing, as they do not change the sentence context. This could be explored in an ablation study, though.
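As a rough illustration of what "gentle" could mean here, the sketch below only drops URLs, removes the `#` marker while keeping the hashtag word, and collapses whitespace. The function name `gentle_clean` and the choice of steps are assumptions for illustration, not the code in this repository. Python's `\w` is Unicode-aware for `str` patterns, so umlauts survive.

```python
import re
import unicodedata

# Hypothetical patterns for illustration; adjust to the data at hand.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")  # \w matches Unicode letters such as "ü"
WHITESPACE_RE = re.compile(r"\s+")


def gentle_clean(text: str) -> str:
    """Lightly clean a tweet without touching non-ASCII characters."""
    # Normalise to composed form so "ü" is one code point, not "u" plus a diaeresis.
    text = unicodedata.normalize("NFC", text)
    # Drop URLs, which rarely carry location names.
    text = URL_RE.sub(" ", text)
    # Keep the hashtag word, remove only the leading "#".
    text = HASHTAG_RE.sub(r"\1", text)
    # Collapse runs of whitespace left behind by the substitutions.
    return WHITESPACE_RE.sub(" ", text).strip()


tweet = (
    "WTF, die grüne Grenze zwischen Österreich und Deutschland ist 800 km lang. "
    "Kein Problem für #Flüchtlinge. #Grenzkontrollen ist keine Lösung!"
)
cleaned = gentle_clean(tweet)
print(cleaned)
```

With this sketch, casing, punctuation, and the words "Flüchtlinge" and "Grenzkontrollen" are left intact for the geoparser.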

If you want to make your code more scalable, try creating functions that you can apply to a dataframe (in parallel) or use the pandas str interface.
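A minimal sketch of the pandas `str` interface approach, assuming the tweets sit in a DataFrame column called `text` (the column name, the helper `clean_tweets`, and the specific cleaning steps are illustrative assumptions):

```python
import pandas as pd


def clean_tweets(tweets: pd.Series) -> pd.Series:
    """Vectorised light cleaning using the Series.str accessor."""
    return (
        tweets.str.normalize("NFC")                      # composed Unicode form
        .str.replace(r"https?://\S+", " ", regex=True)   # drop URLs
        .str.replace(r"#(\w+)", r"\1", regex=True)       # keep hashtag word, drop "#"
        .str.replace(r"\s+", " ", regex=True)            # collapse whitespace
        .str.strip()
    )


df = pd.DataFrame(
    {
        "text": [
            "Kein Problem für #Flüchtlinge. https://example.org/x",
            "#Grenzkontrollen  ist keine Lösung!",
        ]
    }
)
df["clean"] = clean_tweets(df["text"])
print(df["clean"].tolist())
```

Because the function takes and returns a Series, it can be applied to the whole column at once or to chunks of the DataFrame in separate worker processes.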

ahsan0029 commented 3 years ago

Thank you. I have adjusted my preprocessing and removed lemmatization and some other steps.
Now it looks like this after preprocessing:

WTF die grüne Grenze zwischen Österreich und Deutschland ist 800 km lang Kein Problem für fl cht linge grenz kontrolle n ist keine Lösung

ahsan0029 commented 3 years ago

I have uploaded the annotation.ipynb file.

Created manual annotations of the tweets.
Mapped latitude and longitude coordinates from location names using DBpedia SPARQL.
Stored the coordinates in GeoPandas.
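For readers who have not opened the notebook, a lookup of this kind might look roughly like the sketch below. It is not the code from annotation.ipynb: the helper names, the exact-label matching strategy, and the `format` request parameter are assumptions. It uses only the standard library and relies on the W3C `geo:lat` / `geo:long` properties and the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional, Tuple

ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint


def build_query(label: str, lang: str = "en") -> str:
    """Build a SPARQL query for the coordinates of a resource with this exact label."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return f'''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?place ?lat ?long WHERE {{
  ?place rdfs:label "{escaped}"@{lang} ;
         geo:lat ?lat ;
         geo:long ?long .
}} LIMIT 1
'''


def parse_coordinates(payload: dict) -> Optional[Tuple[float, float]]:
    """Extract (lat, long) from a SPARQL JSON result, or None if nothing matched."""
    bindings = payload["results"]["bindings"]
    if not bindings:
        return None
    first = bindings[0]
    return float(first["lat"]["value"]), float(first["long"]["value"])


def lookup(label: str, lang: str = "en") -> Optional[Tuple[float, float]]:
    """Query the endpoint over HTTP (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_query(label, lang), "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(f"{ENDPOINT}?{params}", timeout=10) as response:
        return parse_coordinates(json.load(response))


# Offline example with a hand-written response in the SPARQL JSON results layout.
sample = {
    "results": {
        "bindings": [
            {"lat": {"value": "48.2083"}, "long": {"value": "16.3731"}}
        ]
    }
}
coords = parse_coordinates(sample)
print(coords)
```

The resulting (lat, long) pairs could then be turned into point geometries, for example with `geopandas.points_from_xy(longs, lats)`, noting that the x coordinate is the longitude.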