davidoesch / geoservice_harvester_poc

Open Geoservice scraper proof of concept to extract info of each dataset contained in an OGC compliant Geoservice
GNU General Public License v3.0

Tag to inspire or eCH catalog #5

Open davidoesch opened 1 year ago

davidoesch commented 1 year ago

Using NLP on the abstract, name, and title, try to create tags/categories.

There are several Python NLP libraries that can be used to analyze text and assign it to a predefined group. One popular library is the Natural Language Toolkit (NLTK).

NLTK provides a wide range of tools for natural language processing, including text classification. You can use NLTK's nltk.classify module to train a classifier on a dataset of labeled text, and then use the trained classifier to classify new text into predefined groups.
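A minimal sketch of this idea, assuming a tiny hand-labeled training set (in the real setup the texts would be harvested abstracts/titles and the labels the eCH categories):

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    # simple bag-of-words feature dict, as expected by NLTK classifiers
    return {word: True for word in text.lower().split()}

# illustrative labeled examples (assumed, not real harvester output)
train_set = [
    (features("topographic overview map of switzerland"), "Base Maps"),
    (features("digital terrain model elevation grid"), "Elevation"),
    (features("orthophoto mosaic from aerial imagery"), "Aerial and Satellite Imagery"),
]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(features("overview map of central europe")))
```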

Another popular library is scikit-learn, a machine learning library for Python that provides various tools for natural language processing, including text classification. With the sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfVectorizer classes, you can convert a collection of text documents into a matrix of token counts (or TF-IDF values) that can be used as input for a classifier. sklearn.naive_bayes.MultinomialNB, sklearn.svm.SVC, and sklearn.linear_model.LogisticRegression are some of the classifiers provided by scikit-learn that can be used for text classification.
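A minimal scikit-learn sketch combining a TF-IDF vectorizer with a Naive Bayes classifier, assuming a tiny illustrative training set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# illustrative labeled examples (assumed, not real harvester output)
texts = [
    "Topographic overview map of Switzerland",
    "Digital terrain model with 2 m resolution",
    "Orthophoto mosaic from aerial imagery",
]
labels = ["Base Maps", "Elevation", "Aerial and Satellite Imagery"]

# TF-IDF features feeding a Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Overview map of Central Europe"])[0])
```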

You can also use other libraries such as spaCy, TextBlob, and Gensim; they all have their own features and capabilities that you can use to classify text into predefined groups.

It is important to note that, before using these libraries, you need labeled data to train the classifier, and you also need to preprocess the text data.
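A minimal sketch of the kind of preprocessing meant here (lowercasing, stripping punctuation, dropping very short tokens); a real pipeline would also remove language-specific stop words, e.g. via NLTK or spaCy:

```python
import re

def preprocess(text):
    text = text.lower()
    # keep only alphabetic tokens, including German/French accented letters
    tokens = re.findall(r"[a-zäöüéèà]+", text)
    return [t for t in tokens if len(t) > 2]

print(preprocess("The National Map 1:1 million is a small-scale topographic map."))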

davidoesch commented 1 year ago

One possible solution:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Define the categories
categories = [
    "Base Maps, Land Cover, Aerial and Satellite Imagery",
    "Base Maps, Landscape Models",
    "Land Cover, Land Use",
    "Aerial and Satellite Imagery",
    "Location, Reference Systems",
    "Elevation",
    "Political and Administrative Boundaries",
    "Spatial Planning, Cadastre",
    "Spatial Planning, Spatial Development",
    "Cadastre, Land Registry",
]

# Load the pre-trained model and tokenizer; the classification head must
# match the number of categories (and stays randomly initialized until the
# model is fine-tuned on labeled data, so predictions are not meaningful
# before fine-tuning)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=len(categories)
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Define the text to classify
text = (
    "The National Map 1:1 million is a small-scale topographic map giving an "
    "overview of Central Europe: Switzerland and its neighbours from Lyons to "
    "Salzburg and from Strasbourg to Genoa on a handy overview map (Paris, "
    "Vienna, Frankfurt and Marseille on one sheet). The National Map 1:1 "
    "million is published in analogue format as a printed map and in digital "
    "format as the Swiss Map Raster and Swiss Map Vector."
)

# Encode the text and add the special tokens
encoded_text = tokenizer.encode(text, return_tensors="pt")

# Get the predictions from the model
outputs = model(encoded_text)
predictions = outputs[0]

# Get the index of the most likely category
_, predicted_category_index = torch.max(predictions, dim=1)

# Print the most likely category
print(categories[predicted_category_index.item()])
```
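Since the classification head above is untrained, an alternative that needs no labeled data at all would be zero-shot classification. A minimal sketch, assuming the transformers zero-shot-classification pipeline and the facebook/bart-large-mnli checkpoint:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# a subset of the eCH categories, for illustration
categories = [
    "Base Maps, Land Cover, Aerial and Satellite Imagery",
    "Elevation",
    "Political and Administrative Boundaries",
]

text = ("The National Map 1:1 million is a small-scale topographic map "
        "giving an overview of Central Europe.")

result = classifier(text, candidate_labels=categories)
print(result["labels"][0])  # highest-scoring category
```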

davidoesch commented 1 year ago

Short update: I ran a test with https://spacy.io/ and Python; it is in the attached zip.

Although I trained on the data, only 50% of the eCH categories are recognized based on abstract, title, and short description. My ML foo is too weak here.... The lack of improvement in the model can be due to various reasons, such as insufficient training data, the choice of optimizer, and training settings. It's possible that you need to provide more data for the model to learn from, or fine-tune the hyperparameters of the optimizer to make better use of the data you do have. Additionally, it might be helpful to evaluate your model's performance on a held-out evaluation set, to ensure that it's not overfitting to the training data.
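For reference, a minimal sketch of spaCy v3 text categorization along the lines of this test, assuming a tiny illustrative training set (the real labels would be the eCH categories and the texts the harvested abstracts/titles):

```python
import spacy
from spacy.training import Example

# illustrative labeled examples (assumed, not the real training data)
TRAIN_DATA = [
    ("Topographic overview map of Switzerland",
     {"cats": {"Base Maps": 1.0, "Elevation": 0.0}}),
    ("Digital terrain model with 2 m resolution",
     {"cats": {"Base Maps": 0.0, "Elevation": 1.0}}),
]

nlp = spacy.blank("de")             # blank German pipeline
textcat = nlp.add_pipe("textcat")   # exclusive single-label classifier
for label in ("Base Maps", "Elevation"):
    textcat.add_label(label)

# convert the raw tuples into spaCy Example objects
examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in TRAIN_DATA]

optimizer = nlp.initialize(lambda: examples)
for _ in range(20):                 # a few epochs over the toy data
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

doc = nlp("Overview map of Central Europe")
print(doc.cats)                     # score per category
```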

From my point of view: 50% is already an approach that could work.

scapy_test.zip

To install the corresponding language sets/tokenizers:

```
python.exe -m spacy download de_core_news_lg
python.exe -m pip install spacy-lookups-data
python.exe -m spacy download de
```