Open davidoesch opened 1 year ago
One solution:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

categories = [
    "Base Maps, Land Cover, Aerial and Satellite Imagery",
    "Base Maps, Landscape Models",
    "Land Cover, Land Use",
    "Aerial and Satellite Imagery",
    "Location, Reference Systems",
    "Elevation",
    "Political and Administrative Boundaries",
    "Spatial Planning, Cadastre",
    "Spatial Planning, Spatial Development",
    "Cadastre, Land Registry",
]

# num_labels must match the number of categories; the default head has only 2 outputs.
# The classification head is randomly initialised for bert-base-cased, so the
# predictions are meaningless until the model is fine-tuned on labelled examples.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=len(categories))
model.eval()

text = "The National Map 1:1 million is a small-scale topographic map giving an overview of Central Europe: Switzerland and its neighbours from Lyons to Salzburg and from Strasbourg to Genoa on a handy overview map (Paris, Vienna, Frankfurt and Marseille on one sheet). The National Map 1:1 million is published in analogue format as a printed map and in digital format as the Swiss Map Raster and Swiss Map Vector."

# Tokenise the text (truncated to the model's maximum input length) and run inference.
encoded_text = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**encoded_text).logits

# Pick the category with the highest score.
predicted_category_index = torch.argmax(logits, dim=1).item()
print(categories[predicted_category_index])
Short update: I ran a test with https://spacy.io/ and Python. The code is attached as a zip.
Although I trained on the data, only 50% of the eCH categories are recognized from the abstract, title and short name. My ML-fu is too weak here...
The lack of improvement in the model can have several causes, such as insufficient training data, the choice of optimizer, or the training settings. The model may need more data to learn from, or the optimizer's hyperparameters may need fine-tuning to make better use of the data that is available. It also helps to evaluate the model's performance on a held-out evaluation set, to check that it is not overfitting to the training data.
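The held-out evaluation mentioned above could look roughly like this. It is a minimal sketch in plain Python, not the code from the zip: the helper names `split_holdout` and `accuracy`, the toy records and the keyword-based `predict` function are all invented for illustration and stand in for the real dataset and the trained model.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=42):
    """Shuffle labelled records and split them into (train, holdout)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(predict, records):
    """Fraction of (text, label) records for which predict(text) == label."""
    if not records:
        raise ValueError("no records to evaluate")
    correct = sum(1 for text, label in records if predict(text) == label)
    return correct / len(records)


# Toy data, invented for illustration only.
records = [
    ("Topographic overview map of Switzerland", "Base Maps, Landscape Models"),
    ("Digital height model with 2 m grid", "Elevation"),
    ("Municipal and cantonal boundaries", "Political and Administrative Boundaries"),
    ("Orthophoto mosaic from aerial images", "Aerial and Satellite Imagery"),
    ("Contour lines and spot heights", "Elevation"),
]

train, holdout = split_holdout(records, holdout_fraction=0.4)


def predict(text):
    """Placeholder for a trained model: a trivial keyword rule."""
    return "Elevation" if "height" in text.lower() else "Base Maps, Landscape Models"


print(f"train={len(train)} holdout={len(holdout)}")
print(f"holdout accuracy: {accuracy(predict, holdout):.2f}")
```

If accuracy on the training records is much higher than on the held-out records, that points to overfitting rather than to a lack of model capacity.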
From my point of view, 50% is already an approach that could work.
To install the corresponding language models/tokenizers:
python.exe -m spacy download de_core_news_lg
python.exe -m pip install spacy-lookups-data
python.exe -m spacy download de
Using NLP on the abstract, name and title, try to create tags/categories.
There are several NLP Python libraries that can be used to analyze text and add it to a predefined group. One popular library is the Natural Language Toolkit (NLTK).
NLTK provides a wide range of tools for natural language processing, including text classification. You can use NLTK's nltk.classify module to train a classifier on a dataset of labeled text, and then use the trained classifier to classify new text into predefined groups.
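As a rough illustration of the NLTK route, the sketch below trains a Naive Bayes classifier from `nltk.classify` on a handful of labelled texts. The training texts are toy examples invented here, and the `word_features` helper is a hypothetical bag-of-words extractor, not part of NLTK.

```python
from nltk.classify import NaiveBayesClassifier


def word_features(text):
    """Bag-of-words features: one boolean feature per lower-cased token."""
    return {f"contains({word})": True for word in text.lower().split()}


# Toy labelled examples, invented for illustration only.
train_data = [
    ("digital elevation model height grid", "Elevation"),
    ("contour lines terrain height", "Elevation"),
    ("cantonal and municipal boundaries", "Political and Administrative Boundaries"),
    ("administrative boundaries of districts", "Political and Administrative Boundaries"),
]

# NLTK classifiers are trained on (feature_dict, label) pairs.
train_set = [(word_features(text), label) for text, label in train_data]
classifier = NaiveBayesClassifier.train(train_set)

prediction = classifier.classify(word_features("terrain height model"))
print(prediction)
```

With real data, `word_features` would be applied to the abstract, title and short name of each dataset, and the labels would be the predefined categories.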
Another popular library is scikit-learn, a machine learning library for Python that provides various tools for working with text, including text classification. With the sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfVectorizer classes, you can convert a collection of text documents into a matrix of token counts (or TF-IDF values) that can be used as input for a classifier. sklearn.naive_bayes.MultinomialNB, sklearn.svm.SVC and sklearn.linear_model.LogisticRegression are some of the scikit-learn classifiers that can be used for text classification.
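A minimal sketch of the scikit-learn approach, assuming TF-IDF features fed into logistic regression. The texts and labels are toy data made up for this example; the choice of `LogisticRegression` over the other classifiers named above is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled examples, invented for illustration only.
texts = [
    "digital elevation model with height grid",
    "contour lines and terrain height",
    "cantonal and municipal boundaries",
    "administrative boundaries of districts",
    "land use statistics for agriculture",
    "land cover classes forest and meadow",
]
labels = [
    "Elevation",
    "Elevation",
    "Political and Administrative Boundaries",
    "Political and Administrative Boundaries",
    "Land Cover, Land Use",
    "Land Cover, Land Use",
]

# The pipeline turns raw text into TF-IDF vectors and then fits the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

prediction = model.predict(["terrain height model"])[0]
print(prediction)
```

Swapping in `MultinomialNB` or `SVC` only requires changing the second pipeline step, which makes it easy to compare classifiers on the same features.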
Other libraries such as spaCy, TextBlob and Gensim have their own features and capabilities that can also be used to classify text into predefined groups.
Note that before using any of these libraries, you need labeled data to train the classifier, and the text data has to be preprocessed.
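What preprocessing means depends on the library, but a simple version might look like the sketch below: lower-casing, tokenising and dropping stop words. The `preprocess` function and the tiny `STOP_WORDS` set are assumptions made for illustration; a real project would more likely rely on the tokenizer and stop-word list of the chosen library.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer and language-specific.
STOP_WORDS = {"a", "an", "and", "as", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lower-case the text, split it into letter/digit runs and drop stop words."""
    tokens = re.findall(r"[^\W_]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


tokens = preprocess("The National Map 1:1 million is a small-scale topographic map.")
print(tokens)
```

The regular expression is Unicode-aware, so German words with umlauts stay intact as single tokens.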