TermMatchAI is a library for matching terms between two datasets using exact syntactic matching, fuzzy string matching, and semantic matching with AI models.
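As a minimal sketch of the fuzzy tier (illustrative only, using Python's stdlib difflib; the library's actual matcher may use a different algorithm):

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two term names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# An exact match scores 1.0 ...
print(fuzzy_score("country", "country"))  # 1.0
# ... while related-but-different spellings score somewhere in between
print(fuzzy_score("decimalLatitude", "lat"))
```

Exact matching catches identical names, fuzzy matching catches near-identical spellings, and the semantic tier (below) handles terms that share meaning but not spelling.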
What is the use case?
Please see term_matching_report.csv for an example of the output. To auto-format it in Excel, press Ctrl+A to select all, then press Alt, O, C, A in sequence to auto-fit the column widths.
The project uses a custom-trained version of SciBERT. SciBERT is a BERT-based language model pretrained on scientific text, used here as a SentenceBERT-style sentence-embedding model. The model used in this repo, eDNA_scibert_model, is SciBERT with additional training on eDNA repositories such as Darwin Core and MIMARKS MIxS.
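Semantic matching compares embedding vectors rather than strings; the core operation is cosine similarity. A sketch with toy 3-dimensional vectors (a real SciBERT embedding has hundreds of dimensions; these numbers are purely illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings"; in practice these would come from eDNA_scibert_model
emb_country = [0.9, 0.1, 0.2]
emb_nation  = [0.85, 0.15, 0.25]
emb_lat     = [0.1, 0.95, 0.0]

print(cosine_similarity(emb_country, emb_nation))  # high: near-synonyms
print(cosine_similarity(emb_country, emb_lat))     # low: unrelated terms
```

A model trained on eDNA vocabulary places related terms (e.g. "country" and "nation") close together in embedding space, which is what lets semantic matching succeed where string matching fails.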
This project is under active development. Please raise an issue or reach out to bayden.willms@noaa.gov for any questions.
First, clone the repository:
git clone https://github.com/baydenwillms/TermMatchAI.git
cd TermMatchAI
Environment configuration is up to the user. Dependencies are listed in environment.yml. To set up the environment using Conda:
conda env create -f environment.yml
conda activate term-matching-env
spaCy installation:
python -m spacy download en_core_web_lg
Then install Git LFS and pull the model files it tracks:
git lfs install
git lfs pull
To compare terms between two datasets, use the main.py script. Ensure your input is formatted correctly as dictionaries:
# Example dictionary for dataset 1
dataset1_terms = {
    "country": {"definition": "The name of the country where the sample was collected.", "examples": "United States"},
    "decimalLatitude": {"definition": "The latitude where the sample was collected.", "examples": "25.7617"},
    "decimalLongitude": {"definition": "The longitude where the sample was collected.", "examples": "-80.1918"},
}

# Example dictionary for dataset 2
dataset2_terms = {
    "nation": {"definition": "The nation of origin.", "examples": "USA"},
    "lat": {"definition": "Latitude coordinate.", "examples": "25.7617"},
    "lon": {"definition": "Longitude coordinate.", "examples": "-80.1918"},
}
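As an illustration of how dictionaries in this shape can be compared (this is a sketch, not main.py's actual logic), one could rank each candidate by the string similarity of its definition:

```python
from difflib import SequenceMatcher

# Abbreviated copies of the example dictionaries above
dataset1_terms = {
    "country": {"definition": "The name of the country where the sample was collected.", "examples": "United States"},
    "decimalLatitude": {"definition": "The latitude where the sample was collected.", "examples": "25.7617"},
}
dataset2_terms = {
    "nation": {"definition": "The nation of origin.", "examples": "USA"},
    "lat": {"definition": "Latitude coordinate.", "examples": "25.7617"},
}

def best_match(definition, candidates):
    """Return the candidate term whose definition is most similar."""
    return max(
        candidates,
        key=lambda term: SequenceMatcher(
            None, definition.lower(), candidates[term]["definition"].lower()
        ).ratio(),
    )

for term, info in dataset1_terms.items():
    print(term, "->", best_match(info["definition"], dataset2_terms))
```

Comparing definitions rather than bare term names is what makes pairs like "country"/"nation" recoverable at all; the actual pipeline layers exact, fuzzy, and semantic comparisons on top of this idea.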
Create new dictionaries with your data directly in the core/data_loading.py file. Running main.py then generates a report detailing the matching terms between the two datasets using exact matching, fuzzy search, and AI-based semantic matching:
python main.py
After the run completes, core/generate_report.py generates the report as an Excel .xlsx file in the project's root directory.
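The report is tabular: one row per matched term pair. The column names and scores below are hypothetical (the actual columns produced by core/generate_report.py may differ), and this sketch writes CSV with the stdlib rather than .xlsx, but the structure is the same:

```python
import csv
import io

# Hypothetical report rows: (dataset1 term, dataset2 term, match type, score)
rows = [
    ("country", "nation", "semantic", 0.87),
    ("decimalLatitude", "lat", "fuzzy", 0.72),
    ("decimalLongitude", "lon", "fuzzy", 0.71),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["dataset1_term", "dataset2_term", "match_type", "score"])
writer.writerows(rows)

report = buffer.getvalue()
print(report)
```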
Our custom eDNA SciBERT model is located in ai_matching/eDNA_scibert_model/. These are mostly binary files that are not human-readable and should not be modified directly. Instead, the model is trained using the ai_matching/eDNA_model_trainer.py script. You will likely need to write a short function to get your training data into a dictionary in the correct format; see the dictionary-building functions already in the script, such as build_darwincore_dict. You can place CSVs, TSVs, YAMLs, Excel files, etc. in the script-dependencies folder to keep things organized and build relative paths. Once ready, train the model using:
python ai_matching/eDNA_model_trainer.py
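The exact signature of build_darwincore_dict isn't shown here, so the helper below is hypothetical; it only illustrates getting tabular training data into the term-dictionary format used earlier in this README:

```python
import csv
import io

def build_terms_dict(csv_text):
    """Hypothetical helper: parse rows with 'term', 'definition', and
    'examples' columns into the {term: {definition, examples}} format."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["term"]: {
            "definition": row["definition"],
            "examples": row["examples"],
        }
        for row in reader
    }

# Inline sample standing in for a file placed in script-dependencies/
sample_csv = """term,definition,examples
country,The name of the country where the sample was collected.,United States
decimalLatitude,The latitude where the sample was collected.,25.7617
"""

terms = build_terms_dict(sample_csv)
print(terms["country"]["examples"])  # United States
```

In practice you would read the file from the script-dependencies folder via a relative path, as the existing dictionary-building functions do.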
The eDNA SciBERT model has been trained on eDNA term repositories such as Darwin Core and MIMARKS MIxS.
The eDNA SciBERT model is configured in .gitattributes to be managed with Git LFS (Large File Storage). No further setup is required: just train the model and commit your changes as you normally would.
@inproceedings{Beltagy2019SciBERT,
  title={SciBERT: A Pretrained Language Model for Scientific Text},
  author={Iz Beltagy and Kyle Lo and Arman Cohan},
  year={2019},
  booktitle={EMNLP},
  eprint={arXiv:1903.10676}
}
SciBERT is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.