
TermMatchAI

TermMatchAI is a library designed to match terms between two datasets using exact syntactical matching, fuzzy search matching, and semantic matching with AI models.

What is the use case?

Please see term_matching_report.csv for an example output. To auto-fit the column widths in Excel, press:

CTRL + A (select all)
ALT, O, C, A (Format → Column → AutoFit Selection)

The project uses a custom-trained version of SciBERT, a BERT-based language model pretrained on scientific vocabulary and text, used here as a SentenceBERT-style sentence embedding model. The model in this repo, eDNA_scibert_model, is SciBERT with additional training on eDNA vocabularies such as Darwin Core and the MIMARKS/MIxS checklists.

This project is under active development. Please raise an issue or reach out to bayden.willms@noaa.gov for any questions.

Setup

1. Clone the Repository

First, clone the repository:

git clone https://github.com/baydenwillms/TermMatchAI.git
cd TermMatchAI

2. Conda Environment

Environment configuration is up to the user; dependencies are listed in environment.yml. To set up the environment using Conda:

conda env create -f environment.yml
conda activate term-matching-env

3. Install the AI Models

spaCy Installation:

python -m spacy download en_core_web_lg

Custom eDNA SciBERT Model Installation:
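The model files are tracked with Git LFS (see the note under Current Training Data). If the model binaries did not download when you cloned the repo, pulling them with the standard Git LFS commands should work:

git lfs install
git lfs pull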

Usage

Term Comparison

To compare terms between two datasets, use the main.py script. Ensure your input is formatted correctly as dictionaries:

# Example dictionary for dataset 1
dataset1_terms = {
    "country": {"definition": "The name of the country where the sample was collected.", "examples": "United States"},
    "decimalLatitude": {"definition": "The latitude where the sample was collected.", "examples": "25.7617"},
    "decimalLongitude": {"definition": "The longitude where the sample was collected.", "examples": "-80.1918"},
}

# Example dictionary for dataset 2
dataset2_terms = {
    "nation": {"definition": "The nation of origin.", "examples": "USA"},
    "lat": {"definition": "Latitude coordinate.", "examples": "25.7617"},
    "lon": {"definition": "Longitude coordinate.", "examples": "-80.1918"},
}

Create new dictionaries with your data directly in the core/data_loading.py file. Running main.py then generates a report that details the matching terms between the two datasets, using exact matching, fuzzy search, and AI-based semantic matching:

python main.py

After running this, core/generate_report.py generates the report as an Excel .xlsx file in the project root directory.
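For intuition, here is a minimal, self-contained sketch of the three matching strategies. This is not the library's actual implementation: rapidfuzz and sentence-transformers are assumed dependencies (common choices for fuzzy and semantic matching), and the model path is an assumption based on the repo layout.

# Sketch only: illustrates exact, fuzzy, and semantic matching of term pairs.
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ai_matching/eDNA_scibert_model")  # assumed path

def match_term(term1, def1, term2, def2):
    exact = term1.lower() == term2.lower()          # exact name match
    fuzzy = fuzz.ratio(term1, term2)                # 0-100 string similarity
    emb1, emb2 = model.encode([def1, def2])         # embed the definitions
    semantic = util.cos_sim(emb1, emb2).item()      # cosine similarity, -1..1
    return exact, fuzzy, semantic

print(match_term("country", "Name of the country of collection.",
                 "nation", "The nation of origin."))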

Training the AI Model

Our custom eDNA SciBERT model is located in ai_matching/eDNA_scibert_model/. These are mostly binary files that are not human-readable and should not be modified directly. Instead, the model is trained using the ai_matching/eDNA_model_trainer.py script. You will probably need to write a short function to get your training data into a dictionary in the correct format; the dictionary-building functions already in the script, such as build_darwincore_dict, are a good reference (see the sketch after the command below). You can place CSVs, TSVs, YAMLs, Excel files, etc. in the script-dependencies folder to keep things organized and build relative paths. Once ready, train the model using:

python ai_matching/eDNA_model_trainer.py
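As a rough sketch of such a dictionary-building function (hypothetical: load_my_terms, the CSV path, and the column names are placeholders, following the term/definition/examples format shown under Usage):

import csv

def load_my_terms(path):
    """Load a CSV with 'term', 'definition', and 'examples' columns into
    the {term: {"definition": ..., "examples": ...}} format used above."""
    terms = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            terms[row["term"]] = {
                "definition": row["definition"],
                "examples": row["examples"],
            }
    return terms

training_terms = load_my_terms("script-dependencies/my_terms.csv")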

Current Training Data

The eDNA SciBERT model has been trained on eDNA-relevant vocabularies, including Darwin Core and the MIMARKS/MIxS checklists (see above).

The eDNA SciBERT model is configured in .gitattributes to be managed with Git LFS (Large File Storage). No further setup is required: just train the model and commit changes as you normally would.

Citations

SciBERT: Pretrained Language Model for Scientific Text

@inproceedings{Beltagy2019SciBERT,
  title={SciBERT: Pretrained Language Model for Scientific Text},
  author={Iz Beltagy and Kyle Lo and Arman Cohan},
  year={2019},
  booktitle={EMNLP},
  eprint={arXiv:1903.10676}
}

SciBERT is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.