amberT15 / LLM_eval

Code repository for the study "Evaluating the representational power of pre-trained DNA language models for regulatory genomics"
MIT License

gLM evaluation analysis pipeline

This repository contains the code used to generate the results for the study "Evaluating the representational power of pre-trained DNA language models for regulatory genomics". (Pre-print link)

The data_generation folder contains scripts for pre-processing the datasets and notebooks that use each gLM to extract layer embeddings. The figure folder contains the code and generated figures for the paper.

The rest of the code is organized by task and analysis:

Within each folder, the code is organized by model input. Most folders contain training scripts for models built on gLM representations (all gLMs except NT), on NT representations, and on one-hot encoded sequences.
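For readers unfamiliar with the one-hot baseline input, the following is a minimal sketch of one-hot encoding a DNA sequence. It is not taken from this repository: the function name `one_hot_encode`, the `ACGT` channel order, and the choice to map ambiguous bases such as `N` to all-zero rows are assumptions made for illustration, and the training scripts here may encode sequences differently.

```python
# Illustrative only: names and conventions below are assumptions,
# not the encoding used by the scripts in this repository.
ALPHABET = "ACGT"  # assumed channel order


def one_hot_encode(sequence, alphabet=ALPHABET):
    """Return a list of rows, one per base, each with len(alphabet) channels."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in sequence.upper():
        row = [0.0] * len(alphabet)
        # Bases outside the alphabet (e.g. N) stay as an all-zero row.
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded
```

Calling `one_hot_encode("ACGT")` yields a 4 x 4 nested list with a single 1.0 per row. The sketch uses plain Python lists to stay dependency-free; in practice the result would typically be converted to an array in whichever framework the model uses.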

Since not all gLMs can be installed in the same environment, three different environments were used in this study: tf_requirments.yml, torch_requirments.yml, and gpn_requirements.yml.

The original datasets and the models trained for this study can be accessed from Zenodo; they should be decompressed into the base folder of this repo. No installation is required to run the analyses in this repository.
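The decompression step can be sketched with the Python standard library. This is a hedged example, not part of the repository: the helper name `unpack_into_repo` is hypothetical, and the archive file names depend on what the Zenodo record actually provides.

```python
import shutil
from pathlib import Path


def unpack_into_repo(archive_path, repo_root="."):
    """Extract one downloaded archive into the repository base folder.

    Hypothetical helper: archive names are not specified in this README.
    """
    archive = Path(archive_path)
    if not archive.is_file():
        raise FileNotFoundError(f"archive not found: {archive}")
    target = Path(repo_root).resolve()
    # shutil.unpack_archive infers the format (.zip, .tar.gz, ...)
    # from the file extension.
    shutil.unpack_archive(str(archive), str(target))
    return target
```

For example, `unpack_into_repo("downloaded_archive.zip", repo_root=".")` run from the repository base folder would extract the archive contents in place, where `downloaded_archive.zip` stands in for the real file name.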