We tested all the following commands on macOS (v12.1) and in a Linux (kernel v3.10) environment.
conda env create -f requirements_rfamgen.yaml
conda activate rfamgen
RfamGen works with Python 3.7 and PyTorch 1.8, but probably works with newer versions. Most processes use the following packages. Please install them one by one if necessary.
pytorch
numpy
pandas
biopython
sklearn
nltk
infernal
viennarna
h5py
All the following commands are examples using RF00234. Computation times on an Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz are given in comments.
Data preparation is performed based on a cmfile, a file format used by the Infernal software. Here, we used a cmfile downloaded from the Rfam database. If you want to build your own, you can generate one with the cmbuild command of Infernal. cmbuild requires SS_cons (consensus secondary structure) in an alignment file in Stockholm format. SS_cons can be added manually or by consensus structure estimation software such as R-scape.
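For reference, a minimal Stockholm alignment with an SS_cons line looks like the following (the sequences and structure are illustrative, not from RF00234):

```
# STOCKHOLM 1.0
seq1          GGCACGUUCGCGUGCC
seq2          GGCACGGAAACGUGCC
#=GC SS_cons  <<<<<<....>>>>>>
//
```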
# Optional. For cleaning sequences in Rfam DB.
python scripts/get_tidy_sequences_from_fasta_rfam.py \
--seed_file ./datasets/Rfam.seed \
--rfam RF00234 \
--output_dir datasets/RF00234 \
--cpu 1
python scripts/make_onehot_from_traceback.py \
--fasta datasets/RF00234/RF00234_unique_seed_removed.fa \
--cmfile ./datasets/RF00234/RF00234.cm \
--cpu 1
# real 3m24.649s
# user 3m0.349s
# sys 0m3.531s
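The one-hot conversion step can be pictured with a simplified sketch. Note this is an assumption for illustration: the actual script encodes the CM traceback obtained by aligning each sequence to the cmfile, not raw nucleotides, and `onehot` below is a hypothetical helper, not part of the repository.

```python
def onehot(seq, alphabet="ACGU"):
    # Map each residue to a 0/1 indicator vector over the alphabet.
    # Simplified: the real pipeline one-hot encodes CM traceback states.
    return [[1 if base == a else 0 for a in alphabet] for base in seq]

print(onehot("ACG"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
```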
After conversion to one-hot representations, split the data into train/valid/test sets.
python scripts/split_onehot_train_valid_test.py \
-i datasets/RF00234/RF00234_unique_seed_removed_notrunc_traceback_onehot_cm.h5 \
--train_ratio 0.7 --random_state 42
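The split presumably works like the sketch below: shuffle indices reproducibly with `random_state`, then partition by `train_ratio`. The even valid/test split of the remainder is an assumption; the actual script may divide it differently.

```python
import random

def split_indices(n, train_ratio=0.7, random_state=42):
    # Shuffle indices with a fixed seed so the split is reproducible,
    # then take the first train_ratio fraction as training data and
    # divide the remainder evenly between valid and test (assumption).
    idx = list(range(n))
    random.Random(random_state).shuffle(idx)
    n_train = int(n * train_ratio)
    n_valid = (n - n_train) // 2
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train, valid, test = split_indices(100)
print(len(train), len(valid), len(test))  # 70 15 15
```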
You can generate a weight file from the training data prepared above [Marks, 2011; Morcos, 2011]. Typically, the threshold is 0.01–0.2.
The computation time highly depends on the data size.
python scripts/generate_weight.py \
--mode cm \
-i datasets/RF00234/RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_train.h5 \
--threshold 0.1
# real 1m57.974s
# user 1m7.283s
# sys 1m15.031s
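The reweighting scheme in these references gives each sequence a weight of 1/m, where m counts the sequences within the identity threshold of it. A stdlib-only sketch, assuming the threshold is a normalized Hamming distance cutoff on equal-length (aligned) sequences; the exact convention in generate_weight.py may differ:

```python
def sequence_weights(seqs, threshold=0.1):
    # Each sequence's weight is 1/m, where m counts sequences (including
    # itself) whose normalized Hamming distance falls below the threshold.
    # This downweights densely sampled clusters of near-identical sequences.
    L = len(seqs[0])
    weights = []
    for a in seqs:
        m = sum(
            1 for b in seqs
            if sum(x != y for x, y in zip(a, b)) / L < threshold
        )
        weights.append(1.0 / m)
    return weights

# Toy data: two identical sequences, one near-duplicate, one outlier.
print(sequence_weights(["AAAA", "AAAA", "AAAC", "GGGG"], threshold=0.5))
```

With threshold=0.5, the three similar sequences share one cluster (weight 1/3 each) while the outlier keeps weight 1.0.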
train.py
trains a VAE model, generates a config file for the training run, and optionally writes log/checkpoint files.
Example:
python scripts/train.py \
--data_dir datasets/RF00234 \
--X_train RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_train.h5 \
--w_train RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_train_weight_threshold0p1.h5 \
--X_valid RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_valid.h5 \
--w_valid RF00234_unique_seed_removed_notrunc_traceback_onehot_cm_valid_weight_threshold0p1.h5 \
--epoch 3 \
--beta 1e-3 --use_anneal --use_early_stopping \
--log --log_dir ./outputs/RF00234
# real 0m16.900s
# user 0m12.500s
# sys 0m3.155s
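The --beta and --use_anneal flags suggest a beta-VAE objective (reconstruction loss + beta * KL divergence) with KL annealing. A sketch of a linear annealing schedule, assuming beta ramps from 0 to its maximum over a fixed number of epochs; the actual schedule in train.py may differ:

```python
def beta_schedule(epoch, n_anneal_epochs=10, beta_max=1e-3):
    # Linear KL-annealing: ramp beta from 0 to beta_max over the first
    # n_anneal_epochs, then hold it constant. Annealing lets the decoder
    # learn to reconstruct before the KL term starts shaping the latents.
    return beta_max * min(1.0, epoch / n_anneal_epochs)

for epoch in (0, 5, 10, 20):
    print(epoch, beta_schedule(epoch))
```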
scripts/sampling_from_gauss.py
generates sequences from a normal distribution in the latent space. You need to provide the model parameters, the config file, and a cmfile. Generation runs very fast unless the cmfile is very long.
python scripts/sampling_from_gauss.py \
--config ./outputs/RF00234/config.yaml \
--ckpt ./outputs/RF00234/model_epoch3.pt \
--cmfile ./datasets/RF00234/RF00234.cm \
--outfasta ./outputs/RF00234/sampled_5seq.fa \
--n_samples 5
# real 0m7.447s
# user 0m6.201s
# sys 0m0.635s
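Conceptually, sampling draws latent vectors z from a standard normal N(0, I) and passes each through the trained decoder to emit a sequence. A stdlib-only sketch of the latent draw; the latent_dim and seed here are illustrative (the real values come from config.yaml), and the decoder step is omitted:

```python
import random

def sample_latents(n_samples, latent_dim, seed=0):
    # Draw n_samples latent vectors from a standard normal N(0, I).
    # Each vector would then be decoded into a sequence by the VAE
    # decoder (not shown here).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            for _ in range(n_samples)]

zs = sample_latents(n_samples=5, latent_dim=16)
print(len(zs), len(zs[0]))  # 5 16
```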
conda env create -f requirements_bayesiankin.yaml
See more details in /notebooks/bayesiankinetics/kinetics_analysis_bypyro.ipynb.
It takes less than 1 day to estimate the kinetics of 2,000 sequences x 2 replicates with SVI.
The NGS data are available under SRA accession PRJNA1044007.