Closed: snowformatics closed this issue 5 months ago
Thanks for looking at predictions. It's one of my favorite features.
Here is a link to the notebook that (near the end) shows prediction: https://nbviewer.org/github/fastlmm/FaST-LMM/blob/master/doc/ipynb/FaST-LMM.ipynb Can you tell me where you found a broken link?
Also, this line in the notebook: pylab.errorbar(np.arange(mean.iid_count),mean.val,yerr=np.sqrt(np.diag(covariance.val)),fmt='.') should now be: pylab.errorbar(np.arange(mean.iid_count),mean.val[:,0],yerr=np.sqrt(np.diag(covariance.val)),fmt='.')
Please try to run the example in the notebook and let me know if you can. -- Carl
Hi Carl,
thanks for your answer and for implementing this useful feature. I just realized that nbviewer.org was blocked by our firewall, so sorry for the confusion. Everything works as expected.
I have one question: I want to run the model fit with 10 different test/train splits:
test_indices = np.random.choice(total_samples, 60, replace=False)
train_indices = np.array([i for i in range(total_samples) if i not in test_indices])
train = snp_reader[train_indices, :]
test = snp_reader[test_indices, :]
When I run this 10 times, some models predict very well, but for some models the predicted values are exactly the same for all samples. Do you have any idea why this is the case?
Thanks Stefanie
Can you tell me (or remind me) the size of your bed file (iid_count and sid_count)?
By the way, this is working for me (with seeds 0 to 10). (I switched to an explicit random number generator so that I can control the seed.)
from pysnptools.snpreader import Pheno, Bed
from fastlmm.inference import FastLMM
import numpy as np
import pylab
from fastlmm.util import example_file # Download and return local file name
# define file names
bed_fn = example_file('tests/datasets/synth/all.*','*.bed')
snp_reader = Bed(bed_fn, count_A1=True)
pheno_fn = example_file("tests/datasets/synth/pheno_10_causals.txt")
cov_fn = example_file("tests/datasets/synth/cov.txt")
seed = 0 # set to None to get a new seed each time
rng = np.random.default_rng(seed=seed)
test_indices = rng.choice(snp_reader.iid_count, 60, replace=False)
train_indices = list(set(range(snp_reader.iid_count))-set(test_indices))
train = snp_reader[train_indices, :]
test = snp_reader[test_indices, :]
# In the style of scikit-learn, create a predictor and then train it.
fastlmm = FastLMM(GB_goal=2)
fastlmm.fit(K0_train=train,X=cov_fn,y=pheno_fn)
# Now predict with it
mean, covariance = fastlmm.predict(K0_whole_test=test,X=cov_fn)
print("Predicted means and stdevs")
print(mean.val[:,0])
print(np.sqrt(np.diag(covariance.val)))
#Plot actual phenotype and predicted phenotype
whole_pheno = Pheno(pheno_fn)
actual_pheno = whole_pheno[whole_pheno.iid_to_index(mean.iid),:].read()
pylab.plot(actual_pheno.val,"r.")
pylab.plot(mean.val,"b.")
pylab.errorbar(np.arange(mean.iid_count),mean.val[:,0], yerr=np.sqrt(np.diag(covariance.val)),fmt='.')
pylab.xlabel('testing examples')
pylab.ylabel('phenotype, actual (red) and predicted (blue with stdev)')
pylab.show()
Hi Carl,
I have a dataset of 200 samples x 500,000 SNPs.
I still see this problem, and it seems to depend on the training subset. I am just wondering why some runs give exactly the same prediction for every sample while other runs give predictions that are closer to the correct phenotypes.
Thanks Stefanie
Yes, I think that sometimes it is just not getting enough training samples, so it is deciding to ignore the SNP information and always predict a middle, constant phenotype value. At the extreme, if you have 300 samples and you ask it to predict on 299, you would not expect it to predict well.
Two things to do:
from pysnptools.snpreader import Pheno, Bed
from fastlmm.inference import FastLMM
import numpy as np
from fastlmm.util import example_file
from sklearn.model_selection import KFold # install scikit-learn
import pylab
# Load data
bed_fn = example_file('tests/datasets/synth/all.*','*.bed')
snp_reader = Bed(bed_fn, count_A1=True)
pheno_fn = example_file("tests/datasets/synth/pheno_10_causals.txt")
pheno = Pheno(pheno_fn)
cov_fn = example_file("tests/datasets/synth/cov.txt")
# KFold setup
n_splits = 10
kf = KFold(n_splits=n_splits)
# Prepare to collect results
results = []
# Start cross-validation
for train_indices, test_indices in kf.split(range(snp_reader.iid_count)):
    # Split data into training and testing
    train = snp_reader[train_indices, :]
    test = snp_reader[test_indices, :]
    # Train the model
    fastlmm = FastLMM(GB_goal=2)
    fastlmm.fit(K0_train=train, X=cov_fn, y=pheno)
    # Test the model
    mean, covariance = fastlmm.predict(K0_whole_test=test, X=cov_fn)
    # Evaluate the model (e.g., calculating some metrics)
    actual_pheno = pheno[pheno.iid_to_index(mean.iid), :].read()
    # Here you might want to calculate some metric between actual_pheno.val and mean.val
    # and append it to the results list
    # For example: results.append(some_metric_function(actual_pheno.val, mean.val))
    # Optionally, plot results for each fold
    pylab.plot(actual_pheno.val, "r.")
    pylab.plot(mean.val, "b.")
    pylab.errorbar(np.arange(mean.iid_count), mean.val[:, 0], yerr=np.sqrt(np.diag(covariance.val)), fmt='.')
    pylab.xlabel('Testing examples')
    pylab.ylabel('Phenotype, actual (red) and predicted (blue with stdev)')
    pylab.show()
# After the loop, aggregate your results for overall performance metrics
# For example, mean_result = np.mean(results)
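As a concrete stand-in for the some_metric_function placeholder above (that name is just a placeholder, not part of the library), a minimal sketch using Pearson correlation as the per-fold metric:
import numpy as np
def pearson_metric(actual_val, predicted_val):
    # Both arrays have shape (n_test, 1); compare the single phenotype column.
    return np.corrcoef(actual_val[:, 0], predicted_val[:, 0])[0, 1]
# Inside the loop above:
# results.append(pearson_metric(actual_pheno.val, mean.val))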
Thanks Carl, I tried your dataset from the notebook and another dataset of mine, and I don't see this problem. So it seems to be specific to the first dataset.
Best regards Stefanie
Addendum: starting from a dataset which worked very well for genomic prediction, I created a set of random phenotypes and I see exactly the same behavior. It seems the model is not able to distinguish between the different observations and predicts approximately the average.
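For reference, a minimal sketch of how such a random phenotype can be built with PySnpTools (pheno_fn and the seeded generator are taken from the scripts above; constructing a SnpData this way is my assumption of a convenient approach, not the only one):
from pysnptools.snpreader import Pheno, SnpData
import numpy as np
rng = np.random.default_rng(0)
pheno_data = Pheno(pheno_fn).read()
# Permute the phenotype rows so any real SNP-phenotype signal is destroyed.
random_pheno = SnpData(iid=pheno_data.iid, sid=pheno_data.sid, val=rng.permutation(pheno_data.val))
# Then pass y=random_pheno to FastLMM.fit, as in the scripts above.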
Thanks again, Stefanie
That sounds right. - Carl
Hi Carl,
one more question: will the training data (K0_train and y) be standardized before training?
fastlmm.fit(K0_train=train, y=pheno)
Thanks Stefanie
Stefanie,
Good question!
Phenotypes are never standardized. (If you want to pre-standardize them with PySnpTools, let me know and I can show you how).
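For reference, a minimal sketch of pre-standardizing a phenotype with PySnpTools, assuming the Unit standardizer (mean 0, standard deviation 1 per column) is what you want:
from pysnptools.snpreader import Pheno
from pysnptools.standardizer import Unit
pheno_data = Pheno(pheno_fn).read()          # read the phenotype into memory
pheno_data = pheno_data.standardize(Unit())  # rescale each column to mean 0, std 1
# pheno_data can then be passed as y=pheno_data to FastLMM.fit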
By default, if K0_train is given as SNP data, it is standardized twice.
You can replace these defaults when you create the FastLMM predictor. (Also, you can give K0_train directly as a similarity matrix instead of as SNPs from which to create a similarity matrix.)
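If I read the defaults correctly (worth double-checking against the FaST-LMM documentation), the two standardizations are: each SNP is unit standardized, and the resulting kernel is rescaled so its diagonal averages 1.0. A minimal sketch of overriding them in the constructor, assuming the snp_standardizer and kernel_standardizer keyword arguments:
from pysnptools.standardizer import Beta, DiagKtoN
from fastlmm.inference import FastLMM
# The keyword names below are assumptions based on my reading of the FastLMM predictor API.
fastlmm = FastLMM(GB_goal=2,
                  snp_standardizer=Beta(1, 25),    # e.g. Beta weighting instead of Unit
                  kernel_standardizer=DiagKtoN())  # keep the default kernel rescaling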
-- Carl
p.s. I don't think you are using covariates, but if you do, by default, they are unit standardized but this, too, can be changed.
Thanks, Carl that makes sense. The two links are very helpful!
Stefanie
Fixed with v 0.6.8
Hi,
I want to try to predict new phenotypes with FaST-LMM, but the link to the notebook is dead. My data is in a format similar to the FaST-LMM GWAS examples:
bed = Bed(str(bed_file), count_A1=False)
pheno = Pheno(str(pheno_file))
How can I build a model and predict new phenotypes? Should I include the new phenotype in the pheno_file with NaN values?
Thanks Stefanie