Closed: snowformatics closed this issue 5 months ago
Thanks for looking at predictions. It's one of my favorite features.
Here is a link to the notebook that (near the end) shows prediction: https://nbviewer.org/github/fastlmm/FaST-LMM/blob/master/doc/ipynb/FaST-LMM.ipynb Can you tell me where you found a broken link?
Also, this line in the notebook: pylab.errorbar(np.arange(mean.iid_count),mean.val,yerr=np.sqrt(np.diag(covariance.val)),fmt='.') should now be: pylab.errorbar(np.arange(mean.iid_count),mean.val[:,0],yerr=np.sqrt(np.diag(covariance.val)),fmt='.')
Please try to run the example in the notebook and let me know if you can. -- Carl
Hi Carl,
thanks for your answer and for implementing this useful feature. I just realized that nbviewer.org was blocked by our firewall, so sorry for the confusion. Everything works as expected.
I have one question: I want to run the model fit with 10 different test/train splits:
test_indices = np.random.choice(total_samples, 60, replace=False)
train_indices = np.array([i for i in range(total_samples) if i not in test_indices])
train = snp_reader[train_indices, :]
test = snp_reader[test_indices, :]
When I run this 10 times, some models predict very well, but for some models the predicted values are exactly the same for all samples. Do you have any idea why this is the case?
Thanks Stefanie
Can you tell me (or remind me) the size of your bed file (iid_count and sid_count)?
By the way, this is working for me (with seeds 0 to 10). (I switched to an explicit random number generator so that I can control the seed.)
from pysnptools.snpreader import Pheno, Bed
from fastlmm.inference import FastLMM
import numpy as np
import pylab
from fastlmm.util import example_file # Download and return local file name
# define file names
bed_fn = example_file('tests/datasets/synth/all.*','*.bed')
snp_reader = Bed(bed_fn, count_A1=True)
pheno_fn = example_file("tests/datasets/synth/pheno_10_causals.txt")
cov_fn = example_file("tests/datasets/synth/cov.txt")
seed = 0 # set to None to get a new seed each time
rng = np.random.default_rng(seed=seed)
test_indices = rng.choice(snp_reader.iid_count, 60, replace=False)
train_indices = list(set(range(snp_reader.iid_count))-set(test_indices))
train = snp_reader[train_indices, :]
test = snp_reader[test_indices, :]
# In the style of scikit-learn, create a predictor and then train it.
fastlmm = FastLMM(GB_goal=2)
fastlmm.fit(K0_train=train,X=cov_fn,y=pheno_fn)
# Now predict with it
mean, covariance = fastlmm.predict(K0_whole_test=test,X=cov_fn)
print("Predicted means and stdevs")
print(mean.val[:,0])
print(np.sqrt(np.diag(covariance.val)))
#Plot actual phenotype and predicted phenotype
whole_pheno = Pheno(pheno_fn)
actual_pheno = whole_pheno[whole_pheno.iid_to_index(mean.iid),:].read()
pylab.plot(actual_pheno.val,"r.")
pylab.plot(mean.val,"b.")
pylab.errorbar(np.arange(mean.iid_count),mean.val[:,0], yerr=np.sqrt(np.diag(covariance.val)),fmt='.')
pylab.xlabel('testing examples')
pylab.ylabel('phenotype, actual (red) and predicted (blue with stdev)')
pylab.show()
Hi Carl,
I have a dataset of 200 samples x 500,000 SNPs.
I still see this problem, and it seems to depend on the training subset. I am just wondering why some runs give exactly the same prediction for every sample while other runs give predictions that are closer to the correct phenotypes.
Thanks Stefanie
Yes, I think that sometimes it is just not getting enough training samples, so it is deciding to ignore the SNP information and always predict a middle, constant phenotype value. At the extreme, if you have 300 samples and you ask it to predict on 299, you would not expect it to predict well.
Two things to do:
from pysnptools.snpreader import Pheno, Bed
from fastlmm.inference import FastLMM
import numpy as np
from fastlmm.util import example_file
from sklearn.model_selection import KFold # install scikit-learn
import pylab
# Load data
bed_fn = example_file('tests/datasets/synth/all.*','*.bed')
snp_reader = Bed(bed_fn, count_A1=True)
pheno_fn = example_file("tests/datasets/synth/pheno_10_causals.txt")
pheno = Pheno(pheno_fn)
cov_fn = example_file("tests/datasets/synth/cov.txt")
# KFold setup
n_splits = 10
kf = KFold(n_splits=n_splits)
# Prepare to collect results
results = []
# Start cross-validation
for train_indices, test_indices in kf.split(range(snp_reader.iid_count)):
    # Split data into training and testing
    train = snp_reader[train_indices, :]
    test = snp_reader[test_indices, :]
    # Train the model
    fastlmm = FastLMM(GB_goal=2)
    fastlmm.fit(K0_train=train, X=cov_fn, y=pheno)
    # Test the model
    mean, covariance = fastlmm.predict(K0_whole_test=test, X=cov_fn)
    # Evaluate the model (e.g., calculating some metrics)
    actual_pheno = pheno[pheno.iid_to_index(mean.iid), :].read()
    # Here you might want to calculate some metric between actual_pheno.val and mean.val
    # and append it to the results list
    # For example: results.append(some_metric_function(actual_pheno.val, mean.val))
    # Optionally, plot results for each fold
    pylab.plot(actual_pheno.val, "r.")
    pylab.plot(mean.val, "b.")
    pylab.errorbar(np.arange(mean.iid_count), mean.val[:, 0], yerr=np.sqrt(np.diag(covariance.val)), fmt='.')
    pylab.xlabel('Testing examples')
    pylab.ylabel('Phenotype, actual (red) and predicted (blue with stdev)')
    pylab.show()
# After the loop, aggregate your results for overall performance metrics
# For example, mean_result = np.mean(results)
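As a concrete stand-in for the some_metric_function placeholder above (that name is just a placeholder, not part of the library), a minimal sketch using Pearson correlation as the per-fold metric:
import numpy as np
def pearson_metric(actual_val, predicted_val):
    # Both arrays have shape (n_test, 1); compare the single phenotype column.
    return np.corrcoef(actual_val[:, 0], predicted_val[:, 0])[0, 1]
# Inside the loop above:
# results.append(pearson_metric(actual_pheno.val, mean.val))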
Thanks Carl, I tried your dataset from the notebook and another dataset of mine, and I don't see this problem. So it seems to be specific to the first dataset.
Best regards Stefanie
Addendum: starting from a dataset which worked very well for genomic prediction, I created a set of random phenotypes and I see exactly the same behavior. It seems the model is not able to distinguish between the different observations and predicts approximately the average.
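For reference, a minimal sketch of how such a random phenotype can be built with PySnpTools (pheno_fn and the seeded generator are taken from the scripts above; constructing a SnpData this way is my assumption of a convenient approach, not the only one):
from pysnptools.snpreader import Pheno, SnpData
import numpy as np
rng = np.random.default_rng(0)
pheno_data = Pheno(pheno_fn).read()
# Permute the phenotype rows so any real SNP-phenotype signal is destroyed.
random_pheno = SnpData(iid=pheno_data.iid, sid=pheno_data.sid, val=rng.permutation(pheno_data.val))
# Then pass y=random_pheno to FastLMM.fit, as in the scripts above.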
Thanks again, Stefanie
That sounds right. - Carl
Hi Carl,
one more question: will the training data (K0_train and y) be standardized before training?
fastlmm.fit(K0_train=train, y=pheno)
Thanks Stefanie
Stefanie,
Good question!
Phenotypes are never standardized. (If you want to pre-standardize them with PySnpTools, let me know and I can show you how).
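For reference, a minimal sketch of pre-standardizing a phenotype with PySnpTools, assuming the Unit standardizer (mean 0, standard deviation 1 per column) is what you want:
from pysnptools.snpreader import Pheno
from pysnptools.standardizer import Unit
pheno_data = Pheno(pheno_fn).read()          # read the phenotype into memory
pheno_data = pheno_data.standardize(Unit())  # rescale each column to mean 0, std 1
# pheno_data can then be passed as y=pheno_data to FastLMM.fit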
By default, if K0_train is given as SNP data, it is standardized twice.
You can replace these defaults when you create the FastLMM predictor. (Also, you can give K0_train directly as a similarity matrix instead of as SNPs from which to create a similarity matrix.)
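If I read the defaults correctly (worth double-checking against the FaST-LMM documentation), the two standardizations are: each SNP is unit standardized, and the resulting kernel is rescaled so its diagonal averages 1.0. A minimal sketch of overriding them in the constructor, assuming the snp_standardizer and kernel_standardizer keyword arguments:
from pysnptools.standardizer import Beta, DiagKtoN
from fastlmm.inference import FastLMM
# The keyword names below are assumptions based on my reading of the FastLMM predictor API.
fastlmm = FastLMM(GB_goal=2,
                  snp_standardizer=Beta(1, 25),    # e.g. Beta weighting instead of Unit
                  kernel_standardizer=DiagKtoN())  # keep the default kernel rescaling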
-- Carl
p.s. I don't think you are using covariates, but if you do, by default, they are unit standardized but this, too, can be changed.
Thanks, Carl that makes sense. The two links are very helpful!
Stefanie
Fixed with v 0.6.8
Hi,
I want to try to predict new phenotypes with FaST-LMM, but the link to the notebook is dead. My data is in a format similar to the FaST-LMM GWAS examples:
bed = Bed(str(bed_file), count_A1=False)
pheno = Pheno(str(pheno_file))
How can I build a model and predict new phenotypes? Should I include the new phenotype in the pheno_file with NaN values?
Thanks Stefanie