arnor-sigurdsson / EIR

A toolkit for training deep learning models on genotype, tabular, sequence, image, array and binary data.
https://eir.readthedocs.io/
GNU Affero General Public License v3.0

Replication of results for T2D #58

Closed – Zeming-LI-Andy closed this issue 1 year ago

Zeming-LI-Andy commented 1 year ago

Hello Author,

I attempted to replicate the results of the experiment for T2D, but unfortunately, I was not successful in doing so.

I have some doubts about the input data encoding. Could you please help me verify whether my understanding is correct? Here is how I understand the one-hot encoding of the genotypes: 0 (0/0) → 1000; 1 (1/0 or 0/1) → 0100; 2 (1/1) → 0010; missing (./.) → 0001.
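To make this concrete, here is the scheme written out as a per-genotype mapping (this is just my reading, shown as YAML for readability):

```yaml
# My understanding of the 4-bit one-hot encoding per SNP:
"0/0": [1, 0, 0, 0]          # homozygous reference (0)
"0/1 or 1/0": [0, 1, 0, 0]   # heterozygous (1)
"1/1": [0, 0, 1, 0]          # homozygous alternate (2)
"./.": [0, 0, 0, 1]          # missing
```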

arnor-sigurdsson commented 1 year ago

Hi Zeming-LI-Andy!

Thank you for checking out EIR and getting in touch.

Your understanding of the one-hot encoding scheme seems correct. I'd be happy to help clarify any issues, but first, I need a bit more information:

  1. Are you using EIR, or are you using your own implementation?
  2. Could you please provide more details about the issue you're experiencing? Any error messages or more detailed information would be helpful.

Best, Arnór

Zeming-LI-Andy commented 1 year ago

Thanks for your response.

Yes, I am using the EIR library, with the recommended parameters from the official website.

input_2d_lcl.yaml:

```yaml
input_info:
  input_source: /tmp/cszmli/LCL_data/ukbb_t2d_arrays_2d
  input_name: chromosome_as_array
  input_type: array
model_config:
  model_type: lcl
  model_init_config:
    kernel_width: 8
    first_kernel_expansion: 1
```

globals.yaml:

```yaml
checkpoint_interval: 10
sample_interval: 2
n_epochs: 50
memory_dataset: True
device: cuda:0
early_stopping_patience: 20
valid_size: 0.3
```

[Figure: Manhattan plot of the UKBB-T2D data]

This figure is the Manhattan plot of the UKBB-T2D data that I used for model training. Based on the plot, I believe there are no issues with my data.

arnor-sigurdsson commented 1 year ago

Great, thank you for the information! I agree that, based on the plot, there do not seem to be any glaring issues with your data.

For the configurations, there are a couple of things:

  1. Arrays as input: I can see that we are using the input_type: array functionality here, which, as the name suggests, is for structured arrays. I assume you are using the configuration from this tutorial, where the array input functionality is shown for the Human Origins genotype data. I added the array input functionality only recently and have not tested it on any “real” cohorts such as the UKBB. As mentioned briefly in the tutorial, given an array shape of (4, n_SNPs), the LCL array functionality with kernel_width: 8 will not cover the first 2 SNPs (i.e., 2 complete one-hot columns of 4 values each); because the array is flattened row by row, it instead covers the first 8 positions of the first row. I have not tested this latter approach, and I am uncertain whether it works well outside toy examples such as the one in the tutorial.

  2. LCL and GLN: Based on the previous point, I would recommend using the genotype-specific functionality in EIR for UKBB-scale data. I have attached some example configurations that I have found to work well, though some tuning might be needed (note that this assumes the arrays are of shape (4, n_SNPs), i.e., matching the output of running plink-pipelines); a rough sketch is also included below. I might add suggested configurations like these to the tutorials. It can be a bit annoying to have to write such an extensive configuration for simple experiments, which is why I started working on EIR-auto-GP to make this a bit easier for people.

  3. Configuration: I can see in your configuration that you have set checkpoint_interval and sample_interval to 10 and 2, respectively. Note that these flags count iterations (i.e., mini-batches), not epochs, so 10 and 2 are quite low, especially for UKBB-scale data. I would recommend something between 200 and 1000. Kindly see the attached configurations for an example.

Here are the configuration files. Train using:

```
eirtrain \
  --global_configs globals.yaml \
  --input_configs input_gln.yaml input_tabular.yaml \
  --fusion_configs fusion.yaml \
  --output_configs output.yaml
```

Of course, you need to fill in the relevant information for your experiments, system, etc.
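For reference, a rough sketch of what the genotype input configuration might look like (this is not the attached file; the paths are placeholders, and the tuning values below are tutorial-style assumptions that may need adjusting):

```yaml
# input_gln.yaml – illustrative sketch, NOT the attached configuration
input_info:
  input_source: /path/to/ukbb_t2d_genotype_arrays  # one-hot (4, n_SNPs) arrays, e.g. from plink-pipelines
  input_name: genotype
  input_type: omics
input_type_info:
  snp_file: /path/to/data.bim   # assumption: SNP metadata produced alongside the arrays
model_config:
  model_type: genome-local-net
  model_init_config:
    kernel_width: 16        # assumption: illustrative value, likely needs tuning
    channel_exp_base: 2     # assumption: illustrative value
```

The globals would then follow point 3 above, e.g. checkpoint_interval: 500 and sample_interval: 500.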

I hope this nudges us in the right direction – please let me know if you run into other issues / have any more questions.

Zeming-LI-Andy commented 1 year ago

OK, thanks for your suggestions; I have reset the parameters.

[Screenshot: updated training run]

The reason I set a small checkpoint_interval and sample_interval is that training tends to trigger early stopping quickly.

[Screenshot: validation results]

According to the validation results, we can see that the validation AUROC is still poor.

arnor-sigurdsson commented 1 year ago

Thank you for the update. Indeed, the network does not seem to be learning much of anything, as seen by the loss hovering around 0.694 (≈ ln 2, i.e., the binary cross-entropy of always predicting 50/50) and, as you pointed out, the poor ROC-AUC. In my experience, this can be due to some parameters being misconfigured; training can be sensitive to the parameters, particularly with large genotype inputs, so the values specified in the tutorials might not transfer directly to, e.g., the UKBB.

For example, we probably need to lower the default learning rate of 0.001 to, e.g., 0.0001. I can see that you are now using the genotype input, which is good. However, we most likely also need to apply some of the other configurations I previously outlined here. Could you please try with the configurations in the linked config.zip file and see if that makes a difference? Thank you.
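As a minimal sketch of that change in the global configuration (assuming the global lr parameter, which as far as I recall is what EIR uses for the learning rate):

```yaml
# globals.yaml – lowering the learning rate as suggested above
lr: 0.0001   # down from the 0.001 default
```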

Zeming-LI-Andy commented 1 year ago

Thank you for providing your configurations. I ran a test based on your settings.

Here are the results.

[Screenshots: training and validation loss curves]

We observed that the training loss decreased rapidly at iteration 16, but the validation loss remained stable initially and increased in the later iterations. The highest AUROC achieved on the validation set was only 0.54.

I then attempted to select strong SNPs (p-value < 0.05) to train the GLN model, and achieved a best validation AUROC of 0.888.

arnor-sigurdsson commented 1 year ago

Performance Divergence

This definitely looks a little better, thank you for the update. However, it is interesting that the training and validation losses diverge so quickly, clearly by around iteration 2000. Assuming a batch size of 64, 64 × 2000 = 128,000 samples have gone through the network, which is not close to all the samples in the UKBB. One question: how many samples from the UKBB are you using – the full set of ~480K samples? Note that you might want to reduce the 0.3 validation set size you had earlier if you have all the samples.

Related to the divergence between training and validation, one factor might be the weighted sampling: during training we have a more or less equal number of each class per batch, whereas validation keeps the original, imbalanced distribution. If you plan to keep using the full set, I would suggest training for a bit longer – you can either increase the sample_interval and checkpoint_interval parameters or increase the early_stopping_patience parameter in the global configuration.
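As a sketch of those options (values are illustrative; as far as I recall, early_stopping_patience is counted in validation checks, i.e., multiples of sample_interval, and weighted_sampling_columns is the flag behind the balanced training batches – the T2D column name here is hypothetical):

```yaml
# globals.yaml – training-length and sampling options discussed above
sample_interval: 1000           # validate less often, so patience spans more iterations
checkpoint_interval: 1000
early_stopping_patience: 32     # assumption: number of validation checks without improvement
weighted_sampling_columns:
  - T2D                         # hypothetical target column; balances classes per training batch
valid_size: 0.1                 # assumption: a smaller fraction if using the full cohort
```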

Below are example training curves from runs I did earlier, comparing single-task and multi-task performance on T2D. Indeed, 0.54 is quite low. One factor could be whether covariates (e.g., age, sex, and genomic PCs) are included; they are in the runs below, and they can make a difference. Another observation is that the models in some cases reach maximum performance around iteration 20K.

[Screenshot: training curves comparing single-task and multi-task T2D runs]

My apologies for the manual tuning process here; as I mentioned earlier, I am hoping that the EIR-auto-GP (WIP) project will reduce this somewhat.

SNP filtering

I think it's a great approach to use filtering with p-values from, e.g., a GWAS to reduce the number of SNPs. One thing to be careful of is that if you apply the filtering based on statistics computed on the full dataset, it can lead to data leakage. Even if you restrict it to the training+validation set, the validation performance can be inflated, leading to a discrepancy between validation and test set performance. A viable approach is computing the statistics only on the training set and then manually specifying the train/validation split (using the manual_valid_ids_file flag; see here for more information).
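A minimal sketch of that setup (the path and file contents are hypothetical examples; manual_valid_ids_file is the flag mentioned above):

```yaml
# globals.yaml – pin the validation split so that GWAS/p-value filtering
# can be computed on the training IDs only, avoiding leakage
manual_valid_ids_file: /path/to/valid_ids.txt   # one sample ID per line
```

The workflow would then be: hold out the IDs in that file, run the GWAS/p-value filtering on the remaining (training) samples only, and train on the filtered SNP set with this fixed split.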

Zeming-LI-Andy commented 1 year ago

Sorry for not introducing my dataset clearly earlier.

I am using a subset of the UKBB, consisting of 4000+ samples, half of which are individuals with T2D and the other half are healthy individuals.

I also want to apologize for not using covariates in my analysis; I am only interested in testing the model's capacity using genetic data alone.

Zeming-LI-Andy commented 1 year ago

I have a question regarding Figure 2B in your paper, where you report an AUROC value for each of the eight diseases using a dataset of 413,736 individuals of British, Irish, or other Western European background. I was wondering whether the results shown in Figure 2B were obtained from a single-task analysis (e.g., predicting whether a patient has T2D)?

[Screenshot: Figure 2B from the paper]

arnor-sigurdsson commented 1 year ago

Ah, I see, no worries! The fact that you are working with “only” ~4000 individuals probably explains some of what we are seeing, specifically the training-curve patterns (with few samples, it is easier to overfit). It most likely also explains the ROC-AUC of around 0.54; more samples will likely give better performance, even when using only genotype data.

For the figure: yes, indeed, it is from a binary classification of T2D, with the models trained to predict only that.

Zeming-LI-Andy commented 1 year ago

May I ask how many samples were used to train the T2D binary classification task, and did the training data consist of both T2D patients and healthy individuals, similar to my dataset?

arnor-sigurdsson commented 1 year ago

Yes, for this all 413,736 individuals were used for training and validation; the case/control count depends on the disease in question (i.e., a balanced subsample was not used, simply all the samples). The case/control status was determined according to ICD10 codes (see the supplementary data). I think the latest UKBB data now has around 50K T2D-positive samples, but I might be wrong. So, a bit differently from your dataset, it does not use a subset and was not balanced w.r.t. case/control status.

Zeming-LI-Andy commented 1 year ago

Oh, I see, out of the 413,736 individuals, 50,000 samples were cases with T2D, and the remaining samples were controls, right?

arnor-sigurdsson commented 1 year ago

Yes, correct, I think – though at the time it was closer to 25K case samples (the numbers are updated over time).

Zeming-LI-Andy commented 1 year ago

Thank you so much for addressing my confusion.

arnor-sigurdsson commented 1 year ago

You are more than welcome – I hope it was useful! If there are no more questions at this point, feel free to close the issue.