Imageomics / dna-trait-analysis

Goal: to find associations between DNA data and visual traits.
MIT License

Read through the code and give questions #6

Closed: DavidCarlyn closed this issue 4 months ago

DavidCarlyn commented 4 months ago
liu9756 commented 4 months ago

Training: For the training section in src/gtp/train_whole_genome.py, I have a couple of questions. First, in the step where you initialize the variables that accumulate the Root Mean Square Error (RMSE) for the current epoch, I see that you use Pearson's correlation coefficient to measure accuracy, and I'd like to ask how you dealt with the potential influence of outliers in the data. Do we pre-filter the genetic data before training the model, or do we keep all the genes for training to preserve the integrity of the data?
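
To make the question concrete, here is a minimal sketch of the kind of per-epoch RMSE/Pearson accumulation being asked about; the function and variable names are placeholders, not the repo's actual code:

```python
import numpy as np
import torch

# Hedged sketch of per-epoch metric accumulation; `model` and `dataloader`
# are placeholders, not the repo's actual objects.
@torch.no_grad()
def evaluate_epoch(model, dataloader):
    preds, targets = [], []
    for genotypes, phenotypes in dataloader:
        preds.append(model(genotypes).cpu().numpy().ravel())
        targets.append(phenotypes.cpu().numpy().ravel())
    preds, targets = np.concatenate(preds), np.concatenate(targets)

    rmse = float(np.sqrt(np.mean((preds - targets) ** 2)))
    # Pearson's r is sensitive to outliers: one extreme (prediction, target)
    # pair can dominate the covariance term, which motivates the question above.
    pearson_r = float(np.corrcoef(preds, targets)[0, 1])
    return rmse, pearson_r
```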

Evaluation: In the evaluation section, I noticed that three functions are used: get_shapley_sampling_attr, get_guided_gradcam_attr, and get_saliency_attr. Based on my understanding, get_shapley_sampling_attr uses the Shapley Value Sampling method to estimate the impact of each input feature on the model: it iterates through each batch in the DataLoader, computes the Shapley value of each feature, and sums them to get the total impact. The get_guided_gradcam_attr function uses the Guided Grad-CAM method to measure the model's sensitivity to each pixel in the input: it iterates through each batch in the DataLoader, computes gradient information for each pixel, and sums it to get the total impact. The get_saliency_attr function uses the Saliency method to evaluate pixel impact as well. My question: since get_guided_gradcam_attr and get_saliency_attr are both used to evaluate pixels, do they have different evaluation dimensions? Do their evaluation results show some kind of linear correlation?
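
For context, all three methods are available in Captum; below is a minimal, self-contained sketch of how they are typically invoked. The TinyNet model and the input shapes are invented purely for illustration and are not the repo's actual architecture:

```python
import torch
import torch.nn as nn
from captum.attr import ShapleyValueSampling, GuidedGradCam, Saliency

# Toy stand-in model, invented so the three attribution calls below can run.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))

    def forward(self, x):
        return self.head(torch.relu(self.conv(x)))

model = TinyNet().eval()
inputs = torch.randn(4, 1, 10, 40, requires_grad=True)  # dummy input windows

# Perturbation-based: samples feature permutations, averages marginal contributions.
shap_attr = ShapleyValueSampling(model).attribute(inputs, target=0, n_samples=5)

# Gradient-based: Grad-CAM on a chosen conv layer, refined with guided backprop.
ggc_attr = GuidedGradCam(model, model.conv).attribute(inputs, target=0)

# Gradient-based: absolute gradient of the target output w.r.t. each input element.
sal_attr = Saliency(model).attribute(inputs, target=0)
```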

Process: Is the convert_bytes(num) function used to standardize the format? And what is the purpose of the data stored in 'futures'? It seems to me that 'futures' stores the results of the tasks submitted to the ThreadPoolExecutor, so is it possible that data not stored in 'futures' has no results to be processed?
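
For reference, helpers with this name conventionally format a raw byte count into a human-readable string; the following is a hedged guess at convert_bytes's intent, not the repo's actual implementation:

```python
def convert_bytes(num):
    """Format a raw byte count as a human-readable string (assumed behavior)."""
    for unit in ["bytes", "KB", "MB", "GB", "TB"]:
        if num < 1024.0:
            return f"{num:.1f} {unit}"
        num /= 1024.0
    return f"{num:.1f} PB"
```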


@DavidCarlyn

kanishkkov commented 4 months ago

[training]: A question I have about the training code concerns model saving. When does the model save on loss, and when would it save on Pearson correlation? Also, do you think it would be helpful to learn about the structure of the SoyBeanNet model and how it works?
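
As a point of reference for this question, here is a schematic sketch of the two checkpointing criteria; the names are illustrative and the repo's actual saving logic may differ:

```python
# Schematic only: two alternative "best epoch so far" criteria for checkpointing.
def is_best(val_loss, val_pearson, best_loss, best_pearson, criterion="pearson"):
    if criterion == "loss":            # save on the lowest validation loss
        return val_loss < best_loss
    return val_pearson > best_pearson  # save on the highest Pearson correlation

# Inside a hypothetical training loop:
# if is_best(val_loss, val_r, best_loss, best_r):
#     best_loss, best_r = min(best_loss, val_loss), max(best_r, val_r)
#     torch.save(model.state_dict(), "best_model.pt")
```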

[evaluation]: The whole idea of the evaluation code is something I am not experienced with, so I would like to know if my interpretation of the code is correct. The "get_attribution_points" function uses occlusion to compute attributions: certain parts of the input data are masked to see how the model's output changes, revealing which parts of the input are most important. The "get_shapley_sampling_attr" function uses Shapley value sampling; I am unfamiliar with this and would like to know how the sampling works. The Guided Grad-CAM method is used in both "get_guided_gradcam_attr" and "get_guided_gradcam_attr_test"; what is the difference between these two functions? The "get_saliency_attr" function computes attributions using the saliency method; again, I am not very familiar with how this method works. Also, what does the attribution graph look like?
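
To show what occlusion does mechanically, here is a minimal Captum sketch; the toy model, input shape, and window sizes are invented for illustration:

```python
import torch
import torch.nn as nn
from captum.attr import Occlusion

# Invented toy model and input shape, purely to show the occlusion mechanics.
model = nn.Sequential(nn.Flatten(), nn.Linear(400, 1)).eval()
inputs = torch.randn(4, 1, 10, 40)

occ = Occlusion(model)
# Slide a zero-filled window over the input; a region's attribution is the
# change in the model's output when that region is masked out.
attr = occ.attribute(
    inputs,
    sliding_window_shapes=(1, 2, 5),  # window over (channel, height, width)
    strides=(1, 2, 5),
    target=0,
    baselines=0.0,
)
```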

[preprocessing]: I do not have too many questions, but am I right to assume that the code in run_pipeline.ipynb processes phenotype data while run_pipeline.ipynb processes genotype data?

@DavidCarlyn

DavidCarlyn commented 4 months ago

> Training: For the training section in src/gtp/train_whole_genome.py, I have a couple of questions. First, in the step where you initialize the variables that accumulate the Root Mean Square Error (RMSE) for the current epoch, I see that you use Pearson's correlation coefficient to measure accuracy, and I'd like to ask how you dealt with the potential influence of outliers in the data. Do we pre-filter the genetic data before training the model, or do we keep all the genes for training to preserve the integrity of the data?
>
> Evaluation: In the evaluation section, I noticed that three functions are used: get_shapley_sampling_attr, get_guided_gradcam_attr, and get_saliency_attr. Based on my understanding, get_shapley_sampling_attr uses the Shapley Value Sampling method to estimate the impact of each input feature on the model: it iterates through each batch in the DataLoader, computes the Shapley value of each feature, and sums them to get the total impact. The get_guided_gradcam_attr function uses the Guided Grad-CAM method to measure the model's sensitivity to each pixel in the input: it iterates through each batch in the DataLoader, computes gradient information for each pixel, and sums it to get the total impact. The get_saliency_attr function uses the Saliency method to evaluate pixel impact as well. My question: since get_guided_gradcam_attr and get_saliency_attr are both used to evaluate pixels, do they have different evaluation dimensions? Do their evaluation results show some kind of linear correlation?
>
> Process: Is the convert_bytes(num) function used to standardize the format? And what is the purpose of the data stored in 'futures'? It seems to me that 'futures' stores the results of the tasks submitted to the ThreadPoolExecutor, so is it possible that data not stored in 'futures' has no results to be processed?

Great questions!

  1. All data is kept; there is no filtering of the data. It may be something to discuss in the future, but currently we don't filter.
  2. As for the differences between the saliency methods, they vary in where in the model they capture the signal and how they aggregate it. I encourage you to read more about them here: https://captum.ai/api/attribution.html
  3. Due to the size of the data, I implemented a multithreaded approach to preprocessing. Futures are a way of saying "I will return a value eventually." Since I'm launching multiple instances of the same code, they won't all be ready when I initially launch them (see the sketch below). I may not have understood your question, so feel free to ping me if you would like more information.
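
A minimal, self-contained illustration of the Future pattern described in point 3; the preprocess_chunk task is hypothetical, not the repo's actual work unit:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for one preprocessing task.
def preprocess_chunk(chunk_id):
    return f"chunk {chunk_id} done"

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns immediately with a Future: a handle to a result that
    # "will exist eventually," which is why the results are collected later.
    futures = [pool.submit(preprocess_chunk, i) for i in range(10)]

    # Gather results as the workers finish, in completion order.
    for fut in as_completed(futures):
        print(fut.result())
```
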
DavidCarlyn commented 4 months ago

> [training]: A question I have about the training code concerns model saving. When does the model save on loss, and when would it save on Pearson correlation? Also, do you think it would be helpful to learn about the structure of the SoyBeanNet model and how it works?
>
> [evaluation]: The whole idea of the evaluation code is something I am not experienced with, so I would like to know if my interpretation of the code is correct. The "get_attribution_points" function uses occlusion to compute attributions: certain parts of the input data are masked to see how the model's output changes, revealing which parts of the input are most important. The "get_shapley_sampling_attr" function uses Shapley value sampling; I am unfamiliar with this and would like to know how the sampling works. The Guided Grad-CAM method is used in both "get_guided_gradcam_attr" and "get_guided_gradcam_attr_test"; what is the difference between these two functions? The "get_saliency_attr" function computes attributions using the saliency method; again, I am not very familiar with how this method works. Also, what does the attribution graph look like?
>
> [preprocessing]: I do not have too many questions, but am I right to assume that the code in run_pipeline.ipynb processes phenotype data while run_pipeline.ipynb processes genotype data?
>
> @DavidCarlyn

Great questions!

  1. The model was originally saved via the lowest loss, but I switched to the Pearson correlation coefficient since I believed it was a better signal than the loss. More about SoyBeanNet can be seen in the code or in this paper: https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.01091/full
  2. Many of the differences can be found at the Captum API docs: https://captum.ai/api/attribution.html. Some of my differences come from how I aggregate the values across samples, along with other small choices such as taking the mean, median, max, or min across samples (see the sketch below). Attribution is commonly done either via perturbation (masking, adding noise, shuffling, etc.) while observing the change in the model's output/loss, or via saliency-based methods, which look at the model's activations or gradients when the input is passed through.
  3. I had intended run_pipeline.ipynb to do all the preprocessing before training and evaluation.
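
A hedged sketch of the cross-sample aggregation mentioned in point 2; the function name and input format are assumptions, not the repo's exact code:

```python
import torch

# Collapse per-sample attributions into summary profiles; the statistics
# mirror the description in point 2 (mean, median, max, min across samples).
def aggregate_attributions(per_sample_attrs):
    stacked = torch.stack(per_sample_attrs)  # (n_samples, *attribution_shape)
    return {
        "mean": stacked.mean(dim=0),
        "median": stacked.median(dim=0).values,
        "max": stacked.max(dim=0).values,
        "min": stacked.min(dim=0).values,
    }
```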