Sum02dean / MLG

Machine Learning in Genomics Course ETH
MIT License

21 check distribution of gevs #27

Closed LiineKasak closed 2 years ago

LiineKasak commented 2 years ago

I don't understand which data you were comparing, or why. The val and train sets you defined differ only in which chromosomes they contain, i.e. sets A and B: A = {chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr15, chr16, chr17, chr18, chr20, chr21, chr22} B = {chr14, chr19}

If you want to compare the expression between chromosomes, you could use the data loading method load_genes_by_chr to compare each of the 22 chromosomes.

If instead you want to compare expression between the two cell lines (which you seemed to want to do, based on your comments), then you need to compare the X1 train+val and X2 train+val expression values. You can do this with the function I added, load_train_genes_for_cell_line, for cell lines 1 and 2.

Sum02dean commented 2 years ago

Neither of your suggestions was my intention. I want to compare all GEX values between the train and validation sets.

This means concatenating train data from X1 and X2 cell lines as one set, and concatenating the validation data from X1 and X2 as the other set.

The intention is to observe, over all GEX values, whether a significant shift in the distribution exists. Not between chromosomes or between cell lines, but across them.

Of course, we can also subset using the stratifications you suggest too...

An issue that commonly arises in ML is out-of-distribution (OOD) validation sets, which result in some type of training bias.

It is important to see if the validation sets and training sets fall within the same distribution space.

Simply put, I am comparing distribution shifts in the labels, globally across train-test strata. Not between chromosomes or cell lines.
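One way to put a number on the shift described above is a two-sample Kolmogorov-Smirnov statistic between the pooled train labels and the pooled validation labels. The sketch below is not the PR's plot_distributions function; it is a minimal pure-Python illustration, and the variable names and toy values are made up for the example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    n, m = len(a_sorted), len(b_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / n
        cdf_b = bisect_right(b_sorted, x) / m
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Toy GEX values (hypothetical), one list per cell line and split.
x1_train, x2_train = [0.0, 1.5, 3.2, 0.4], [0.1, 2.0, 2.8, 0.0]
x1_val, x2_val = [0.3, 1.1], [0.0, 2.5]

# Pool across cell lines, then compare train against validation.
train_gex = x1_train + x2_train
val_gex = x1_val + x2_val
shift = ks_statistic(train_gex, val_gex)
```

A value near 0 means the two label distributions look alike; a value near 1 means they barely overlap. If SciPy is available, scipy.stats.ks_2samp also returns a p-value.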

Sum02dean commented 2 years ago

To compare distributions across chromosome splits or cell-line splits, simply call the function on those specific data partitions. It is a generalized function.
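As a rough illustration of calling one generalized routine on different partitions, the sketch below groups records by an arbitrary key and summarizes the GEX values per group. The record layout (keys "chr", "cell_line", "gex") and the function name summarize_by are assumptions made for the example, not the repository's actual data format or API.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by(records, key):
    """Group records by `key` and summarize the GEX values in each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["gex"])
    return {
        name: {"n": len(values), "mean": mean(values), "median": median(values)}
        for name, values in groups.items()
    }


# Toy records (hypothetical values).
records = [
    {"chr": "chr2", "cell_line": "X1", "gex": 1.0},
    {"chr": "chr2", "cell_line": "X2", "gex": 3.0},
    {"chr": "chr14", "cell_line": "X1", "gex": 2.0},
    {"chr": "chr14", "cell_line": "X2", "gex": 5.0},
]

by_cell_line = summarize_by(records, "cell_line")  # cell-line partitions
by_chromosome = summarize_by(records, "chr")       # chromosome partitions
```

The same function serves both stratifications; only the grouping key changes.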

Sum02dean commented 2 years ago

I think we should reject this PR. I need to decide whether I want the func in a NB or in a .py file. In the latter case, I need to add __init__.py files for loading modules, and remove the func body from the NB.

I would also like to add the other stratifications that Liine mentioned, to cover chromosomal, cell-line and global distributions, and then expand the code within the plotting_utils dir.

ToDo:

These distribution analyses will help inform on the Train-Val splits and help understand the data more.

LiineKasak commented 2 years ago

Ah okay. What disturbs me is that, in my eyes, the train-valid split they made is a random one, in the sense that it doesn't matter which chromosomes you split it by. If you want a general comparison across labels without splitting by chromosome, then you should randomly split the labels after pooling and shuffling the train and valid data. Right now the differences might come from the chromosomes they chose for the train and valid sets, not from random GEX values.
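The random re-split suggested here could look roughly like the sketch below: pool the existing train and valid labels, shuffle, and cut a new validation set of the same size as the original one. The function name random_resplit and the toy values are hypothetical.

```python
import random


def random_resplit(train_values, val_values, seed=0):
    """Pool both sets, shuffle, and re-split at the original validation size."""
    pooled = list(train_values) + list(val_values)
    n_val = len(val_values)
    rng = random.Random(seed)  # local generator, so results are reproducible
    rng.shuffle(pooled)
    new_val, new_train = pooled[:n_val], pooled[n_val:]
    return new_train, new_val


# Toy labels (hypothetical values).
new_train, new_val = random_resplit([1, 2, 3, 4], [5, 6])
```

Comparing the chromosome-based split against a few such random re-splits gives a baseline for how much difference arises by chance alone.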

And in the end the validation set they chose probably isn't going to be ours anyway, as k-fold cross-validation is necessary imo.

No need to reject this PR, just improve it to cover more cases. I propose:

  1. totally random division of labels, not by any predefined splits. So instead of using the train and valid sets, which are split by chr, concat the two sets and then make a random division, I guess?
  2. different splits by chromosomes
  3. split by cell line

And maybe these will help us decide if stratified sampling should be used somehow, but the k fold validation should be split by chromosomes anyways as said by multiple papers.
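A k-fold split by chromosome, as mentioned above, means every chromosome is held out in exactly one fold and never appears in both train and validation of the same fold. The sketch below is one simple way to build such folds; the function name chromosome_folds and the round-robin assignment are assumptions made for the example, not something taken from the course material or the papers.

```python
def chromosome_folds(chromosomes, k):
    """Return k (train_chromosomes, val_chromosomes) pairs, split by chromosome."""
    unique = list(dict.fromkeys(chromosomes))  # de-duplicate, keep input order
    if not 1 < k <= len(unique):
        raise ValueError("k must be between 2 and the number of chromosomes")
    # Round-robin assignment: fold i holds out every k-th chromosome from i.
    held_out = [unique[i::k] for i in range(k)]
    return [([c for c in unique if c not in val], val) for val in held_out]


chroms = [f"chr{i}" for i in range(1, 23)]
folds = chromosome_folds(chroms, k=5)
```

Round-robin assignment does not balance the number of genes per fold, since chromosomes differ in size. If scikit-learn is in use, GroupKFold with the chromosome as the group label is a ready-made alternative.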

What did you mean by init files for module loading? Why is there a need for them when we can just import from the util file?

Sum02dean commented 2 years ago

> What did you mean by init files for module loading? Why is there a need for them when we can just import from the util file?

I wanted to import a function from a different directory. But I realised that I don't need to add __init__.py; I can just add the directory to the path at the top of the NB file with:

import sys
sys.path.append('../')

Then call

from plotting_utils.plot_distributions import plot_distributions
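A slightly more defensive variant of the same idea is sketched below: it resolves the directory to an absolute path and only appends it once, so re-running the notebook cell does not keep growing sys.path. The helper name add_to_sys_path is hypothetical, and the sketch assumes the notebook's working directory sits one level below the directory that contains plotting_utils.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Append the resolved directory to sys.path unless it is already there."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Relative to the current working directory (the notebook's folder, by assumption).
repo_root = add_to_sys_path("..")

# With the parent directory on the path, the import from the thread should work:
# from plotting_utils.plot_distributions import plot_distributions
```

The import is left commented out because it only succeeds inside the repository layout discussed in this thread.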