Closed LiineKasak closed 2 years ago
Neither of your suggestions was my intention. I want to compare all GEX values between the train and validation sets.
This means concatenating train data from X1 and X2 cell lines as one set, and concatenating the validation data from X1 and X2 as the other set.
The intention is to observe, over all GEX values, whether a significant shift in their distribution exists: not between chromosomes or between cell lines, but across them.
Of course, we can also subset using the stratifications you suggest...
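A minimal sketch of that global comparison, assuming the GEX labels load as 1-D arrays. The arrays and the KS test here are illustrative stand-ins, not the repo's actual loading code:

```python
# Hedged sketch: global train-vs-validation GEX comparison after
# concatenating both cell lines per split. All arrays are random
# placeholders for the real loaded label vectors.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_x1 = rng.normal(5.0, 1.0, 10_000)  # stand-in: X1 train GEX
train_x2 = rng.normal(5.1, 1.0, 10_000)  # stand-in: X2 train GEX
val_x1 = rng.normal(5.0, 1.0, 2_000)     # stand-in: X1 validation GEX
val_x2 = rng.normal(5.1, 1.0, 2_000)     # stand-in: X2 validation GEX

# One pooled set per split, exactly as described above.
train = np.concatenate([train_x1, train_x2])
val = np.concatenate([val_x1, val_x2])

# Two-sample Kolmogorov-Smirnov test for a global distribution shift.
stat, pvalue = ks_2samp(train, val)
print(f"KS statistic={stat:.4f}, p={pvalue:.4f}")
```

A small KS statistic with a large p-value would suggest no global shift between the splits.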
An issue that commonly arises in ML is out-of-distribution (OOD) validation sets, which result in some form of training bias.
It is important to see whether the validation and training sets fall within the same distribution space.
Simply put: I am comparing distribution shifts in the labels, globally across train-test strata, not between chromosomes or cell lines.
To compare distributions across chromosome splits or cell-line splits, simply call the function on those specific data partitions. It is a generalized function.
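To illustrate the calling pattern only: `plot_distributions` below is a local stub, not the repo's `plotting_utils` implementation, and the partitions are random placeholders:

```python
# Hedged sketch of applying one generalized comparison function to
# different data partitions (global, per cell line, per chromosome).
import numpy as np

rng = np.random.default_rng(0)

def plot_distributions(train_vals, val_vals, title):
    # Stub: the real function would plot both distributions; here we just
    # print summary means to show how it is called per partition.
    print(title, float(np.mean(train_vals)), float(np.mean(val_vals)))

# Placeholder train/val label pairs keyed by stratum.
partitions = {s: (rng.normal(5.0, 1.0, 1_000), rng.normal(5.2, 1.0, 300))
              for s in ("global", "X1", "X2", "chr14", "chr19")}

# Same generalized function, different data partitions.
for stratum, (train_vals, val_vals) in partitions.items():
    plot_distributions(train_vals, val_vals, title=stratum)
```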
I think we should reject this PR. I need to decide whether I want the func in a NB or in a `.py` file. In the latter case, I need to add `__init__.py` files for module loading, and remove the func body from the NB.
I would also like to add the other stratifications that Liine mentioned, to include chromosomal, cell-line, and global distributions. Then expand the code within the `plotting_utils` dir.
ToDo:
These distribution analyses will help inform the train-val splits and help us understand the data better.
Ah okay. What bothers me is that, in my eyes, the train-valid split they made is effectively a random one, in the sense that it doesn't matter which chromosomes you split by. If you want a general comparison across labels without splitting by chromosomes, then you should randomly split the labels after shuffling the train and valid data together. Right now the differences might be between the chromosomes they chose for train and valid, not between random GEX values.
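That shuffle-then-resplit baseline could be sketched like this, with placeholder arrays rather than the real data:

```python
# Hedged sketch: pool train and valid labels, shuffle, and re-split at the
# original proportion, so any residual shift is pure sampling noise.
import numpy as np

rng = np.random.default_rng(42)
train_gex = rng.normal(5.0, 1.0, 8_000)  # stand-in: train GEX labels
valid_gex = rng.normal(5.3, 1.0, 2_000)  # stand-in: valid GEX labels

pooled = np.concatenate([train_gex, valid_gex])
rng.shuffle(pooled)
cut = len(train_gex)
rand_train, rand_valid = pooled[:cut], pooled[cut:]

# Compare this null baseline against the chromosome-based split: a shift
# much larger than the baseline's points at the chromosome choice itself.
print(rand_train.mean(), rand_valid.mean())
```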
And in the end, the validation set they chose probably isn't going to be ours anyway, since imo a k-fold cross-validation is necessary.
No need to reject this PR, just improve it to cover more cases. I propose: put it in a util `.py` file, with example usage under the `if __name__ == '__main__':` block of the same file, and file paths should always be relative to the file, not absolute. And maybe these analyses will help us decide whether stratified sampling should be used somehow, but the k-fold validation should be split by chromosomes anyway, as stated in multiple papers.
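The chromosome-wise k-fold split could be sketched with scikit-learn's `GroupKFold`, which keeps every gene of a chromosome in a single fold. Features, labels, and chromosome assignments below are random placeholders, not the real per-gene data:

```python
# Hedged sketch: chromosome-grouped k-fold CV, so no chromosome is shared
# between a fold's train and validation genes.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_genes = 200
X = rng.normal(size=(n_genes, 4))          # stand-in features
y = rng.normal(size=n_genes)               # stand-in GEX labels
chrom = rng.integers(1, 23, size=n_genes)  # chromosome ID per gene

gkf = GroupKFold(n_splits=5)
for fold, (tr, va) in enumerate(gkf.split(X, y, groups=chrom)):
    # Within a fold, train and validation chromosome sets are disjoint.
    assert not set(chrom[tr]) & set(chrom[va])
    print(fold, len(tr), len(va))
```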
What did you mean by init files for module loading? Why is there a need for them when we can just import from the util file?
I wanted to import a function from a different directory, but I realised that I don't need to add `__init__.py`.
I can just add the directory to the path at the top of the NB file:

```python
import sys
sys.path.append('../')
```

Then call:

```python
from plotting_utils.plot_distributions import plot_distributions
```
I don't understand what data you were comparing, and why... The difference between the val and train sets, as you defined them, is only that between the chromosomes of sets A and B: A = {chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr15, chr16, chr17, chr18, chr20, chr21, chr22}, B = {chr14, chr19}
If you want to compare the expression between chromosomes, you could use the data loading method `load_genes_by_chr` to compare each of the 22 chromosomes. If instead you want to compare the expression between the two cell lines (which you seemed to want to do, based on your comments), then you need to compare the X1 train+val and X2 train+val expression values. This you can do with my added function `load_train_genes_for_cell_line` for cell lines 1 and 2.
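For illustration, a hedged sketch of that X1-vs-X2 comparison. Note the loader here is a local stub with an assumed signature, since the real `load_train_genes_for_cell_line` is not shown in this thread:

```python
# Hedged sketch: compare GEX distributions between the two cell lines.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

def load_train_genes_for_cell_line(cell_line):
    # Stub standing in for the repo's loader of the same name; assumed to
    # return a 1-D array of GEX values for the given cell line.
    return rng.normal(5.0 + 0.2 * cell_line, 1.0, 5_000)

x1 = load_train_genes_for_cell_line(1)
x2 = load_train_genes_for_cell_line(2)

# Nonparametric two-sample test for a shift between the cell lines.
stat, p = mannwhitneyu(x1, x2)
print(f"U={stat:.0f}, p={p:.3g}")
```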