AI-sandbox / XGMix

Training smoother error #12

Closed guidebortoli closed 3 years ago

guidebortoli commented 3 years ago

Hi, I'm getting an error when running XGMix and I'm trying to figure out what went wrong...

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Launching XGMix in train mode...
Reading sample maps and splitting in train/val...
path created: 0.8/generated_data/sample_maps
Running simulation...
Fast admix...
path created: 0.8/generated_data/chm22
path created: 0.8/generated_data/chm22/simulation_output
File read: 416330 SNPs for 183 individuals
path created: 0.8/generated_data/chm22/simulation_output/train1
Building founders
Simulating...
Simulating generation  2
Simulating generation  4
Simulating generation  6
Simulating generation  8
Simulating generation  12
Simulating generation  16
Simulating generation  24
Simulating generation  32
Simulating generation  48
Writing generation: 0
Writing generation: 2
Writing generation: 4
Writing generation: 6
Writing generation: 8
Writing generation: 12
Writing generation: 16
Writing generation: 24
Writing generation: 32
Writing generation: 48
path created: 0.8/generated_data/chm22/simulation_output/train2
Building founders
Simulating...
Simulating generation  2
Simulating generation  4
Simulating generation  6
Simulating generation  8
Simulating generation  12
Simulating generation  16
Simulating generation  24
Simulating generation  32
Simulating generation  48
Writing generation: 0
Writing generation: 2
Writing generation: 4
Writing generation: 6
Writing generation: 8
Writing generation: 12
Writing generation: 16
Writing generation: 24
Writing generation: 32
Writing generation: 48
path created: 0.8/generated_data/chm22/simulation_output/val
Building founders
Simulating...
Simulating generation  2
Simulating generation  4
Simulating generation  6
Simulating generation  8
Simulating generation  12
Simulating generation  16
Simulating generation  24
Simulating generation  32
Simulating generation  48
Writing generation: 0
Writing generation: 2
Writing generation: 4
Writing generation: 6
Writing generation: 8
Writing generation: 12
Writing generation: 16
Writing generation: 24
Writing generation: 32
Writing generation: 48
Simulation done.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Preprocessing data...
Initializing XGMix model and training...
Training base models...
Windows done: 3/3
Training smoother...
Traceback (most recent call last):
  File "../XGMIX.py", line 363, in <module>
    main(args, verbose=verbose, run_simulation=run_simulation, founders_ratios=founders_ratios,
  File "../XGMIX.py", line 261, in main
    model = train(args.chm, model_name, gen_map_df, data_path, generations,
  File "../XGMIX.py", line 158, in train
    model.train(X_train1, labels_window_train1, X_train2, labels_window_train2, X_val, labels_window_val, retrain_base=retrain_base, verbose=verbose)
  File "/Users/debortoli/Dropbox/Postdoc/brazil/hgdp/XGMix-master/Utils/XGMix.py", line 229, in train
    self._train_smooth(train[train_split_idx:], train_lab[train_split_idx:])
  File "/Users/debortoli/Dropbox/Postdoc/brazil/hgdp/XGMix-master/Utils/XGMix.py", line 149, in _train_smooth
    tt,ttl = self._get_smooth_data(train,train_lab)
  File "/Users/debortoli/Dropbox/Postdoc/brazil/hgdp/XGMix-master/Utils/XGMix.py", line 138, in _get_smooth_data
    windowed_data[ppl,win,:] = dat[win:win+self.sws].ravel()
ValueError: could not broadcast input array from shape (27,) into shape (225,)

Any idea what went wrong here? Thank you!

weekend37 commented 3 years ago

Hi!

Yes, I have an idea: this is most likely related to your genetic map or the window size parameter. I say that because the model divides the chromosome into only 3 windows even though the input looks fine (416330 SNPs).
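To illustrate how too few windows can lead to the broadcast error in the traceback, here is a minimal NumPy sketch. The values `sws = 75`, `n_classes = 3` and `n_rows = 9` are assumptions picked only because they reproduce the shapes (27,) and (225,) from the error message; they are not read from XGMix's configuration or code.

```python
import numpy as np

# Hypothetical values chosen only to reproduce the shapes in the traceback;
# they are not taken from XGMix itself.
sws = 75        # assumed smoother window size (rows expected per slice)
n_classes = 3   # assumed number of ancestry classes
n_rows = 9      # assumed rows actually available (far fewer than sws)

dat = np.zeros((n_rows, n_classes))
windowed_data = np.zeros((1, 1, sws * n_classes))

try:
    # Slicing past the end of an array silently truncates, so this slice
    # has 9 rows instead of 75 and flattens to 27 values rather than 225.
    windowed_data[0, 0, :] = dat[0:0 + sws].ravel()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The point is that the slice is shorter than the smoother expects whenever the chromosome yields fewer rows than the smoother window, which is what a very small window count would cause.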

Did you modify the config.py file at all (specifically the window size parameter)? If not, the genetic map seems to indicate that the chromosome is much shorter in centiMorgans (by about a factor of 100) than it really is.
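As a rough illustration of why a map that is 100 times too short leaves only a handful of windows, the arithmetic can be sketched as below. The window size of 0.2 cM and the 70 cM span for chromosome 22 are assumed, order-of-magnitude figures, and `estimate_window_count` is a hypothetical helper, not XGMix's actual windowing logic.

```python
import math

def estimate_window_count(map_span_cm: float, window_size_cm: float) -> int:
    """Rough number of windows if a chromosome is cut into fixed-width cM windows."""
    return max(1, math.ceil(map_span_cm / window_size_cm))

window_size_cm = 0.2                       # assumed window size, not read from config.py
plausible_span_cm = 70.0                   # assumed order of magnitude for chromosome 22
scaled_span_cm = plausible_span_cm / 100   # same map with positions 100x too small

full_count = estimate_window_count(plausible_span_cm, window_size_cm)
scaled_count = estimate_window_count(scaled_span_cm, window_size_cm)

print(full_count, scaled_count)
```

Under these assumptions the correctly scaled map gives hundreds of windows, while the wrongly scaled one gives only a few, which is the same order as the 3 windows reported in the log.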

guidebortoli commented 3 years ago

Thanks for the prompt reply!

I haven't changed the config file. Regarding the genetic map file: does it need to contain all the markers from the reference file too, or only the ones from the query (the admixed VCF for which I want the LAI)? I ask because the query VCF file has around 166206 markers, while the reference VCF file has the 416330 SNPs that you pointed out...

Thanks!

weekend37 commented 3 years ago

Hi Guidebortoli, and sorry for the late reply.

The genetic map doesn't need to include any specific marker per se, as we use interpolation to infer each marker's position from the genetic map file. This is only done at the end, though, when the inference is written out to a file.
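A minimal sketch of that kind of interpolation, using `numpy.interp` for piecewise-linear interpolation. The map and SNP positions are made-up values, and XGMix may use a different interpolation routine internally; this only shows why the map does not need to list every SNP.

```python
import numpy as np

# Hypothetical sparse genetic map: physical position (bp) and genetic position (cM).
map_pos_bp = np.array([1_000_000, 2_000_000, 4_000_000])
map_pos_cm = np.array([0.0, 1.5, 3.5])

# Hypothetical SNP positions; the first two do not appear in the map at all.
snp_pos_bp = np.array([1_500_000, 3_000_000, 4_000_000])

# np.interp interpolates linearly between map entries;
# map_pos_bp must be sorted in increasing order.
snp_pos_cm = np.interp(snp_pos_bp, map_pos_bp, map_pos_cm)

print(snp_pos_cm)
```

Each SNP gets a genetic position from the two map entries that bracket it, so only the overall scale of the map matters, not which markers it contains.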

What I'm worried about is that the genetic map file indicates that the chromosome is too short, i.e. the last marker is said to be at X centiMorgans where it should actually be at X*100 centiMorgans. However, I can't tell for sure from the information above.

You do not have to worry about the query file at this point, as it is read, processed and fed to the trained model later in the program (after the error you encountered). But to reiterate what I said above: no, the genetic map doesn't need to include the specific SNPs found in the query file.
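One way to check for this kind of scale problem is to compare the map's genetic span with its physical span. The sketch below relies on the common rule of thumb that the human genome averages on the order of 1 cM per Mb; the positions, the 0.1 threshold and both helper functions are hypothetical and only meant as a sanity check.

```python
def cm_per_mb(first_bp: int, last_bp: int, first_cm: float, last_cm: float) -> float:
    """Average recombination rate implied by the first and last map entries."""
    span_mb = (last_bp - first_bp) / 1e6
    return (last_cm - first_cm) / span_mb

def looks_100x_too_short(rate_cm_per_mb: float, threshold: float = 0.1) -> bool:
    """Flag maps whose average rate is far below ~1 cM/Mb (threshold is an assumption)."""
    return rate_cm_per_mb < threshold

# Hypothetical first and last map entries for one chromosome.
first_bp, last_bp = 16_000_000, 51_000_000

rate_ok = cm_per_mb(first_bp, last_bp, 0.0, 70.0)    # positions in centiMorgans
rate_bad = cm_per_mb(first_bp, last_bp, 0.0, 0.70)   # same map divided by 100

print(rate_ok, looks_100x_too_short(rate_ok))
print(rate_bad, looks_100x_too_short(rate_bad))
```

A rate far below the expected order of magnitude suggests the positions are in Morgans, or were divided by 100 somewhere, rather than in centiMorgans.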

guidebortoli commented 3 years ago

Hi @weekend37... Following your remarks about the centiMorgan scale, I checked the script I was using to generate a genetic map for my SNP file (it builds on another genetic map with a few markers) and noticed that it was indeed dividing the genetic position by 100...

After I took care of it, the program ran smoothly with no errors. Thanks again!

weekend37 commented 3 years ago

Glad to hear it! Marking as closed.