ArnovanHilten / GenNet

Framework for Interpretable Neural Networks
Apache License 2.0

Conversion from plink #84

Closed EpiSlim closed 2 years ago

EpiSlim commented 2 years ago

Hey Arno!

I am trying to use GenNet on a dataset that I converted from the plink format as follows:

python GenNet/GenNet.py convert -step all -g GenNet_data/plink -study_name plink -o GenNet_data/input

The conversion stage completes without errors, and the log reports the correct dimensions for my toy dataset (500 individuals and 60708 SNPs). However, when I proceed to the training stage, the following assertion fails: https://github.com/ArnovanHilten/GenNet/blob/fbed86365f37549505bbda13227e1a34a301327f/GenNet_utils/Create_network.py#L234

The assertion fails because inputsize evaluates to 500 (the number of individuals) while mask_shapes_x[0] evaluates to 60708 (the number of SNPs).

Is my understanding correct that inputsize should also be the number of SNPs? If so, where is the issue, given that the conversion log correctly displays the number of individuals and the number of variants?

Many thanks, Lotfi

ArnovanHilten commented 2 years ago

Hi Lotfi,

inputsize should indeed be the number of SNPs. GenNet checks the folder for .h5 files and takes the second dimension as inputsize. During conversion GenNet transposes the plink files. My suspicion is that GenNet opened an intermediate file that was not transposed yet. Do you have multiple .h5 files in the input folder?

You can delete all the .h5 files that were created except genotype.h5.

Best,

Arno
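To check which file has which orientation, the shapes of the datasets in each .h5 file can be inspected directly. The following is a minimal sketch, not part of GenNet: the helper names read_h5_shapes and classify_orientation are hypothetical, h5py is assumed to be installed, and the expected layout (individuals, SNPs) is an assumption inferred from the statement that the second dimension is used as inputsize.

```python
from pathlib import Path


def read_h5_shapes(folder):
    """Return {file name: {dataset name: shape}} for every .h5 file in folder."""
    import h5py  # third-party; imported here so the helper below works without it

    shapes = {}
    for path in sorted(Path(folder).glob("*.h5")):
        datasets = {}
        with h5py.File(path, "r") as f:
            # Walk the file and record the shape of every dataset it contains.
            def record(name, obj):
                if isinstance(obj, h5py.Dataset):
                    datasets[name] = obj.shape

            f.visititems(record)
        shapes[path.name] = datasets
    return shapes


def classify_orientation(shape, n_individuals, n_snps):
    """Label a shape relative to the assumed (individuals, SNPs) layout."""
    if tuple(shape) == (n_individuals, n_snps):
        return "ok"
    if tuple(shape) == (n_snps, n_individuals):
        return "transposed"
    return "unexpected"
```

With the numbers from this thread, classify_orientation((60708, 500), 500, 60708) returns "transposed", which is the situation that would make inputsize evaluate to 500.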

EpiSlim commented 2 years ago

Hey Arno! From the same plink file, three intermediate .h5 files were created in the genotype subfolder: 0_plink.h5, 1_plink.h5 and 2_plink.h5.

What changes need to be made so that the final genotype.h5 has the correct dimensions?

Cheers, Lotfi

ArnovanHilten commented 2 years ago

Hi @EpiSlim,

There should be a file called genotype.h5 in GenNet_data/input; you can delete the other .h5 files.

If it is not there, it could be in the processed_data folder, but judging by your command it should be in GenNet_data/input.

Best,

Arno
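Before deleting anything, it may help to list the candidate files first. This is a small dry-run sketch using only the standard library; the function name find_extra_h5 is hypothetical and not part of GenNet, and it assumes that genotype.h5 is the only .h5 file that should remain, as described above. It only lists files and removes nothing.

```python
from pathlib import Path


def find_extra_h5(datapath, keep="genotype.h5"):
    """List .h5 files under datapath (searched recursively) other than `keep`."""
    return sorted(p for p in Path(datapath).rglob("*.h5") if p.name != keep)


if __name__ == "__main__":
    # Path taken from the convert command in this thread; adjust as needed.
    for extra in find_extra_h5("GenNet_data/input"):
        print("lingering file:", extra)
```

The search is recursive so that intermediate files in subfolders are also reported; review the printed list by hand before removing any file.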

EpiSlim commented 2 years ago

OK, do you mean that GenNet is currently using one of the other intermediate files in the training stage?

ArnovanHilten commented 2 years ago

I am not 100% certain; I would need to see more to tell. My suspicion is that there are lingering files. Your input folder / datapath should contain only:

EpiSlim commented 2 years ago

Issue fixed. Thanks!