ArnovanHilten / GenNet

Framework for Interpretable Neural Networks
Apache License 2.0

Multiple genotype.h5 files? #64

Closed KasperFischer closed 2 years ago

KasperFischer commented 2 years ago

Hi,

Thank you for developing GenNet! I am trying out GenNet on my own plink data, which I have converted to the .h5 format and partitioned into 20 genotype.h5 files. When running the model I get the following error:

genotype.h5 is missing

How do you run the GenNet model when the genotype data is partitioned into multiple files? I cannot find an example with multiple genotype files, or instructions on how to merge them.

Kind regards!

ArnovanHilten commented 2 years ago

Hi Kasper,

Which commands did you use to convert the data?

There are multiple ways to convert the data. Some steps do generate multiple .h5 files, which are merged later into one genotype.h5 file. What are the names of the .h5 files? If they are named studyname_imputed, you still need to run the convert step with merge_transpose (I will clarify this in the documentation).
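
If so, something along these lines should finish the conversion (a sketch only: the -step value merge_transpose is taken from the sentence above, and the -g / -study_name flags may need adjusting for this step):

python GenNet.py convert \
  -g /home/path/to/my/plink/files/ \
  -study_name study_name \
  -step merge_transpose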

For large datasets it might indeed be wise to keep the data in multiple chunks to avoid very large files. If you have multiple .h5 files with patients as rows and SNPs as columns, you can use the newest version of GenNet in the current dev branch, which supports multiple .h5 files. Just pull the dev branch and run it.
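
To get the dev branch, something like this should work:

git clone https://github.com/ArnovanHilten/GenNet.git
cd GenNet
git checkout dev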

Please let me know if it works :)

Happy to help,

Arno

KasperFischer commented 2 years ago

Thank you for the comments! I used the command:

python GenNet.py convert \
  -g /home/path/to/my/plink/files/ \
  -study_name study_name \
  -step hase_convert

The names of the files are 0_study_name.h5, 1_study_name... etc.

I have around half a million SNPs and 6000 patients in my dataset. This is not considered large in traditional GWAS settings, but I don't know how that compares in this case.

ArnovanHilten commented 2 years ago

6000 patients and half a million SNPs should take less than an hour to preprocess, I think. The command you used runs only the first step.

If you want to run all the steps at once, use:

python GenNet.py convert \
  -g /home/path/to/my/plink/files/ \
  -study_name study_name \
  -step all

This should take care of all the steps for you. It should skip the step you already did.

KasperFischer commented 2 years ago

It works now, but I run into some other issues, which I suspect are related to the CPU. I only have access to a CPU, but the example classification runs fine on my computer. However, when I run it with my newly preprocessed dataset I get many error messages:

mode is classification
weight_possitive_class 1
weight_possitive_class 1
jobid =  3
folder = GenNet_experiment_3
batchsize = 32
lr = 0.001
2021-12-01 10:42:40.256692: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-12-01 10:42:40.256741: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2021-12-01 10:42:40.256769: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (host_name): /proc/driver/nvidia/version does not exist
2021-12-01 10:42:40.257123: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-12-01 10:42:40.296206: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2299850000 Hz
2021-12-01 10:42:40.302226: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f3b48000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-12-01 10:42:40.302271: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_layer (InputLayer)     [(None, 500000)]          0         
_________________________________________________________________
reshape (Reshape)            (None, 500000, 1)         0         
_________________________________________________________________
LocallyDirected_0 (LocallyDi (None, 1, 1)              500000    
_________________________________________________________________
activation (Activation)      (None, 1, 1)              0         
_________________________________________________________________
batch_normalization (BatchNo (None, 1, 1)              2         
_________________________________________________________________
LocallyDirected_1 (LocallyDi (None, 1, 1)              51        
_________________________________________________________________
activation_1 (Activation)    (None, 1, 1)              0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 1, 1)              2         
_________________________________________________________________
flatten (Flatten)            (None, 1)                 0         
_________________________________________________________________
output_layer (Dense)         (None, 1)                 2         
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 499996
Trainable params: 499996
Non-trainable params: 4
_________________________________________________________________
None
WARNING:tensorflow:From /server/users/user/GenNet/GenNet_utils/Train_network.py:77: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/100
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
Traceback (most recent call last):
  File "GenNet.py", line 158, in <module>
    main(args)
  File "GenNet.py", line 18, in main
    train_classification(args)
  File "//server/users/user/GenNet/GenNet_utils/Train_network.py", line 77, in train_classification
    history = model.fit_generator(
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1465, in fit_generator
    return self.fit(
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
    return self._call_flat(
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[1] = 1 is not in [0, 1)
     [[node gradient_tape/model/LocallyDirected_1/GatherV2_1 (defined at /server/users/user/GenNet/GenNet_utils/Train_network.py:77) ]] [Op:__inference_train_function_1725]

Function call stack:
train_function

2021-12-01 10:43:08.426339: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
     [[{{node PyFunc}}]]

I suspect this issue is caused by the fact that I don't have a CUDA GPU, but is there any way to train the network with this number of input variables on the CPU? Also, what is the maximum capacity of the network on a CPU in terms of number of parameters? I am thinking the model should be able to run on the CPU, since the example classification performed fine.

Kind regards, Kasper

ArnovanHilten commented 2 years ago

Hi Kasper,

Everything should also work on the CPU. It will be slower, but it should work. The number of inputs, number of trainable parameters, etc. is only limited by memory. You can run the htop command in a separate terminal, or use the system monitor on Ubuntu, to check whether it fits in your memory while running.
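
For a quick sanity check, here is a rough back-of-envelope sketch (float32 assumed, numbers taken from your log above); since training goes through fit_generator, batches should be read from the .h5 file on the fly rather than loading the full genotype matrix into RAM:

bytes_per_float = 4
n_snps = 500_000        # inputs per sample, from the model summary
batch_size = 32         # from the training log
n_params = 499_996      # total parameters, from the model summary

batch_mb = batch_size * n_snps * bytes_per_float / 1e6
params_mb = n_params * bytes_per_float / 1e6

print(f"one input batch : ~{batch_mb:.0f} MB")   # ~64 MB
print(f"model parameters: ~{params_mb:.0f} MB")  # ~2 MB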

I think the problem might be in your network layout. You go from 500,000 inputs (None, 500000, 1) to a single neuron (None, 1, 1). Don't you intend to go to, for example, 20,000 genes (None, 20000, 1)? In the next layers you go from 1 neuron to 1 neuron (but I think you supplied a mask with 50 values).

Can you describe what kind of prior knowledge/annotations you want to use and how you created the topology.csv? If you did not use a topology.csv file but .npz masks, could you explain the steps you took to create those?

Best,

Arno

KasperFischer commented 2 years ago

I am trying to make a network with a fully connected layer of 50 hidden neurons. I want to use your framework to build a network that takes 500,000 SNPs as input, followed by a gene layer in which SNPs are connected only to their annotated genes. The gene layer is then followed by a fully connected hidden layer of 50 neurons. This model run was a test of how to construct such a network using GenNet; I am still trying to figure out how to do that. I have made the following model so far:

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_layer (InputLayer)     [(None, 536578)]          0         
_________________________________________________________________
reshape (Reshape)            (None, 536578, 1)         0         
_________________________________________________________________
LocallyDirected_0 (LocallyDi (None, 20405, 1)          556983    
_________________________________________________________________
activation (Activation)      (None, 20405, 1)          0         
_________________________________________________________________
batch_normalization (BatchNo (None, 20405, 1)          2         
_________________________________________________________________
LocallyDirected_1 (LocallyDi (None, 1, 1)              20406     
_________________________________________________________________
activation_1 (Activation)    (None, 1, 1)              0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 1, 1)              2         
_________________________________________________________________
flatten (Flatten)            (None, 1)                 0         
_________________________________________________________________
output_layer (Dense)         (None, 1)                 2         
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 577,395
Trainable params: 577,391
Non-trainable params: 4

However, a network matching my description above should have more parameters than this: it should have 536,578 + (20,406 * 50) = 1,556,878. Hence the network is not defined correctly. To obtain my desired architecture I have made the following topology.csv file:

        chr layer0_node layer0_name layer1_node layer1_name layer2_node layer2_name
       1:   0           0   rs123           0       NOC2L           0           0
       2:   0           0   rs123           0       NOC2L           1           1
       3:   0           0   rs123           0       NOC2L           2           2
       4:   0           0   rs123           0       NOC2L           3           3
       5:   0           0   rs123           0       NOC2L           4           4
      ---                                                                            
26828896:  25      536577 rs456       20404    tRNA Pro          45          45
26828897:  25      536577 rs456       20404    tRNA Pro          46          46
26828898:  25      536577 rs456       20404    tRNA Pro          47          47
26828899:  25      536577 rs456       20404    tRNA Pro          48          48
26828900:  25      536577 rs456       20404    tRNA Pro          49          49

Here each SNP is repeated 50 times, varying only the layer2 columns (see the sketch below for how such a table can be generated). How do I make one layer fully connected?
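
For reference, a rough pandas sketch of generating such a table (the annotation frame below is a toy stand-in for the real SNP-to-gene annotation with 536,578 SNPs and 20,405 genes):

import pandas as pd

# Toy SNP-to-gene annotation; column names follow the topology.csv convention above.
annot = pd.DataFrame({
    "chr":         [0,       0,       25],
    "layer0_node": [0,       1,       536577],
    "layer0_name": ["rs123", "rs124", "rs456"],
    "layer1_node": [0,       0,       20404],
    "layer1_name": ["NOC2L", "NOC2L", "tRNA Pro"],
})

# 50 hidden-layer nodes; every SNP row is repeated once per node, varying only layer2.
hidden = pd.DataFrame({"layer2_node": range(50), "layer2_name": range(50)})

topology = annot.merge(hidden, how="cross")  # cross join, requires pandas >= 1.2
topology.to_csv("topology.csv", index=False)
print(topology.shape)  # (150, 7) for this toy; 26,828,900 rows with the full annotation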

ArnovanHilten commented 2 years ago

Hi Kasper

Happy new year! This is not available from the command line, but you can easily tweak the code yourself.

Go to the GenNet_utils folder and, after line 73 of Create_network.py (https://github.com/ArnovanHilten/GenNet/blob/19ac2d061d2828f083947f4738679a6d9d40b23f/GenNet_utils/Create_network.py#L73), add:

model = K.layers.Dense(units=1)(model)

You probably want an activation function after it, so it will look something like this:

model = K.layers.Flatten()(model)

model = K.layers.Dense(units=50)(model)
model = K.layers.Activation("tanh")(model)

output_layer = K.layers.Dense(units=1, name="output_layer",
                              kernel_regularizer=tf.keras.regularizers.l1(l=l1_value))(model)
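
For reference, here is a self-contained sketch (not GenNet's actual Create_network code) of how this tail attaches to a gene-level tensor, using the layer sizes from this thread; the final sigmoid is an assumption for binary classification:

import tensorflow as tf
import tensorflow.keras as K

n_genes = 20405   # gene layer size from the model summary above
l1_value = 0.01   # placeholder; GenNet passes its own l1 value here

# Stand-in for the output of the SNP-to-gene LocallyDirected layer.
gene_layer = K.layers.Input(shape=(n_genes, 1))

model = K.layers.Flatten()(gene_layer)

# The extra fully connected hidden layer with 50 neurons.
model = K.layers.Dense(units=50)(model)
model = K.layers.Activation("tanh")(model)

output_layer = K.layers.Dense(units=1, name="output_layer",
                              kernel_regularizer=tf.keras.regularizers.l1(l=l1_value))(model)
output_layer = K.layers.Activation("sigmoid")(output_layer)  # assumed for classification

tail = K.Model(inputs=gene_layer, outputs=output_layer)
tail.summary()  # the Dense(50) layer adds 20405*50 + 50 = 1,020,300 parameters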

Just remember that this change is hardcoded: if you want to run an original network, you have to change it back.

Hope this helps and good luck!

ArnovanHilten commented 2 years ago

Note that adding dense layers will make your network less interpretable.

KasperFischer commented 2 years ago

The network works with the dense layers you suggested and also obtains a reasonable AUC, but after applying your changes to the code, the output file "connection_weights.csv" only contains 387 rows, thus omitting many of the gene-layer connections. I cannot figure out what went wrong here. I would really like an output of all the weights between SNPs and genes, but also between genes and the dense layer, and between the dense layer and the output layer. Do you know how to make that happen?

ArnovanHilten commented 2 years ago

Yes, unfortunately some of the functions will not be so easy to use if you add your own dense layers.

If you really want this dense layer with 50 neurons between the gene layer and the output layer, I recommend starting a Jupyter notebook and copying and adapting the functions to work with your changes to the network.

All the weights for your network, for the best score on the validation set, can be found in bestweights_job.h5.

Here is a link to a Colab notebook that shows how you could get the weights:

https://colab.research.google.com/drive/13o7LUzHJ19HVAOFDluRFfIqS3ZRmDKFw?usp=sharing

Instead of Colab, you can activate your virtual environment and then run jupyter notebook locally. Then you can use the code from the notebook as an example to do what you want.
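
As a starting point, here is a minimal sketch for inspecting the saved weights directly, assuming bestweights_job.h5 is a standard Keras HDF5 weights file (the path follows the GenNet_experiment_3 folder name from the log and may differ for your run):

import h5py

# List every weight tensor stored in the best-weights file together with its shape.
with h5py.File("GenNet_experiment_3/bestweights_job.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape)
    f.visititems(show)
    # Once you know a dataset's name you can read it as a numpy array, e.g.:
    # snp_gene_weights = f["<dataset name printed above>"][()]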

KasperFischer commented 2 years ago

Ok. Thank you for your assistance. I will adapt the code myself. Best of luck with your future and ongoing research!