Hi Kasper,
Which commands did you use to convert the data?
There are multiple ways to convert the data. Some steps do generate multiple .h5 files, which are later merged into a single genotype.h5 file. What are the names of the .h5 files? If they are named studyname_imputed, you still need to run the convert step with merge_transpose (I will clarify this in the documentation).
For large datasets it might indeed be wise to have multiple chunks of data to avoid very large files. If you have multiple .h5 files with patients as rows and SNPs as columns, you can use the newest version of GenNet in the current dev branch. This version supports the use of multiple .h5 files. You can just pull the dev branch and run it.
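For example, from inside your local clone (assuming origin points at the GitHub repository and you have no local changes you want to keep), something like:
cd GenNet
git fetch origin
git checkout dev
git pull origin dev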
Please let me know if it works :)
Happy to help,
Arno
Thank you for the comments! I used the command:
python GenNet.py convert \
-g /home/path/to/my/plink/files/ \
-study_name study_name \
-step hase_convert
The names of the files are 0_study_name.h5, 1_study_name... etc.
I have around half a million SNPs and 6000 patients in my dataset. This is not considered a large dataset in traditional GWAS settings, but I don't know how its size compares in this case.
6000 patients and half a million SNPs should take less than an hour to preprocess, I think. The command you used is for the first step only.
If you want to use all the steps at once use:
python GenNet.py convert \
-g /home/path/to/my/plink/files/ \
-study_name study_name \
-step all
This should take care of all the steps for you. It should skip the step you already did.
It works now, but I run into some other issues, which I suspect are related to the CPU. I only have access to a CPU, but the example classification runs fine on my computer. However, when I run it with my newly preprocessed dataset I get many error messages:
mode is classification
weight_possitive_class 1
weight_possitive_class 1
jobid = 3
folder = GenNet_experiment_3
batchsize = 32
lr = 0.001
2021-12-01 10:42:40.256692: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-12-01 10:42:40.256741: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2021-12-01 10:42:40.256769: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (host_name): /proc/driver/nvidia/version does not exist
2021-12-01 10:42:40.257123: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-12-01 10:42:40.296206: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2299850000 Hz
2021-12-01 10:42:40.302226: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f3b48000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-12-01 10:42:40.302271: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_layer (InputLayer) [(None, 500000)] 0
_________________________________________________________________
reshape (Reshape) (None, 500000, 1) 0
_________________________________________________________________
LocallyDirected_0 (LocallyDi (None, 1, 1) 500000
_________________________________________________________________
activation (Activation) (None, 1, 1) 0
_________________________________________________________________
batch_normalization (BatchNo (None, 1, 1) 2
_________________________________________________________________
LocallyDirected_1 (LocallyDi (None, 1, 1) 51
_________________________________________________________________
activation_1 (Activation) (None, 1, 1) 0
_________________________________________________________________
batch_normalization_1 (Batch (None, 1, 1) 2
_________________________________________________________________
flatten (Flatten) (None, 1) 0
_________________________________________________________________
output_layer (Dense) (None, 1) 2
_________________________________________________________________
activation_2 (Activation) (None, 1) 0
=================================================================
Total params: 499996
Trainable params: 499996
Non-trainable params: 4
_________________________________________________________________
None
WARNING:tensorflow:From /server/users/user/GenNet/GenNet_utils/Train_network.py:77: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/100
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
Traceback (most recent call last):
File "GenNet.py", line 158, in <module>
main(args)
File "GenNet.py", line 18, in main
train_classification(args)
File "//server/users/user/GenNet/GenNet_utils/Train_network.py", line 77, in train_classification
history = model.fit_generator(
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1465, in fit_generator
return self.fit(
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
tmp_logs = train_function(iterator)
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
result = self._call(*args, **kwds)
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
return self._stateless_fn(*args, **kwds)
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
return self._call_flat(
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
outputs = execute.execute(
File "/home/user/env_GenNet/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1] = 1 is not in [0, 1)
[[node gradient_tape/model/LocallyDirected_1/GatherV2_1 (defined at /server/users/user/GenNet/GenNet_utils/Train_network.py:77) ]] [Op:__inference_train_function_1725]
Function call stack:
train_function
2021-12-01 10:43:08.426339: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
I suspect this issue is caused by the fact that I don't have a CUDA GPU, but is there any way to train the network on this number of input variables on the CPU? Also, what is the maximum capacity of the network on a CPU in terms of the number of parameters? I think the model should be able to run on the CPU, since the example classification performed fine.
Kind regards, Kasper
Hi Kasper,
Everything should also work on the CPU. It will be slower, but it should work. The number of inputs, number of trainable parameters, etc. only depends on the available memory. You can use the htop command in a separate terminal, or the system monitor on Ubuntu, to check whether it fits in your memory while running.
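For example, in a second terminal (assuming htop is installed) you could run something like:
htop                 # interactive overview of CPU and memory use
watch -n 2 free -h   # or only watch the memory numbers refresh every 2 seconds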
I think the problem might be with your network layout. You go from 500,000 inputs (None, 500000, 1) to a single neuron (None, 1, 1). Don't you intend to go to, for example, 20,000 genes (None, 20000, 1)? In the next layers you go from 1 neuron to 1 neuron (but I think you supplied a mask with 50 values).
Can you describe what kind of prior knowledge/annotations you want to use and how you created the topology.csv? If you did not use a topology.csv file but .npz masks, could you explain the steps you took to create those?
Best,
Arno
I am trying to make a fully connected network in which I have 50 hidden neurons. I want to use your framework to build a network that takes 500,000 SNPs as input, followed by a gene layer in which SNPs are only connected to their annotated genes. The gene layer is then followed by a fully connected hidden layer of 50 neurons. This model run is a test of how to construct such a fully connected layer using GenNet; I am still trying to figure out how to do that. I have made the following model so far:
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_layer (InputLayer) [(None, 536578)] 0
_________________________________________________________________
reshape (Reshape) (None, 536578, 1) 0
_________________________________________________________________
LocallyDirected_0 (LocallyDi (None, 20405, 1) 556983
_________________________________________________________________
activation (Activation) (None, 20405, 1) 0
_________________________________________________________________
batch_normalization (BatchNo (None, 20405, 1) 2
_________________________________________________________________
LocallyDirected_1 (LocallyDi (None, 1, 1) 20406
_________________________________________________________________
activation_1 (Activation) (None, 1, 1) 0
_________________________________________________________________
batch_normalization_1 (Batch (None, 1, 1) 2
_________________________________________________________________
flatten (Flatten) (None, 1) 0
_________________________________________________________________
output_layer (Dense) (None, 1) 2
_________________________________________________________________
activation_2 (Activation) (None, 1) 0
=================================================================
Total params: 577,395
Trainable params: 577,391
Non-trainable params: 4
However, a network matching my description above should have more parameters than this: it should have 536,578 + (20,406 * 50) = 1,556,878. Hence the network is not defined correctly. To obtain my desired architecture I have made the following topology.csv file:
chr layer0_node layer0_name layer1_node layer1_name layer2_node layer2_name
1: 0 0 rs123 0 NOC2L 0 0
2: 0 0 rs123 0 NOC2L 1 1
3: 0 0 rs123 0 NOC2L 2 2
4: 0 0 rs123 0 NOC2L 3 3
5: 0 0 rs123 0 NOC2L 4 4
---
26828896: 25 536577 rs456 20404 tRNA Pro 45 45
26828897: 25 536577 rs456 20404 tRNA Pro 46 46
26828898: 25 536577 rs456 20404 tRNA Pro 47 47
26828899: 25 536577 rs456 20404 tRNA Pro 48 48
26828900: 25 536577 rs456 20404 tRNA Pro 49 49
Here each SNP-gene row is repeated 50 times, varying only the layer2 columns. How do I make one layer fully connected?
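A minimal sketch of how such a file can be generated with pandas (the annotation file, its name, and the snp_to_gene variable are placeholders; the column names follow the table above):

import pandas as pd

# One row per SNP with its annotated gene: chr, layer0_node, layer0_name, layer1_node, layer1_name
snp_to_gene = pd.read_csv("snp_gene_annotation.csv")

n_hidden = 50

# Repeat every SNP-gene row 50 times and number the hidden neurons 0..49,
# so that every gene is listed together with all 50 layer2 nodes.
topology = snp_to_gene.loc[snp_to_gene.index.repeat(n_hidden)].reset_index(drop=True)
topology["layer2_node"] = list(range(n_hidden)) * len(snp_to_gene)
topology["layer2_name"] = topology["layer2_node"]
topology.to_csv("topology.csv", index=False)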
Hi Kasper,
Happy new year! This is not available from the command line, but you can easily tweak the code yourself.
Go to the GenNet_utils folder and after line 73 of Create_network.py (https://github.com/ArnovanHilten/GenNet/blob/19ac2d061d2828f083947f4738679a6d9d40b23f/GenNet_utils/Create_network.py#L73) add:
model = K.layers.Dense(units=1)(model)
You probably want to have an activation function after it. So it will be something like this:
model = K.layers.Flatten()(model)
model = K.layers.Dense(units=50)(model)
model = K.layers.Activation("tanh")(model)
output_layer = K.layers.Dense(units=1, name="output_layer",
kernel_regularizer=tf.keras.regularizers.l1(l=l1_value))(model)
Just remember that you hardcoded it: if you want to run an original network you have to change it back.
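For example, assuming you did not commit the change, running this inside the repository restores the original file:
git checkout -- GenNet_utils/Create_network.py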
Hope this helps and good luck!
Note that adding dense layers will make your network less interpretable.
The network works with the dense layers you suggested and also obtains a reasonable AUC, but after applying your changes to the code, the output file "connection_weights.csv" only contains 387 rows, thus omitting many of the genes. I cannot seem to figure out what went wrong here. I would really like an output of all the weights between SNPs and genes, but also between genes and the dense layer, as well as between the dense layer and the output layer. Do you know how to make that happen?
Yes, unfortunately some of the functions will not be so easy to use if you add your own dense layers.
If you really would like to have this dense layer with 50 neurons between the gene and the output layer, I recommend starting a Jupyter notebook and copying and adapting the functions to work with your adaptations to the network.
All the weights of your network for the best score on the validation set can be found in the bestweights_job.h5 file.
Here is a link to the colab that shows how you could get the weights:
https://colab.research.google.com/drive/13o7LUzHJ19HVAOFDluRFfIqS3ZRmDKFw?usp=sharing
Instead of the Colab, you can activate your virtual environment and then type jupyter notebook. Then you can use the code from the notebook as an example to do what you want.
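As a quick way to list everything that is stored in that file (a minimal sketch, assuming bestweights_job.h5 is a regular Keras HDF5 weights file; the experiment folder below is just a placeholder):

import h5py

# Print every weight array in the best-weights file.
# The group names correspond to the layer names in the model summary printed during training.
with h5py.File("GenNet_experiment_3/bestweights_job.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape)
    f.visititems(show)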
Ok. Thank you for your assistance. I will adapt the code myself. Best of luck with your future and ongoing research!
Hi,
Thank you for developing GenNet! I am trying out GenNet on my own PLINK data, which I have converted to the .h5 format and which has been partitioned into 20 genotype .h5 files. When running the model I get the following error:
How do you run the GenNet model when the genotype data is partitioned into multiple files? I cannot find an example with multiple genotype files or how to merge them.
Kind regards!