Xilinx / logicnets

Apache License 2.0

Unable to train the network #28

Closed tsp6 closed 1 year ago

tsp6 commented 1 year ago

I am unable to train the network. I have already downloaded the dataset; here is the terminal output.

Output when the dataset is downloaded:

tsp@d8a9699d9f61:/workspace/logicnets/examples/jet_substructure$ mkdir -p data
tsp@d8a9699d9f61:/workspace/logicnets/examples/jet_substructure$ wget https://cernbox.cern.ch/index.php/s/jvFd5MoWhGs1l5v/download -O data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z
--2022-11-21 09:37:36-- https://cernbox.cern.ch/index.php/s/jvFd5MoWhGs1l5v/download
Resolving cernbox.cern.ch (cernbox.cern.ch)... 128.142.170.17, 128.142.53.35, 128.142.53.28, ...
Connecting to cernbox.cern.ch (cernbox.cern.ch)|128.142.170.17|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cernbox.cern.ch/s/jvFd5MoWhGs1l5v/download [following]
--2022-11-21 09:37:36-- https://cernbox.cern.ch/s/jvFd5MoWhGs1l5v/download
Reusing existing connection to cernbox.cern.ch:443.
HTTP request sent, awaiting response... 200 OK
Length: 3648 (3.6K) [text/html]
Saving to: ‘data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z’

data/processed-pythia82-lh 100%[=====================================>] 3.56K --.-KB/s in 0s

2022-11-21 09:37:36 (42.9 MB/s) - ‘data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z’ saved [3648/3648]

Terminal output while training:

tsp@d8a9699d9f61:/workspace/logicnets/examples/jet_substructure$ python train.py --arch jsc-s --log-dir ./jsc_s/
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    dataset['train'] = JetSubstructureDataset(dataset_cfg['dataset_file'], dataset_cfg['dataset_config'], split="train")
  File "/workspace/logicnets/examples/jet_substructure/dataset.py", line 35, in __init__
    with h5py.File(input_file, 'r') as h5py_file:
  File "/home/tsp/.local/miniconda3/lib/python3.8/site-packages/h5py/_hl/files.py", line 533, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/home/tsp/.local/miniconda3/lib/python3.8/site-packages/h5py/_hl/files.py", line 226, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

nickfraser commented 1 year ago

Thanks for this. This might be caused by an update to the h5py package. Could you run the following in Python and share the result?

import h5py
print(h5py.__version__)
tsp6 commented 1 year ago

Hello @nickfraser,

I ran the following commands in Python inside the Docker container, and the h5py version is 3.7.0.

Python 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import h5py
>>> print(h5py.__version__)
3.7.0

Thank you

nickfraser commented 1 year ago

Thank you. I think h5py might need to be pinned to an earlier version. I've tested h5py versions 2.10.0 and 3.4.0 and both work, but 3.7.0 does not.

I'll have to look into this more closely and get back to you.

nickfraser commented 1 year ago

Unfortunately, I'll have to resolve #30 before I can resolve this issue.

nickfraser commented 1 year ago

Hi @tsp6, I've now had a chance to resolve #30. However, I'm still unable to reproduce your issue with h5py version 3.7.0.

I believe there is a separate issue: I think the download has failed.

data/processed-pythia82-lh 100%[=====================================>] 3.56K --.-KB/s in 0s

The download should be ~188M, not 3.56K. See an example working log below:

data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t22     [ <=> ] 187.99M  39.8MB/s    in 5.4s

Note, the output of ls -lh ./data:

nfraser@7ffe1ddd2541:/workspace/logicnets/examples/jet_substructure$ ls -lh ./data/                                                                                                                                                            
total 188M                                                                                                                                                                                                                                     
-rw-r--r-- 1 nfraser <redacted> 188M Dec 21 18:27 processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z

Also the md5sum:

nfraser@7ffe1ddd2541:/workspace/logicnets/examples/jet_substructure$ md5sum ./data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z                                                                                        
3b91f16a1949cb6cf855442867cc26a1  ./data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z

Can you check that your md5sum matches 3b91f16a1949cb6cf855442867cc26a1? If not, please try redownloading the file. If you're behind a firewall, it may be blocking the download. In that case, try downloading the file on a different network and moving it to the working directory.
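
The same checks can be scripted before starting a training run. The following is a minimal sketch rather than part of the LogicNets repository: the helper names (`md5_of_file`, `starts_with_hdf5_signature`, `check_download`) are hypothetical, and the signature test assumes the dataset is an HDF5 file with no user block, so that the 8-byte HDF5 signature sits at offset 0. An HTML error page saved in place of the dataset, as in the 3.56K download above, would fail both the checksum and the signature test.

```python
import hashlib
import os

# Checksum quoted in this thread for the jet substructure dataset.
EXPECTED_MD5 = "3b91f16a1949cb6cf855442867cc26a1"
# First 8 bytes of an HDF5 file (assumption: the file has no user block).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def starts_with_hdf5_signature(path):
    """Return True if the file begins with the HDF5 signature bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def check_download(path, expected_md5=EXPECTED_MD5):
    """Collect the size, checksum and signature checks for one file."""
    return {
        "size_bytes": os.path.getsize(path),
        "md5_matches": md5_of_file(path) == expected_md5,
        "hdf5_signature": starts_with_hdf5_signature(path),
    }


if __name__ == "__main__":
    dataset_path = "data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z"
    if os.path.exists(dataset_path):
        print(check_download(dataset_path))
    else:
        print("File not found:", dataset_path)
```

Run from examples/jet_substructure, this prints the three checks for the downloaded file; a size of a few kilobytes or a False value points to a failed download rather than an h5py problem.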

tsp6 commented 1 year ago

Hello @nickfraser ,

Thank you for getting back to this issue.

I tried downloading again. The download is now good and the checksum is verified, but I have an error with the checkpoint:

tsp@d820679234cf:/workspace/logicnets/examples/jet_substructure$ wget https://cernbox.cern.ch/index.php/s/jvFd5MoWhGs1l5v/download -O data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z
--2023-01-02 11:11:05-- https://cernbox.cern.ch/index.php/s/jvFd5MoWhGs1l5v/download
Resolving cernbox.cern.ch (cernbox.cern.ch)... 128.142.53.28, 128.142.170.17, 128.142.53.35, ...
Connecting to cernbox.cern.ch (cernbox.cern.ch)|128.142.53.28|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cernbox.cern.ch/s/jvFd5MoWhGs1l5v/download [following]
--2023-01-02 11:11:05-- https://cernbox.cern.ch/s/jvFd5MoWhGs1l5v/download
Reusing existing connection to cernbox.cern.ch:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z’

data/processed-pyth [ <=> ] 187.99M 4.04MB/s in 46s

2023-01-02 11:11:57 (4.07 MB/s) - ‘data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z’ saved [197121108]

tsp@d820679234cf:/workspace/logicnets/examples/jet_substructure$ md5sum ./data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z
3b91f16a1949cb6cf855442867cc26a1  ./data/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z
tsp@d820679234cf:/workspace/logicnets/examples/jet_substructure$ python train.py --arch jsc-s> --log-dir ./jsc_s/
usage: train.py [-h] [--arch {jsc-s,jsc-m,jsc-l}] [--weight-decay D] [--batch-size N] [--epochs N] [--learning-rate LR] [--cuda] [--seed SEED] [--input-bitwidth INPUT_BITWIDTH] [--hidden-bitwidth HIDDEN_BITWIDTH] [--output-bitwidth OUTPUT_BITWIDTH] [--input-fanin INPUT_FANIN] [--hidden-fanin HIDDEN_FANIN] [--output-fanin OUTPUT_FANIN] [--hidden-layers HIDDEN_LAYERS [HIDDEN_LAYERS ...]] [--log-dir LOG_DIR] [--dataset-file DATASET_FILE] [--dataset-config DATASET_CONFIG] [--checkpoint CHECKPOINT]
train.py: error: unrecognized arguments: ./jsc_s/
tsp@d820679234cf:/workspace/logicnets/examples/jet_substructure$ python train.py --arch jsc-s> --log-dir ./jsc_s/
usage: train.py [-h] [--arch {jsc-s,jsc-m,jsc-l}] [--weight-decay D] [--batch-size N] [--epochs N] [--learning-rate LR] [--cuda] [--seed SEED] [--input-bitwidth INPUT_BITWIDTH] [--hidden-bitwidth HIDDEN_BITWIDTH] [--output-bitwidth OUTPUT_BITWIDTH] [--input-fanin INPUT_FANIN] [--hidden-fanin HIDDEN_FANIN] [--output-fanin OUTPUT_FANIN] [--hidden-layers HIDDEN_LAYERS [HIDDEN_LAYERS ...]] [--log-dir LOG_DIR] [--dataset-file DATASET_FILE] [--dataset-config DATASET_CONFIG] [--checkpoint CHECKPOINT]
train.py: error: unrecognized arguments: ./jsc_s/

Can you help me with this issue?

Thank you

tsp6 commented 1 year ago

Hello @nickfraser, I realised my mistake: there was an extra '>' symbol in the command. The training is now working. I will update and close the issue when neq2lut also works.

Thank you
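
For anyone who hits the same "unrecognized arguments" message: in a POSIX shell, the stray `>` in `--arch jsc-s> --log-dir ./jsc_s/` is treated as an output redirection to a file named `--log-dir`, so the script only receives `--arch jsc-s ./jsc_s/`. The sketch below reproduces that with a cut-down, hypothetical stand-in for the parser in train.py (only the two options from the failing command are modelled, and the `None` default for `--log-dir` is an assumption).

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the train.py parser, reduced to two options.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--arch", choices=["jsc-s", "jsc-m", "jsc-l"])
    parser.add_argument("--log-dir", default=None)  # default is an assumption
    return parser


# Arguments the script sees for: python train.py --arch jsc-s> --log-dir ./jsc_s/
# The shell consumes "> --log-dir" as a redirection before Python starts.
argv_with_stray_redirect = ["--arch", "jsc-s", "./jsc_s/"]
# Arguments the script sees for the intended command.
argv_intended = ["--arch", "jsc-s", "--log-dir", "./jsc_s/"]

# parse_known_args returns leftovers instead of exiting with an error.
args, leftover = build_parser().parse_known_args(argv_with_stray_redirect)
print(args.arch, args.log_dir, leftover)

args_ok, leftover_ok = build_parser().parse_known_args(argv_intended)
print(args_ok.arch, args_ok.log_dir, leftover_ok)
```

In the first case `./jsc_s/` is left over as an unrecognized positional argument and `--log-dir` is never set; in the second, nothing is left over. A file named `--log-dir` left behind in the working directory is another sign of the stray redirection.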

tsp6 commented 1 year ago

Hello @nickfraser,

Thank you for the help. It is now working perfectly fine.

Sai