KhronosGroup / NNEF-Tools

The NNEF Tools repository contains tools to generate and consume NNEF documents
https://www.khronos.org/nnef

Problem converting from Caffe2 to NNEF #57

Closed AnthonyBarbier closed 5 years ago

AnthonyBarbier commented 5 years ago

Created from https://github.com/caffe2/models/tree/master/googlenet (commit 65b8ec4b1eccfbce30e6ab70492c674aafbe007c). Command used to convert the model:

    python -m nnef_converter.convert --input-framework caffe2 --output-framework NNEF --input-model predict_net.pb --data-model init_net.pb --value-info value_info.json --output-model googlenet/graph.nnef

Reading one of the exported tensors back then fails:

    // Read the fixed-size tensor header from the exported .dat file
    std::string _filename = "../../googlenet/conv1/7x7_s2_w.dat";
    std::ifstream bin(_filename.c_str(), std::ios::binary);
    nnef::TensorHeader header;
    bin.read((char*)&header, sizeof(header));

    nnef::validate_tensor_header(header);   // throws std::runtime_error here

I added the following debug code in nnef/common/binary.h:

            std::cout<<"Data length "<<header.data_length<<" Item count "<<item_count<<" bits_per_item "<<header.bits_per_item<< " Calculated len "<<(item_count * header.bits_per_item + 7 ) / 8<<std::endl;
            std::cout<<"extents : "<<header.extents<<" Rank "<<header.rank<<std::endl;

Exception:

Data length 32 Item count 9408 bits_per_item 1042716301 Calculated len 21206968
extents : 0x7ffd0c069bdc Rank 4
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to read binary header from file ../../googlenet/conv1/7x7_s2_w.dat : data length is not compatible with extents and bits per item

The tensor is 64x3x7x7, so the item count is correct; however, bits_per_item and data_length are wrong. (Note that the extents line above prints the array's address rather than its contents.)

nnef_convert.log
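As a side note, the consistency check that fails here can be reproduced outside the C++ code. Below is a minimal Python sketch that parses the leading fields of a tensor header and applies the same data-length check. The field offsets (magic, version, data_length, rank, 8 zero-padded uint32 extents, bits_per_item, quant_code, 128 bytes total) follow my reading of the final NNEF container format and should be treated as assumptions, not an authoritative layout:

```python
import struct

HEADER_SIZE = 128  # the final NNEF spec fixes the header at 128 bytes

def parse_nnef_header(raw):
    """Parse the leading fields of an NNEF tensor binary header.

    Offsets follow my reading of the final NNEF container format;
    treat them as assumptions rather than a normative layout.
    """
    if len(raw) < HEADER_SIZE:
        raise ValueError("header truncated: got %d bytes, expected %d"
                         % (len(raw), HEADER_SIZE))
    magic, major, minor = struct.unpack_from('<2sBB', raw, 0)
    data_length, rank = struct.unpack_from('<II', raw, 4)
    extents = struct.unpack_from('<8I', raw, 12)[:rank]
    bits_per_item, quant_code = struct.unpack_from('<II', raw, 44)

    # Same consistency check as validate_tensor_header in nnef/common/binary.h
    item_count = 1
    for e in extents:
        item_count *= e
    if data_length != (item_count * bits_per_item + 7) // 8:
        raise ValueError("data length is not compatible with extents and bits per item")
    return dict(data_length=data_length, rank=rank, extents=extents,
                bits_per_item=bits_per_item, quant_code=quant_code)

# Build a well-formed header for a 64x3x7x7 float32 tensor and check it parses.
extents = (64, 3, 7, 7)
data_length = 64 * 3 * 7 * 7 * 4  # 9408 items * 4 bytes
hdr = struct.pack('<2sBBII', b'N\xef', 1, 0, data_length, len(extents))
hdr += struct.pack('<8I', *(extents + (0,) * 4))   # extents, zero-padded to 8
hdr += struct.pack('<II', 32, 0)                   # bits_per_item=32, quant_code=0
hdr += b'\x00' * (HEADER_SIZE - len(hdr))          # pad to 128 bytes
fields = parse_nnef_header(hdr)
print(fields['data_length'], fields['extents'])    # 37632 (64, 3, 7, 7)
```

With a header written in the old provisional layout, the fields land at the wrong offsets and this check fails exactly as in the C++ reader above.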

gyenesvi commented 5 years ago

The header seems to be completely wrong (except for the rank). It would be good to print the extents item by item, since the debug code prints only the array's address. Where does item_count come from? It does not appear to come from the header that was read, or do you calculate it from there? By the way, did you check that the path is correct and that the ifstream opened successfully? What throws the error, the validate_tensor_header call?

AnthonyBarbier commented 5 years ago

Hi Viktor,

The stream seems to be ok.

item_count is the one used inside validate_tensor_header; I added the cout statements just before the exception gets thrown:

        const uint32_t item_count = std::accumulate(header.extents, header.extents + header.rank, (uint32_t)1, std::multiplies<uint32_t>());
        if ( header.data_length != (item_count * header.bits_per_item + 7) / 8 )
        {
            std::cout<<"Data length "<<header.data_length<<" Item count "<<item_count<<" bits_per_item "<<header.bits_per_item<< " Calculated len "<<(item_count * header.bits_per_item + 7 ) / 8<<std::endl;
            std::cout<<"extents : "<<header.extents<<" Rank "<<header.rank<<std::endl;
            throw Error("data length is not compatible with extents and bits per item");
        }

So it looks like the extents are correct, but the rest (data_length and bits_per_item) is not.

Here is what I get in gdb:

    $1 = (const nnef::TensorHeader &) @0x7fffffffc6d0: {magic = <incomplete sequence \357>, version = "\001", data_length = 32, rank = 4, extents = {64, 3, 7, 7, 8192, 1035248823, 1025065579, 1028203718}, bits_per_item = 1042716301, quant_code = 1042363606, quant_params = "\374\003\221<ܜ\177\273\017\017a=\226\344\255<\037\377\337=+\201[>#\363\031<\031\234F\276m<\017\276\373\273\242<ܥh=\n\360\034>\027\265[=?Τ\276\243\214ɾ\"\366L\276Hc\227=\271ޘ=[\215\200\274"}

The file on disk is 37664 bytes, which doesn't match item_count * 4 + header size = 9408 * 4 + 128 = 37760.

My guess is that only 32 bytes of header are being written instead of the full 128.

Will investigate further
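The size arithmetic above checks out: the file is short by exactly 96 bytes, which is the difference between a full 128-byte header and a 32-byte one, consistent with the guess that only 32 header bytes are written. A quick sketch (numbers taken from the comment above):

```python
# Checking the size mismatch reported above.
item_count = 64 * 3 * 7 * 7        # 9408 items in a 64x3x7x7 tensor
expected = item_count * 4 + 128    # float32 payload plus a full 128-byte header
actual = 37664                     # file size on disk
shortfall = expected - actual
print(expected, shortfall)         # 37760, 96: exactly 128 - 32 bytes missing
```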

AnthonyBarbier commented 5 years ago

Ok, so looking a bit deeper, I think there are a few differences between the way the headers are handled in Python and C++.

AnthonyBarbier commented 5 years ago

The following patch seems to make it work for me (although it's quite dirty, which is why I haven't created a pull request):

diff --git a/au-zone/nnef_converter/common/nnef_data.py b/au-zone/nnef_converter/common/nnef_data.py
index 0022918..bd61849 100644
--- a/au-zone/nnef_converter/common/nnef_data.py
+++ b/au-zone/nnef_converter/common/nnef_data.py
@@ -199,6 +199,8 @@ class Header:
             self._b_tensor_dimensions = b''
             for dim in value:
                 self._b_tensor_dimensions += struct.pack('<I', dim)
+            while len(self._b_tensor_dimensions) < 8*4:
+                self._b_tensor_dimensions += struct.pack('<I',0)
         else:
             self._tensor_dimensions.append(struct.unpack('<I', value)[0]) #int.from_bytes(value, byteorder='little', signed=False))
             self._b_tensor_dimensions += value
@@ -250,6 +252,7 @@ class Header:
         self._len_quantize_algo_string = ''
         self._quantize_algo_string= ''
         self._modified = False
+        self._data_length = 0

         self.header_field_sizes = collections.OrderedDict()
         self.header_field_sizes['magic_number']=2
@@ -274,6 +277,9 @@ class Header:
         assert isinstance(nnef_dtype, NNEFDType), "Header requires nnef_dtype to be of type NNEFDType"
         assert isinstance(quantize_algo_string, str), "Header requires quantize_algo_string to be of type str"

+        self._data_length = nnef_dtype.get_nnef_bit_width() / 8
+        for d in tensor_shape:
+            self._data_length = self._data_length * d
         self.set_tensor_rank(len(tensor_shape))
         self.set_tensor_dimensions(tensor_shape)
         self.set_data_type(nnef_dtype.get_nnef_base_type_value())   #data_type.value
@@ -299,14 +305,14 @@ class Header:
             f.write(self._b_magic_number)
             f.write(self._b_version_major)
             f.write(self._b_version_minor)
-            f.write(self._b_data_offset)
+            f.write(struct.pack('I', self._data_length))
             f.write(self._b_tensor_rank)
             f.write(self._b_tensor_dimensions)
-            f.write(self._b_data_type)
-            f.write(self._b_bit_width)
-            f.write(self._b_len_quantize_algo_string)
+            #f.write(self._b_data_type)
+            f.write(struct.pack('I',int(struct.unpack('B',self._b_bit_width)[0])))
+            f.write(struct.pack('I',int(struct.unpack('H',self._b_len_quantize_algo_string)[0])))
             f.write(self._b_quantize_algo_string)
-            return len(f.getvalue()), f.getvalue()
+            return 128, f.getvalue()

     def read_content(self, f):
         try:

header_fix.txt
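What the patch above hand-rolls is essentially "always emit a 128-byte header with 32-bit little-endian fields and zero-padded extents". For comparison, here is a self-contained sketch of such a writer in plain Python; the field order and offsets are my reading of the final NNEF container format, so treat them as assumptions:

```python
import struct

def pack_nnef_header(extents, bits_per_item=32, quant_code=0):
    """Pack a 128-byte NNEF tensor header in the final (non-provisional) layout.

    Field order assumed: magic 'N\\xef', version 1.0, data_length, rank,
    8 zero-padded uint32 extents, bits_per_item, quant_code, zero padding.
    """
    rank = len(extents)
    assert rank <= 8, "the fixed-size header holds at most 8 extents"
    item_count = 1
    for e in extents:
        item_count *= e
    data_length = (item_count * bits_per_item + 7) // 8  # payload size in bytes

    header = struct.pack('<2sBB', b'N\xef', 1, 0)        # magic, version 1.0
    header += struct.pack('<II', data_length, rank)
    header += struct.pack('<8I', *(tuple(extents) + (0,) * (8 - rank)))
    header += struct.pack('<II', bits_per_item, quant_code)
    header += b'\x00' * (128 - len(header))              # quant params / reserved
    return header

hdr = pack_nnef_header((64, 3, 7, 7))
print(len(hdr))  # 128
```

Note the '<' in every format string: the provisional Python code mixes sizes (1-byte bit width, 2-byte string length) and native byte order, which is one of the differences the patch papers over.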

gyenesvi commented 5 years ago

The commit 65b8ec4b1eccfbce30e6ab70492c674aafbe007c is for NNEF tools or for GoogleNet? I can't find it in this repo.

Anyway, it seems to me that the Python code exports in the provisional binary format, which was completely changed in the final version; that's why it does not match. I wonder why we never caught this during testing; probably the tests used the same kind of import code, which masked the problem.

I'll let Au-Zone know this and discuss fixing it. Actually, the exporter should be using the IO code available in the nnef parser package, and should not be duplicated.

AnthonyBarbier commented 5 years ago

That's the hash from the Caffe2 model zoo. The NNEF-Tools hash is 6825938ba466ead23be5e43a470c918fbd7d78ab.

> I'll let Au-Zone know this and discuss fixing it. Actually, the exporter should be using the IO code available in the nnef parser package, and should not be duplicated.

Agreed, I assume that's what you've done for the Caffe parser and I haven't seen any issue with it.

gyenesvi commented 5 years ago

Maybe you could try to fix it based on the Caffe code, invoking the same binary export method (should be nnef.write_tensor with a file name and a numpy array argument). I am not working during the holidays, and I am not sure about the Au-Zone guys either, so if you need something quick, you might be better off not waiting for them.

gyenesvi commented 5 years ago

We have pushed some updates to the Caffe converters, and reading the binaries should be fixed. Can you check whether it works now?