CoNLL-UD-2018 / UDPipe-Future

CoNLL 2018 Shared Task Team UDPipe-Future
Mozilla Public License 2.0
39 stars 12 forks source link

relying on undocumented numpy dictionary order? #12

Closed jowagner closed 5 years ago

jowagner commented 5 years ago

Looking at https://github.com/CoNLL-UD-2018/UDPipe-Future/blob/d5c64fc0e2e8e3218ea682042653eb4c45f75933/ud_dataset.py#L155-L158 doesn't this rely on the order of items returned by items() to be in the order of sentences in the data set? Googling "numpy dictionary order items()" I don't find any relevant specifications. In case elmo_file is a native Python dictionary, insertion-order preservation was only introduced in Python 3.7. Python 3.6 also behaved this way but it was not part of the specification. Is the latter the reason it worked for you or am I wrong about the above code being problematic?

Depending on what the keys are (ignored here with _), applying sorted() before enumerate() may either resolve the issue or break the --elmo option.

foxik commented 5 years ago

The elmo_file is an instance of NpzFile. You are right that it is not promised anywhere, but its __iter__`` returns the result ofself.files()(https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L234), which returns the zip file entries (which are guaranteed to be in the order in which they appear in the archive; which is exactly how we generate them). So if the sentence representations were added to the archive in correct order, this code _should_ work correctly. There is anassert` there to check that the sentence lengths match, so a mismatch would be detected.

Sorting the keys on the other hand could break the alignment easily. The default keys for an array saved by a np.savez are arr_0, arr_1, ..., so sorting alphabetically would generate a wrong order.

We are using contextualized embeddings in this way (depending on zip file entry ordering) in several places, so I am not in favor of changing the behaviour. If you want, the code could explicitly call elmo_file.files to get the zip entries directly without relying on __iter__, but personally I do not find it necessary.

jowagner commented 5 years ago

This sounds like elmo_file is not a native Python dictionary but something numpy-specific. Unlikely the behaviour changes but how about the following solution (line 2 changed, line 3 new)? In particular for the C/C++ implementation for udpipe 2.0 this may be a better approach as it also avoids having all data in memory twice:

with np.load(elmo_path) as elmo_file: 
    for i in range(number_of_sentences):
        value = elmo_file['arr_%d' %i]
        if max_sentence_len: value = value[:max_sentence_len] 
        if i >= len(self._elmo): self._elmo.append(value)

Disclaimer: I don't have an elmo .npz file to check the keys are arr_0 etc. I was looking at the code as I need to convert .hdf5 files obtained with https://github.com/HIT-SCIR/ELMoForManyLangs to .npz to use with --elmo.

foxik commented 5 years ago

This would require fixed names of the npz entries, which I am not sure we want to do (until now we did not require them, so we could have easily generated our npz files with different or errorneous names). The current behaviour works with Python 3.4+, the only possible problem is that future Numpy versions might decide that the iteration over NpzFile will not return the content of self.files(), as it is currently not stated in the documentation. But I believe the probability of such change is minimal.

jowagner commented 5 years ago

Well, if you plan to use https://github.com/rogersce/cnpy for the C/C++ implementation you will hit the problem there as it uses std::map https://en.cppreference.com/w/cpp/container/map , which has no guarantees about the order of items.

foxik commented 5 years ago

Actually std::map has a guarantee about order of items -- the elements are sorted according to their keys. So using cnpy would not work, as you suggest. However, we will not really need .npz files in the released UDPipe -- we plan to compute them on the fly (and cache them, of course :-). Even if we used .npz files during training to save resources, we already have code for loading them :-)

jowagner commented 5 years ago

Cool. Thanks. Sorry I missed the word "sorted" in the std::map description. Suggest label "wont fix" and close.

For the C/C++ version, it will still be useful to have a way to provide user-defined vectors for tokens as currently possible with --elmo. Note that these don't even have to be contextualised word embeddings, e.g. acoustic features for the duration of the recognised word.

foxik commented 5 years ago

Yeah, that is true (that externally provided features make sense). Will keep that in mind :-)