Closed jowagner closed 5 years ago
The elmo_file
is an instance of NpzFile
. You are right that it is not promised anywhere, but its __iter__`` returns the result of
self.files()(https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L234), which returns the zip file entries (which are guaranteed to be in the order in which they appear in the archive; which is exactly how we generate them). So if the sentence representations were added to the archive in correct order, this code _should_ work correctly. There is an
assert` there to check that the sentence lengths match, so a mismatch would be detected.
Sorting the keys on the other hand could break the alignment easily. The default keys for an array saved by a np.savez
are arr_0
, arr_1
, ..., so sorting alphabetically would generate a wrong order.
We are using contextualized embeddings in this way (depending on zip file entry ordering) in several places, so I am not in favor of changing the behaviour. If you want, the code could explicitly call elmo_file.files
to get the zip entries directly without relying on __iter__
, but personally I do not find it necessary.
This sounds like elmo_file
is not a native Python dictionary but something numpy-specific. Unlikely the behaviour changes but how about the following solution (line 2 changed, line 3 new)? In particular for the C/C++ implementation for udpipe 2.0 this may be a better approach as it also avoids having all data in memory twice:
with np.load(elmo_path) as elmo_file:
for i in range(number_of_sentences):
value = elmo_file['arr_%d' %i]
if max_sentence_len: value = value[:max_sentence_len]
if i >= len(self._elmo): self._elmo.append(value)
Disclaimer: I don't have an elmo .npz
file to check the keys are arr_0
etc. I was looking at the code as I need to convert .hdf5
files obtained with https://github.com/HIT-SCIR/ELMoForManyLangs to .npz
to use with --elmo
.
This would require fixed names of the npz entries, which I am not sure we want to do (until now we did not require them, so we could have easily generated our npz files with different or errorneous names). The current behaviour works with Python 3.4+, the only possible problem is that future Numpy versions might decide that the iteration over NpzFile will not return the content of self.files()
, as it is currently not stated in the documentation. But I believe the probability of such change is minimal.
Well, if you plan to use https://github.com/rogersce/cnpy for the C/C++ implementation you will hit the problem there as it uses std::map
https://en.cppreference.com/w/cpp/container/map , which has no guarantees about the order of items.
Actually std::map
has a guarantee about order of items -- the elements are sorted according to their keys. So using cnpy
would not work, as you suggest. However, we will not really need .npz files in the released UDPipe -- we plan to compute them on the fly (and cache them, of course :-). Even if we used .npz files during training to save resources, we already have code for loading them :-)
Cool. Thanks. Sorry I missed the word "sorted" in the std::map
description. Suggest label "wont fix" and close.
For the C/C++ version, it will still be useful to have a way to provide user-defined vectors for tokens as currently possible with --elmo
. Note that these don't even have to be contextualised word embeddings, e.g. acoustic features for the duration of the recognised word.
Yeah, that is true (that externally provided features make sense). Will keep that in mind :-)
Looking at https://github.com/CoNLL-UD-2018/UDPipe-Future/blob/d5c64fc0e2e8e3218ea682042653eb4c45f75933/ud_dataset.py#L155-L158 doesn't this rely on the order of items returned by
items()
to be in the order of sentences in the data set? Googling "numpy dictionary order items()" I don't find any relevant specifications. In caseelmo_file
is a native Python dictionary, insertion-order preservation was only introduced in Python 3.7. Python 3.6 also behaved this way but it was not part of the specification. Is the latter the reason it worked for you or am I wrong about the above code being problematic?Depending on what the keys are (ignored here with
_
), applyingsorted()
beforeenumerate()
may either resolve the issue or break the--elmo
option.