Closed giovannipizzi closed 2 years ago
I am unsure whether this will break the import, i.e., the unpacking of the archives.
Can you not "open" the folder passing encoding='utf8'
, or will this not work with py2?
@CasperWA I think the effect of what you are suggesting and what I am doing should be the same?
@CasperWA I think the effect of what you are suggesting and what I am doing should be the same?
It may be. I am simply unsure what effect the change from w
to wb
will have on the import - probably none, but again, I am unsure.
Actually the suggestion of Giovanni produces the following error
STARTING EXPORT...
RETRIEVING LINKED NODES AND STORING LINKS...
STORING DATABASE ENTRIES...
Exporting a total of 295041 db entries, of which 196012 nodes.
STORING NODE ATTRIBUTES AND EXTRAS...
STORING GROUP ELEMENTS...
STORING DATA...
Traceback (most recent call last):
File "/home/marrazzo/.virtualenvs/migrate_scdm/bin/verdi", line 10, in <module>
sys.exit(verdi())
File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/cmdline/utils/decorators.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/cmdline/commands/cmd_export.py", line 138, in create
export_function(entities, outfile=output_file, **kwargs)
File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/__init__.py", line 92, in export_zip
export_tree(what, folder=folder, silent=silent, **kwargs)
File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/__init__.py", line 415, in export_tree
with folder.open('data.json', mode='wb') as fhandle:
File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/zip.py", line 106, in open
return self._zipfile.open(self._get_internal_path(fname), mode)
File "/usr/lib/python2.7/zipfile.py", line 940, in open
raise RuntimeError, 'open() requires mode "r", "U", or "rU"'
RuntimeError: open() requires mode "r", "U", or "rU"
Indeed, I was assuming that open
of zip works like a standard open, but actually it always open a binary file (at least in python 3): Access a member of the archive as a binary file-like object
(see here).
So we just need to encode, without the need of opening as 'b'. I hope this works also in python2... (anyway we'll try it now):
# N.B. We're really calling zipfolder.open
with folder.open('data.json', mode='w') as fhandle:
# fhandle.write(json.dumps(data, cls=UUIDEncoder))
fhandle.write(json.dumps(data).encode('utf-8'))
What if the export is not done to a zip folder but the .tar.gz? That will not go through the MyZipFolder
class, correct?
Thanks for the testing @AntimoMarrazzo.
For zip files, only "w"
is allowed, because we check for it in our own ZipFolder
class and return an instance of our own MyWritingZipFile
class.
Another thing: The encoding as done by @giovannipizzi above, is the only way to ensure UTF-8 encoding, since the open
function does not have an encoding
parameter (for our own ZipFolder
).
@sphuber yes, if saving as a tar file, this is all water under the bridge.
@sphuber yes, if saving as a tar file, this is all water under the bridge.
As in the original issue is not a problem there, or it is? I think if there you call open(.., 'w')
and then try to write bytes to it, it will fail. The fact that our MyZipFolder
ignores the mode
and silently always opens a byte stream is another problem.
(...) The fact that our
MyZipFolder
ignores themode
and silently always opens a byte stream is another problem.
ZipFolder
does not ignore mode, it checks whether "w"
is passed and returns an instance of MyWritingZipFile
if true, otherwise it uses zipfile.open(..., mode=mode)
.
In MyWritingZipFile
a file is written to the zipfile using zipfile.writestr()
passing a six.moves.StringIO()
, so I guess at no point are we writing bytes, but rather strings.
For tar we are going through the Folder
class' open
method, which is just a proxy for io.open()
, I believe, and then the whole folder tree that has been created on the file system will be tarred and gzipped at the end.
For tar we are going through the
Folder
class'open
method, which is just a proxy forio.open()
, I believe, and then the whole folder tree that has been created on the file system will be tarred and gzipped at the end.
This is my point though, if you call open
with w
but then write bytes, it is going to crash. So doing
with folder.open('data.json', mode='w') as fhandle:
fhandle.write(json.dumps(data).encode('utf-8'))
will perhaps work if folder
sneakily is a zipfolder, but fail spectacularly if it is a normal sandbox folder.
Just a follow up, with the latest version of the code the export seems to work fine. We tested a small export/import, now we have a big export file being imported by Giovanni, let's see how that goes and possibly close the issue right after.
I confirm that the file Antimo exported works correctly, I could import it and everything is there and correct.
Antimo, could you please post your git diff
output to make sure we apply the correct change?
Sure!
git diff aiida/tools/importexport/dbexport/__init__.py
diff --git a/aiida/tools/importexport/dbexport/__init__.py b/aiida/tools/importexport/dbexport/__init__.py
index d27b1be..8c2b5f7 100644
--- a/aiida/tools/importexport/dbexport/__init__.py
+++ b/aiida/tools/importexport/dbexport/__init__.py
@@ -414,7 +414,7 @@ def export_tree(
# N.B. We're really calling zipfolder.open (if exporting a zipfile)
with folder.open('data.json', mode='w') as fhandle:
# fhandle.write(json.dumps(data, cls=UUIDEncoder))
- fhandle.write(json.dumps(data))
+ fhandle.write(json.dumps(data).encode('utf-8'))
# Add proper signature to unique identifiers & all_fields_info
# Ignore if a key doesn't exist in any of the two dictionaries
I don't understand how that diff makes it work, if anything it should fail. The handle fhandle
is opened in text mode but the write
call receives bytes, because the json.dumps
response is encoded. This should crash... so I am confused
I don't understand how that diff makes it work, if anything it should fail. The handle
fhandle
is opened in text mode but thewrite
call receives bytes, because thejson.dumps
response is encoded. This should crash... so I am confused
It does indeed fail for Py3.
Hi @CasperWA @sphuber Here what I think is a minimal example reproducing the problem:
#encoding: utf-8
import os
import six
import zipfile
from aiida.tools.importexport.dbexport.zip import ZipFolder
from aiida.common import json
utfstring = u'aà'
with ZipFolder('test.zip', mode='w', use_compression=True) as folder:
with folder.open('data.json', mode='w') as fhandle:
fhandle.write(json.dumps(utfstring))
Comments:
The fix reported by Antimo:
- fhandle.write(json.dumps(data))
+ fhandle.write(json.dumps(data).encode('utf-8'))
Is clearly failing as reported for python 3, but it works and does the correct thing (!) for py2, I guess due to the confusion of bytes vs. strings in py2.
Important note: if we were to use the standard
json
rather than our json, this could be solved in py2 by adding ensure_ascii=True
to the dumps
function - instead we are using our own json
that uses ensure_ascii=False
, and we cannot override.
In addition, I think the the use we do internally of MyWritingZipFile complicates the situation (can we get rid of it?) because it opens the file as a string instead of the standard approach of opening a bytes stream. Maybe this was on purpose to have a common interface among Folder
objects?
In addition, I think the the use we do internally of MyWritingZipFile complicates the situation (can we get rid of it?) because it opens the file as a string instead of the standard approach of opening a bytes stream. Maybe this was on purpose to have a common interface among
Folder
objects?
This is the subject of one of my private branches, and indeed also PR #3242. I will pick this up as soon as I have time.
Closing this since it is unlikely to be relevant after the refactoring of the archive code
Here is the error, reported by @AntimoMarrazzo:
@sphuber @CasperWA do you agree that this would be a correct fix?
Changes: mode from
w
towb
, and adding.encode('utf-8')
. I think this should work both in py2 and py3?