aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io
Other
436 stars 189 forks source link

Error exporting some files (another Unicode issue) #3492

Closed giovannipizzi closed 2 years ago

giovannipizzi commented 5 years ago

Here is the error, reported by @AntimoMarrazzo:

STARTING EXPORT...
RETRIEVING LINKED NODES AND STORING LINKS...
STORING DATABASE ENTRIES...
Exporting a total of 342751 db entries, of which 227939 nodes.
STORING NODE ATTRIBUTES AND EXTRAS...
STORING GROUP ELEMENTS...
STORING DATA...
Traceback (most recent call last):
  File "/home/marrazzo/.virtualenvs/migrate_scdm/bin/verdi", line 10, in <module>
    sys.exit(verdi())
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/cmdline/utils/decorators.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/cmdline/commands/cmd_export.py", line 138, in create
    export_function(entities, outfile=output_file, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/__init__.py", line 92, in export_zip
    export_tree(what, folder=folder, silent=silent, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/__init__.py", line 417, in export_tree
    fhandle.write(json.dumps(data))
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/zip.py", line 47, in __exit__
    self.close()
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/zip.py", line 39, in close
    self._zipfile.writestr(self._fname, self._buffer.read())
  File "/usr/lib/python2.7/zipfile.py", line 1236, in writestr
    zinfo.CRC = crc32(bytes) & 0xffffffff       # CRC-32 checksum
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 281806205: ordinal not in range(128)

@sphuber @CasperWA do you agree that this would be a correct fix?

    # N.B. We're really calling zipfolder.open
    with folder.open('data.json', mode='wb') as fhandle:
        # fhandle.write(json.dumps(data, cls=UUIDEncoder))
        fhandle.write(json.dumps(data).encode('utf-8'))

Changes: mode from w to wb, and adding .encode('utf-8'). I think this should work both in py2 and py3?

CasperWA commented 5 years ago

I am unsure whether this will break the import, i.e., the unpacking of the archives. Can you not "open" the folder passing encoding='utf8', or will this not work with py2?

giovannipizzi commented 5 years ago

@CasperWA I think the effect of what you are suggesting and what I am doing should be the same?

CasperWA commented 5 years ago

@CasperWA I think the effect of what you are suggesting and what I am doing should be the same?

It may be. I am simply unsure what effect the change from w to wb will have on the import - probably none, but again, I am unsure.

AntimoMarrazzo commented 5 years ago

Actually the suggestion of Giovanni produces the following error

STARTING EXPORT...
RETRIEVING LINKED NODES AND STORING LINKS...
STORING DATABASE ENTRIES...
Exporting a total of 295041 db entries, of which 196012 nodes.
STORING NODE ATTRIBUTES AND EXTRAS...
STORING GROUP ELEMENTS...
STORING DATA...
Traceback (most recent call last):
  File "/home/marrazzo/.virtualenvs/migrate_scdm/bin/verdi", line 10, in <module>
    sys.exit(verdi())
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/local/lib/python2.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/cmdline/utils/decorators.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/cmdline/commands/cmd_export.py", line 138, in create
    export_function(entities, outfile=output_file, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/__init__.py", line 92, in export_zip
    export_tree(what, folder=folder, silent=silent, **kwargs)
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/__init__.py", line 415, in export_tree
    with folder.open('data.json', mode='wb') as fhandle:
  File "/home/marrazzo/.virtualenvs/migrate_scdm/aiida-core/aiida/tools/importexport/dbexport/zip.py", line 106, in open
    return self._zipfile.open(self._get_internal_path(fname), mode)
  File "/usr/lib/python2.7/zipfile.py", line 940, in open
    raise RuntimeError, 'open() requires mode "r", "U", or "rU"'
RuntimeError: open() requires mode "r", "U", or "rU"
giovannipizzi commented 5 years ago

Indeed, I was assuming that open of zip works like a standard open, but actually it always open a binary file (at least in python 3): Access a member of the archive as a binary file-like object (see here).

So we just need to encode, without the need of opening as 'b'. I hope this works also in python2... (anyway we'll try it now):

    # N.B. We're really calling zipfolder.open
    with folder.open('data.json', mode='w') as fhandle:
        # fhandle.write(json.dumps(data, cls=UUIDEncoder))
        fhandle.write(json.dumps(data).encode('utf-8'))
sphuber commented 5 years ago

What if the export is not done to a zip folder but the .tar.gz? That will not go through the MyZipFolder class, correct?

CasperWA commented 5 years ago

Thanks for the testing @AntimoMarrazzo. For zip files, only "w" is allowed, because we check for it in our own ZipFolder class and return an instance of our own MyWritingZipFile class.

Another thing: The encoding as done by @giovannipizzi above, is the only way to ensure UTF-8 encoding, since the open function does not have an encoding parameter (for our own ZipFolder).

@sphuber yes, if saving as a tar file, this is all water under the bridge.

sphuber commented 5 years ago

@sphuber yes, if saving as a tar file, this is all water under the bridge.

As in the original issue is not a problem there, or it is? I think if there you call open(.., 'w') and then try to write bytes to it, it will fail. The fact that our MyZipFolder ignores the mode and silently always opens a byte stream is another problem.

CasperWA commented 5 years ago

(...) The fact that our MyZipFolder ignores the mode and silently always opens a byte stream is another problem.

ZipFolder does not ignore mode, it checks whether "w" is passed and returns an instance of MyWritingZipFile if true, otherwise it uses zipfile.open(..., mode=mode). In MyWritingZipFile a file is written to the zipfile using zipfile.writestr() passing a six.moves.StringIO(), so I guess at no point are we writing bytes, but rather strings. For tar we are going through the Folder class' open method, which is just a proxy for io.open(), I believe, and then the whole folder tree that has been created on the file system will be tarred and gzipped at the end.

sphuber commented 5 years ago

For tar we are going through the Folder class' open method, which is just a proxy for io.open(), I believe, and then the whole folder tree that has been created on the file system will be tarred and gzipped at the end.

This is my point though, if you call open with w but then write bytes, it is going to crash. So doing

 with folder.open('data.json', mode='w') as fhandle:
        fhandle.write(json.dumps(data).encode('utf-8'))

will perhaps work if folder sneakily is a zipfolder, but fail spectacularly if it is a normal sandbox folder.

AntimoMarrazzo commented 5 years ago

Just a follow up, with the latest version of the code the export seems to work fine. We tested a small export/import, now we have a big export file being imported by Giovanni, let's see how that goes and possibly close the issue right after.

giovannipizzi commented 5 years ago

I confirm that the file Antimo exported works correctly, I could import it and everything is there and correct. Antimo, could you please post your git diff output to make sure we apply the correct change?

AntimoMarrazzo commented 5 years ago

Sure!

 git diff aiida/tools/importexport/dbexport/__init__.py
diff --git a/aiida/tools/importexport/dbexport/__init__.py b/aiida/tools/importexport/dbexport/__init__.py
index d27b1be..8c2b5f7 100644
--- a/aiida/tools/importexport/dbexport/__init__.py
+++ b/aiida/tools/importexport/dbexport/__init__.py
@@ -414,7 +414,7 @@ def export_tree(
     # N.B. We're really calling zipfolder.open (if exporting a zipfile)
     with folder.open('data.json', mode='w') as fhandle:
         # fhandle.write(json.dumps(data, cls=UUIDEncoder))
-        fhandle.write(json.dumps(data))
+        fhandle.write(json.dumps(data).encode('utf-8'))

     # Add proper signature to unique identifiers & all_fields_info
     # Ignore if a key doesn't exist in any of the two dictionaries
sphuber commented 5 years ago

I don't understand how that diff makes it work, if anything it should fail. The handle fhandle is opened in text mode but the write call receives bytes, because the json.dumps response is encoded. This should crash... so I am confused

CasperWA commented 5 years ago

I don't understand how that diff makes it work, if anything it should fail. The handle fhandle is opened in text mode but the write call receives bytes, because the json.dumps response is encoded. This should crash... so I am confused

It does indeed fail for Py3.

giovannipizzi commented 5 years ago

Hi @CasperWA @sphuber Here what I think is a minimal example reproducing the problem:

#encoding: utf-8
import os
import six
import zipfile

from aiida.tools.importexport.dbexport.zip import ZipFolder
from aiida.common import json

utfstring = u'aà'

with ZipFolder('test.zip', mode='w', use_compression=True) as folder:
    with folder.open('data.json', mode='w') as fhandle:
        fhandle.write(json.dumps(utfstring))

Comments:

The fix reported by Antimo:

-        fhandle.write(json.dumps(data))
+        fhandle.write(json.dumps(data).encode('utf-8'))

Is clearly failing as reported for python 3, but it works and does the correct thing (!) for py2, I guess due to the confusion of bytes vs. strings in py2.

Important note: if we were to use the standard json rather than our json, this could be solved in py2 by adding ensure_ascii=True to the dumps function - instead we are using our own json that uses ensure_ascii=False, and we cannot override.

In addition, I think the the use we do internally of MyWritingZipFile complicates the situation (can we get rid of it?) because it opens the file as a string instead of the standard approach of opening a bytes stream. Maybe this was on purpose to have a common interface among Folder objects?

CasperWA commented 5 years ago

In addition, I think the the use we do internally of MyWritingZipFile complicates the situation (can we get rid of it?) because it opens the file as a string instead of the standard approach of opening a bytes stream. Maybe this was on purpose to have a common interface among Folder objects?

This is the subject of one of my private branches, and indeed also PR #3242. I will pick this up as soon as I have time.

CasperWA commented 4 years ago

https://github.com/aiidateam/aiida-core/pull/3537#issuecomment-628537815

sphuber commented 2 years ago

Closing this since it is unlikely to be relevant after the refactoring of the archive code