ProjectQ-Framework / FermiLib

FermiLib: Open source software for analyzing fermionic quantum simulation algorithms
https://projectq.ch/
Apache License 2.0
85 stars 39 forks

Use marshal for save/load operators to improve performance #124

Closed Spaceenter closed 7 years ago

Spaceenter commented 7 years ago

@babbush @jarrodmcc @idk3

idk3 commented 7 years ago

@damiansteiger the first two points are true, but the third isn't — see e.g. https://stackoverflow.com/questions/26931919/marshal-unserialization-not-secure. marshal doesn't allow you to inject arbitrary code (as e.g. pickle / eval would), and there are no known exploits for it.

The main reasons we want to change from h5py to marshal are speed (loading is ~5x faster, saving ~10x or more) but, more significantly, memory usage. In the tests I've done saving/loading larger operators (with >10 M terms), the conversion to HDF5 blows up in memory, getting as high as ~30 GB for a 2 GB FermionOperator. marshal uses only ~8 GB for the same file by comparison (though I'm still interested in lighter solutions if you know of any!). This means that loading the kind of FermionOperators we'd be interested in saving in the first place is almost impossible with HDF5.
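For context, a minimal sketch of what a marshal-based save/load could look like, assuming the operator stores its terms as a plain dict with tuple keys (the helper names here are illustrative, not FermiLib's actual API):

```python
import marshal
import os
import tempfile

def save_terms(terms, filename):
    """Dump a terms dict (tuple keys -> numeric coefficients) with marshal.

    marshal only handles builtin types, but dicts, tuples, ints, floats
    and complex numbers are all supported, which is exactly what an
    operator's terms dict contains.
    """
    with open(filename, 'wb') as f:
        marshal.dump(terms, f)

def load_terms(filename):
    with open(filename, 'rb') as f:
        return marshal.load(f)

# Toy terms dict in the spirit of a FermionOperator: each key is a tuple
# of (mode, raise/lower) pairs, each value a coefficient.
terms = {((0, 1), (1, 0)): 0.5, ((2, 1), (3, 0)): -0.25 + 1j}

path = os.path.join(tempfile.mkdtemp(), 'operator.marshal')
save_terms(terms, path)
assert load_terms(path) == terms
```

Note the caveat discussed below in this thread: the resulting file is CPython-specific and should never be loaded from an untrusted source.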

babbush commented 7 years ago

@damiansteiger: while I appreciate your concerns, right now performance is significantly more important for us. While I see that it might be problematic for people to try to share data between different versions of python or even another language, that is much less of a concern than the fact that we cannot run calculations we need to run without this pull request, or something similar which dramatically brings down memory usage. Right now the security concerns appear to be hypothetical, as Ian points out. So before we make the new release Sunday night, we'd like to have this PR accepted.

idk3 commented 7 years ago

@Spaceenter The current failure is due to my mistake — izip isn't available in Python 3 at all, it seems. I think map and zip from future.builtins or future_builtins should do it.

edit: from future_builtins import map, zip does it.
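A common pattern for getting the lazy map/zip on both Python 2 and Python 3 (the PR may have done it slightly differently) is to fall back gracefully, since future_builtins only exists on Python 2:

```python
try:
    # Python 2: replace the list-returning builtins with lazy iterators.
    from future_builtins import map, zip
except ImportError:
    # Python 3: map and zip are already lazy iterators, nothing to do.
    pass

# Either way, zip now returns an iterator instead of a full list,
# which matters when pairing up millions of operator terms.
pairs = zip(range(3), 'abc')
assert list(pairs) == [(0, 'a'), (1, 'b'), (2, 'c')]
```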

damiansteiger commented 7 years ago

@Spaceenter Thanks for providing this solution to @babbush's and @idk3's performance problem. I am sure they will use your provided solution, as it solves their problem, especially since they don't have time to work out something different. :1st_place_medal:

@idk3 Thanks for pointing out the performance issues with the previous approach. This is extremely helpful for putting this pull request into perspective (please always add comments on why a certain pull request is made) and provides one argument for switching to marshal.

Let me explain in more detail why I have raised objections about using marshal:

  1. A Python-specific data format is, and will remain, a real problem in the ProjectQ Framework. Here is why: we decided that the ProjectQ Framework uses Python as its main language and that everything has to work in Python. However, since Python is slower than, e.g., C++, we also implement performance-critical code in C++ (e.g. the simulator in ProjectQ). This gives users a choice between the pure Python version and the faster C++ version. For example, #13 asks for a C++ implementation of FermionOperators and QubitOperators. If we choose a language-independent format, the faster C++ code can read and write files directly without going through Python, which would otherwise kill the performance.

  2. According to its own documentation, marshal is not meant to be independent of the Python version or implementation. Since the first releases of ProjectQ and FermiLib we have spent a lot of time making sure that we are Python-version independent. Just because Travis CI passes for both Python 2 and Python 3 does not guarantee that other Python implementations won't fail to read this format.

  3. If the maintainers of the marshal module think they need to warn about malicious code execution, it is fair to assume it is possible.
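Point 2 is visible directly in the standard library: marshal exposes a format version number precisely because the format has changed between Python releases. A small illustrative sketch:

```python
import marshal

# marshal.version is the serialization format version of the running
# interpreter; the docs explicitly say the format is Python-specific
# and may change in incompatible ways between releases.
payload = {'n_qubits': 4, 'n_terms': 3}

# Pinning the version at dump time at least makes the choice explicit.
data = marshal.dumps(payload, marshal.version)
assert marshal.loads(data) == payload
```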

I am not yet convinced that the performance argument outweighs any of these three points (maybe point 3 a little, but certainly not 1 and 2). Did you try the following:

  1. Make a toy example which demonstrates your performance bottleneck (maybe somebody else browsing through the repo has a solution...)
  2. Play around with the different options of h5py to increase performance, or ask an HDF5 expert...
  3. Figure out whether the performance bottleneck is actually JSON; a custom datatype that avoids the JSON round trip might be faster...
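On point 3, a toy benchmark along these lines would show whether JSON serialization itself is the bottleneck (the term layout below is illustrative and stdlib-only):

```python
import json
import marshal
import timeit

# Fake operator: many tuple-keyed terms, roughly the shape of a
# FermionOperator's terms dict.
terms = {(i, (i * 7) % 1000): float(i) for i in range(1000)}

# JSON cannot serialize tuple keys, so they must first be converted to
# strings (and parsed back on load) -- overhead that marshal avoids.
json_ready = {repr(k): v for k, v in terms.items()}

t_json = timeit.timeit(lambda: json.dumps(json_ready), number=200)
t_marshal = timeit.timeit(lambda: marshal.dumps(terms), number=200)
print('json: %.4f s, marshal: %.4f s' % (t_json, t_marshal))
```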

@babbush and @idk3 I can understand that you are under time pressure, and we are all trying our best to make sure FermiLib works for everyone. This means that we don't accept pull requests that solve a performance problem for a particular Google project while sacrificing fundamental design principles of the ProjectQ Framework that affect all other users. It also means (hopefully very rarely) that there will be longer discussions, to give everyone time to present their arguments and potentially find a better solution. Nevertheless, one great advantage of our framework being open source is that time pressure is almost never a valid argument for shortcutting proper code review, because there is always the following solution:

For the time being, you could just copy this code (which uses marshal and solves your immediate performance problem). Is there a specific reason why this util function has to live inside FermiLib? Just submit the code internally at Google, or put this one function temporarily on your own GitHub account, until we have found a solution that works for everyone.

babbush commented 7 years ago

I understand and respect your objections. However, for us to keep developing FermiLib we sometimes need to strike a compromise between functionality today and purity of engineering practice. Currently, FermiLib is implemented entirely in Python, and marshal passes for all tested versions of Python. While theoretically possible, there are no known security exploits in marshal. Simply put, that is good enough for us.

The solution is in some ways suboptimal, so it makes sense to open an issue and eventually work towards a better one. We can document what is going on with examples and wait for this hypothetical person browsing our repository to suggest something better. But the memory spikes associated with the current solution are completely unacceptable. We need a new release of FermiLib to happen Sunday night (as we discussed) and to include the marshal solution. Otherwise, we cannot check the new version into Google3, and that will cause us unacceptable delays. Considering that almost all FermiLib pull requests currently come from people involved in this project, I think this is a completely fair reason. Our projects are largely responsible for moving FermiLib forward, so we need to be practical. This solution does work for everyone who actually uses the code today. Please accept this pull request in time for the new release.

If you want to discuss further, please call me.

damiansteiger commented 7 years ago

This is an open source project, so technical discussions should happen in the open rather than behind closed doors (e.g. phone calls), so that everyone using FermiLib can see why certain design decisions were made.

Here is a compromise: I will remove my code review block so that @babbush can merge and keep his time-critical project on track. I only do this because FermiLib is still in an alpha release and users should be aware that the code can change at any time. But experience has taught me that it is much harder to get someone to write a clean solution to this problem (one free of objections 1-3) once a "(not) good enough" solution already exists for some people. Hence, @babbush, are you willing to fix this problem before we make the first release?

Spaceenter commented 7 years ago

Thanks for the discussions.

I actually have another idea to make this work without making it Python-specific. Inside Google, almost all large engineering projects use protobuf as the default choice for representing data models (complex or not), because it is designed to work across different languages and serializes far more efficiently than formats like JSON.

So instead of using native Python dictionary to represent FermionOperator and QubitOperator (and a few other objects like InteractionRDM, MoleculeOperator, etc.), we define a protobuf message for each of them. And then write the serialized protobuf binary to file. https://developers.google.com/protocol-buffers/docs/pythontutorial#parsing-and-serialization
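As a rough sketch of what such a message could look like (the field names and layout here are hypothetical, not an actual FermiLib definition — the real schema would be designed in the PR):

```proto
syntax = "proto3";

// Hypothetical sketch of a FermionOperator message.
message FermionTerm {
  // One entry per ladder operator in the term, e.g. a^dagger_2 a_3.
  repeated uint32 mode = 1;
  repeated bool raising = 2;     // true = creation operator
  double coefficient_real = 3;
  double coefficient_imag = 4;
}

message FermionOperator {
  repeated FermionTerm terms = 1;
}
```

The generated classes would then give `SerializeToString()` / `ParseFromString()` in Python and equivalent APIs in C++, which addresses the language-independence concern above.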

This is obviously more complex than marshal, and we'd need some time to develop and test. @babbush @damiansteiger

babbush commented 7 years ago

@damiansteiger I agree we should try to keep discussions open. I also appreciate your point about it being more difficult to get people to fix issues when they are "mostly working" already and not badly in need of repair (as is the case now).

I will not personally promise to fix this myself. The first reason is that I am really not the right person to do it: I am not aware of any acceptable solutions, and there are other people on this project who would be better suited (such as yourself or @Spaceenter). The second reason is that this is not a critical issue for me, whereas there are many other features I'd like to implement. But I will certainly encourage others to try to come up with a better solution before we move out of alpha. @Spaceenter, you just suggested moving to protocol buffers, which sounds like a reasonable fix to me. Is this something you could commit to trying in the next few weeks?

Hopefully this is enough for you to remove the block. I assume that if @Spaceenter agrees to help fix this then that will be good enough for you?

Spaceenter commented 7 years ago

Yes, I'll have a bit more free time later next week; I'll then try to send a PR with protobuf and work with @idk3 to test the performance. In my opinion, other people who use projectq/fermilib in the future will also benefit from protobuf, since they might use Java, Go, JS, etc. to access the operators, and protobuf makes that much easier. Anyway, we can discuss more once I actually have the PR ready. @babbush @damiansteiger @idk3

babbush commented 7 years ago

Great, thank you @Spaceenter! I agree that moving to protocol buffers makes a lot of sense for the long run. I suppose this should be enough for @damiansteiger to merge this PR.