dgasmith / opt_einsum

⚡️Optimizing einsum functions in NumPy, Tensorflow, Dask, and more with contraction order optimization.
https://dgasmith.github.io/opt_einsum/
MIT License

UnicodeEncodeError in contract_path #182

Closed ChienKaiMa closed 2 years ago

ChienKaiMa commented 2 years ago

The contraction_info function in quimb raised UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 30337: surrogates not allowed on my circuit (a simple circuit with 40000-qubit entanglement). I traced the error to the contract_path function in opt_einsum and found that the get_symbol function in parser.py can generate it. '\ud800' is a surrogate (explained in https://www.informit.com/articles/article.aspx?p=2274038&seqNum=10), so the error is raised when we try to print input_subscripts in contract_path. It could probably be solved by returning the surrogate's index instead of the character itself.

from opt_einsum.parser import get_symbol
get_symbol(55156)
#> '\ud800'
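To show why this symbol breaks printing, here is a minimal sketch in plain Python that does not need opt_einsum. The helper can_encode_utf8 is a hypothetical name introduced only for illustration. The sketch assumes the failure comes purely from encoding a lone surrogate code point (U+D800 to U+DFFF) as UTF-8, which Python's strict codec refuses.

```python
def can_encode_utf8(s):
    """Return True if the string can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Lone surrogates (U+D800..U+DFFF) end up here:
        # "surrogates not allowed"
        return False


# The code points just outside the surrogate range encode fine,
# while the first and last surrogates do not.
for code in (0xD7FF, 0xD800, 0xDFFF, 0xE000):
    print(hex(code), can_encode_utf8(chr(code)))
```

Any index that get_symbol maps into this range would hit the same error as soon as the subscripts are written to a UTF-8 stream.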

However, I'm not sure whether this solution would cause other problems, since the returned string would no longer be a single character. If returning a string of length greater than 1 is feasible, I would like to submit a pull request. If not, maybe we can skip the surrogates in get_symbol.
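The second option (skipping the surrogates) could look roughly like the sketch below. This is not the library's implementation: the function name get_symbol_skipping_surrogates and the constants are hypothetical. It also assumes, consistent with get_symbol(55156) giving '\ud800', that indices past the 52 ASCII letters are mapped to chr(i + 140).

```python
import string

# Assumptions for illustration only: 52 ASCII letters first, then
# code points starting at chr(52 + 140), as the example above suggests.
_BASE_SYMBOLS = string.ascii_letters
_OFFSET = 140
_SURROGATE_START = 0xD800
_SURROGATE_COUNT = 0xDFFF - 0xD800 + 1  # 2048 surrogate code points


def get_symbol_skipping_surrogates(i):
    """Map index i to a single-character symbol, never a surrogate."""
    if i < len(_BASE_SYMBOLS):
        return _BASE_SYMBOLS[i]
    code = i + _OFFSET
    if code >= _SURROGATE_START:
        # Jump over the whole surrogate block so the result stays
        # one character long and encodable as UTF-8.
        code += _SURROGATE_COUNT
    return chr(code)


symbol = get_symbol_skipping_surrogates(55156)
symbol.encode("utf-8")  # no UnicodeEncodeError
```

The mapping stays one-to-one because every index at or beyond the surrogate block is shifted by the same amount, and each symbol is still a single character, which avoids the multi-character concern raised above.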