Closed hhcs9527 closed 1 year ago
Hi @hhcs9527
If we just replace the to_json / from_json function with DataClassJSONMixin ones, the speed remains the same
So that I can help you, could you provide more details about how you did it? I just replaced DataClassJsonMixin
from dataclasses-json with DataClassJSONMixin
from mashumaro and got significant improvement in from_json / to_json:
mashumaro from_json 1.439956832997268
dataclasses_json from_json 21.087273874989478
mashumaro to_json 1.4008548330166377
dataclasses_json to_json 13.31320291600423
the reason I believe is the difference in how dataclasses_json implements the to_dict / from_dict
AFAIK, dataclasses-json doesn't compile loader / dumper for specific schema as mashumaro does, which is what makes it so slow.
@Fatal1ty Thanks for the help. the following example shows the concept of my problem.
I have a object from @dataclass and DataClassJsonMixin. However, due to the speed issue, we would like to transform such object to DataClassJSONMixin.
But I have no idea how to retrieve the data in DataClassJsonMixin to DataClassJSONMixin => which I believe is the main reason that DataClassJSONMixin is fast.
I've tried several ways
Here is my example for way 2. speed : my_portfolio3 (mix one) > my_portfolio (DataClassJSONMixin) > my_portfolio2 (DataClassJsonMixin)
import json
from enum import Enum
from typing import List, Any, Union, Generic, TypeVar, Tuple, cast
from dataclasses import dataclass, field
from dataclasses_json import dataclass_json, DataClassJsonMixin
from mashumaro.mixins.json import DataClassJSONMixin
from mashumaro.types import SerializationStrategy
T = TypeVar("T")
def timeDiff(obj):
import time
start = time.time()
json_str = obj.to_json()
end = time.time()
print("to_json cost:", (end - start)*10000)
start = time.time()
type(obj).from_json(json_str)
end = time.time()
print("from_json cost:", (end - start)*10000)
print()
class Currency(Enum):
USD = "USD"
EUR = "EUR"
@dataclass
class CurrencyPosition(DataClassJSONMixin):
currency: Currency
balance: float
@dataclass
class StockPosition(DataClassJSONMixin):
ticker: str
name: str
balance: int
@dataclass
class Portfolio(DataClassJSONMixin):
currencies: List[CurrencyPosition]
stocks: List[StockPosition]
my_portfolio = Portfolio(
currencies=[
CurrencyPosition(Currency.USD, 238.67),
CurrencyPosition(Currency.EUR, 361.84),
],
stocks=[
StockPosition("AAPL", "Apple", 10),
StockPosition("AMZN", "Amazon", 10),
]
)
json_string = my_portfolio.to_json()
# print("my_portfolio og: ", json_string)
# print(json_string == Portfolio.from_json(my_portfolio.to_json()).to_json())
@dataclass_json
@dataclass
class CurrencyPosition2:
currency: Currency
balance: float
@dataclass_json
@dataclass
class Portfolio2():
currencies: List[CurrencyPosition]
stocks: List[StockPosition]
my_portfolio2 = Portfolio2(
currencies=[
CurrencyPosition(Currency.USD, 238.67),
CurrencyPosition(Currency.EUR, 361.84),
],
stocks=[
StockPosition("AAPL", "Apple", 10),
StockPosition("AMZN", "Amazon", 10),
]
)
timeDiff(my_portfolio)
timeDiff(my_portfolio2)
my_portfolio3 = Portfolio2(
currencies=[
CurrencyPosition(Currency.USD, 238.67),
CurrencyPosition(Currency.EUR, 361.84),
],
stocks=[
StockPosition("AAPL", "Apple", 10),
StockPosition("AMZN", "Amazon", 10),
]
)
Portfolio2.to_dict = DataClassJSONMixin.to_dict
Portfolio2.from_dict = DataClassJSONMixin.from_dict
Portfolio2.to_json = DataClassJSONMixin.to_json
Portfolio2.from_json = DataClassJSONMixin.from_json
timeDiff(cast(DataClassJsonMixin, my_portfolio3))
Hope this helps, thanks!
Hope this helps, thanks!
Yes, thank you, this example shed some light here, but you made a few mistakes. Let me explain.
Portfolio2.to_dict = DataClassJSONMixin.to_dict Portfolio2.from_dict = DataClassJSONMixin.from_dict Portfolio2.to_json = DataClassJSONMixin.to_json Portfolio2.from_json = DataClassJSONMixin.from_json
You can't simply replace class attributes to the ones from DataClassJSONMixin
class because the base class methods from_dict
/ to_dict
do nothing, they need to be built for Portfolio2
specifically. If you add print
statement to you timeDiff
you will see that print(obj.to_json())
is {}
and print(type(obj).from_json(json_str))
is <mashumaro.mixins.json.DataClassJSONMixin object at 0x...>
which is not you were expected and which is the reason why my_portfolio3 (mix one) is the fastest in your inaccurate but still a benchmark.
@dataclass_json @dataclass class Portfolio2(): currencies: List[CurrencyPosition] stocks: List[StockPosition]
The second point. You're mixing Portfolio2 with CurrencyPosition and StockPosition not with CurrencyPosition2 and StockPosition2. This works for dataclasses_json because it fortunately assumes your dataclasses have methods with the same names that mashumaro introduces. But if you're going to mix the libraries it will definitely be error prone.
my_portfolio (DataClassJSONMixin) > my_portfolio2 (DataClassJsonMixin)
I strongly believe this is a typo and you actually didn't get this result. I run your code on my laptop with Python 3.11 and got the following numbers:
to_json cost: 0.11205673217773438
from_json cost: 0.17881393432617188
to_json cost: 2.0599365234375
from_json cost: 2.582073211669922
to_json cost: 0.030994415283203125
from_json cost: 0.030994415283203125
So, mashumaro is roughly 18x faster on to_json
and 14x faster on from_json
.
I have a object from @DataClass and DataClassJsonMixin. However, due to the speed issue, we would like to transform such object to DataClassJSONMixin.
Based on all the above I'd like to ask you, why don't you just remove @dataclass_json
decorator or DataClassJsonMixin
base usage and replace it with DataClassJSONMixin
(or even DataClassORJSONMixin
for even better performance)?
You can't simply replace class attributes to the ones from
DataClassJSONMixin
class because the base class methodsfrom_dict
/to_dict
do nothing, they need to be built forPortfolio2
specifically
Just for the reference, it's not documented and subjected to changes but specific from_dict
/ to_dict
/ from_json
/ to_json
methods can be compiled and added to Portfolio2
by using internal CodeBuilder
that is used by mashumaro mixins. So, instead of this:
Portfolio2.to_dict = DataClassJSONMixin.to_dict
Portfolio2.from_dict = DataClassJSONMixin.from_dict
Portfolio2.to_json = DataClassJSONMixin.to_json
Portfolio2.from_json = DataClassJSONMixin.from_json
You're better to do this:
from mashumaro.core.meta.code.builder import CodeBuilder
cb = CodeBuilder(Portfolio2) # if you need to call from_dict / to_dict in your code
cb.add_pack_method()
cb.add_unpack_method()
cb = CodeBuilder(
Portfolio2, format_name="json", encoder=json.dumps, decoder=json.loads
) # if you need to call from_json / to_json in your code
cb.add_pack_method()
cb.add_unpack_method()
It will give Portfolio2
the same (de)serialization speed as for Portfolio
but in real life you will be forced to do this for all dataclasses you depend on until the related issue is solved:
Hi @Fatal1ty , Thanks for the advice, I just tried the methods, and it works! (which is the way I preferred, but the performance in the real-world case still has some issues)
Not sure what you mean by the last paragraph.
Thanks for the advice, I just tried the methods, and it works! (which is the way I preferred, but the performance in the real-world case still has some issues)
Can you share more details about your performance issues?
P.S. I found your recent changes in flytekit and I haven't delved deeply into the internals of flytekit, but as far as I understand, to_python_value
/ to_literal
are called every time on values of the same type. If this is the case, then I strongly recommend not to do so, since compilation is expensive and must be performed once per data type. You should cache it somehow. With the upcoming release you will be able to check if the data type doesn't have internal method __mashumaro_from_json__
/ __mashumaro_to_json__
and compile it if needed, but it's a dirty hack that has its own cost.
Anyway, I'm still struggling to understand why you're trying to mix dataclasses_json with mashumaro in this weird way? To help me better understand your motivation, I will quote my question:
Based on all the above I'd like to ask you, why don't you just remove @dataclass_json decorator or DataClassJsonMixin base usage and replace it with DataClassJSONMixin (or even DataClassORJSONMixin for even better performance)?
I found chapter "Using Custom Python Objects" in flytekit documentation that says that you support dataclasses with dataclass_json mixed functionality. So, why not extend dataclasses support either to dataclasses with @dataclass_json
decorator (or DataClassJsonMixin
since it's interchangeable) or to dataclasses with mashumaro mixins?
You're right, we should start at the bottom to replace the DataClassJsonMixin with DataClassJSONMixin. That's what I'm going to do, thanks for the help along the way. Thanks for the kind support and suggestion, also the amazing implementation for DataClassJSONMixin!
I think this issue is resolved, thanks.
By the way, can you point out the path where you implement the mashumaro_from_json for any type? Does it mean the compile, also does the compile happen in the create object process?
Thanks for the clarification.
By the way, can you point out the path where you implement the mashumaro_from_json for any type?
To be more accurate, method __mashumaro_from_json__
is assigned only to dataclass types, not to any type. CodeBuilder constructs the code of this method line by line specifically for a dataclass that it sees for the first time here. But I should emphasize that __mashumaro_*__
methods were introduced recently and are only available in master branch at this moment. I'm going to make a release in the next few days.
Does it mean the compile, also does the compile happen in the create object process?
The moment when compilation occurs depends on the use of the library. Let me elaborate this.
from_*
/ to_*
are being compiled and assigned to the dataclass at import time. So, when you have a reference to this dataclass type or dataclass instance you can use its from_*
/ to_*
methods.from_*
/ to_*
methods. To pull this off there will be one-shot method from_*
/ to_*
that will build the real method and replace itself with it. It can be dangerous if you keep a reference to this one-shot method because you will end up with repeated compilation on every call. To avoid this it's recommended either not to save a reference to a dataclass method, or use a lambda function just in case.add_pack_method
/ add_unpack_method
your dataclass will get new methods compiled and assigned. So, it's not recommended to do it every time in the name of performance. Also, you can be interested in this issue:
While it's not done yet you can add support for dataclasses without any mixins as follows:
import json
from dataclasses import dataclass
from typing import Type, TypeVar
from mashumaro.core.meta.code.builder import CodeBuilder
T = TypeVar("T")
class JSONDecoder:
def __init__(self):
self._cache = {}
def from_json(self, type_: Type[T], data: str) -> T:
type_id = id(type_)
method = self._cache.get(type_id)
if method:
return method(data)
cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
cb.add_unpack_method()
method = getattr(type_, "__mashumaro_from_json__")
self._cache[type_id] = method
return method(data)
def from_json2(self, type_: Type[T], data: str) -> T:
method = getattr(type_, "__mashumaro_from_json__")
if method:
return method(data)
cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
cb.add_unpack_method()
method = getattr(type_, "__mashumaro_from_json__")
return method(data)
It has a small additional performance cost but is still faster than dataclasses-json:
Variant without mixins and with getattr
turned out to be even faster for json comparing to using DataClassJSONMixin
. It happens because DataClassJSONMixin
doesn't compile a specific from_json
method (as DataClassORJSONMixin does), instead it calls a specific compiled from_dict
method underhood.
without mixins (1) 1.086791625013575
without mixins (2) 1.0074725000013132
DataClassJSONMixin 1.0947887079964858
DataClassJsonMixin 13.494744666997576
I hope this will help you make the right decision.
P.S. You are free to close this issue if everything was answered.
Thanks for the explanation. Does mashumaro support generated data class from its own JSON schema?
Does mashumaro support generated data class from its own JSON schema?
If you mean dataclass code generation from JSON schema, then answer is no. It’s a job for another library as I see it.
Do you know which library? Thanks.
Do you know which library? Thanks.
Take a look at datamodel-code-generator
Is your feature request related to a problem? Please describe. Currently, our project tried to migrate the serialize function in dataclasses_json to DataClassJSONMixin. If we just replace the to_json / from_json function with DataClassJSONMixin ones, the speed remains the same, the reason I believe is the difference in how dataclasses_json implements the to_dict / from_dict.
Describe the solution you'd like It would be great to have a function to transform the dataclasses_json to DataClassJSONMixin.
Describe alternatives you've considered or Provide a general way to accept the init data
Additional context Add any other context about the feature request here.