Closed: Ocramius closed this issue 1 year ago.
I actually just realized that having a separate UnitOfWork
for each fetched resultset would provide this functionality magically and quickly. This also means that we can't use array
for resultsets anymore, but that should be quite ok.
If I understand this correctly, you'd like to add a save()
method that would persist & flush just the passed entity?
If so, I like it.
@TomasVotruba pretty much, except that there is no flush()
, not even internally.
With these semantics, $someRepository->find(123) !== $someRepository->find(123);
This implies some kind of version mechanism. Doesn't look nice from what I know thus far.
@Akii can you elaborate a bit? There are no "versions", every time you call the repository, a new query is executed...
@Ocramius my idea of a repository is aligned to DDD. So when I ask it to give me an aggregate with ID x, I assume it to be the same each time I do, regardless of whatever underlying storage is used.
@Akii if that's the case, then you'd still use the UnitOfWork
backed API.
As for my versioning concern: $a = $repo->find(1); $b = $repo->find(1);
$repo->save($a); $repo->save($b);
This will then just commit $b? (as in write a, then overwrite with b)
(I thought this was some kind of replacement for the UnitOfWork, if that's still there I don't have any objections of course)
This will then just commit $b? (as in write a, then overwrite with b)
Unless you use a locking mechanism (as described in http://doctrine-orm.readthedocs.org/projects/doctrine-orm/en/latest/reference/transactions-and-concurrency.html ), then yes, it will silently UPDATE
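For reference, the locking mechanism linked above boils down to adding a version field to the entity mapping. A minimal sketch in the annotation syntax of the era (class and field names are illustrative, not from this thread):

```php
<?php

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 */
class Article
{
    /** @ORM\Id @ORM\Column(type="integer") @ORM\GeneratedValue */
    private $id;

    /**
     * Optimistic lock version. Doctrine appends "WHERE version = :expected"
     * to the UPDATE and bumps the value on success; a stale write then
     * matches zero rows and an OptimisticLockException is thrown instead
     * of silently overwriting.
     *
     * @ORM\Version @ORM\Column(type="integer")
     */
    private $version;
}
```

With that in place, the second save() in the example above would fail loudly rather than overwrite $a.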
Okay, the advantages are still not quite clear to me though (no troll intended, speaking of which.. you didn't today!).
I can see that you might have less complexity, but that's solved already. And okay, you can diff things, but I'm sure I'm missing most of the points.
Not necessarily bad, but wouldn't it require knowledge of all objects ever changed? With UnitOfWork, I write a module which may internally do some entity manipulations. A coworker writes their own module that uses mine as a foundation, with some more entity mutations. To avoid DB overload we agree to just do flush() at the end of the request. Will that still be possible?
@Akii updated OP
@alextech yes - this API would be completely separate and opt-in.
In that case, not bad. I like the "long running processes" part especially. It is more natural to do $repository->save($obj) where $obj came from a message bus subscription, for example: in the queue subscriber there is not really a concept of a full session object graph that needs committing/flushing.
@Akii Since you mention DDD, one wrinkle I've been concerned with is that versions are per-row/entity, rather than per-aggregate-root. This means that without extra workarounds, you can have collisions between subtrees. For example, consider a Plane
entity that has two Engine
entities, and all of them have versions.
$planeA = $repo->find(1); // Two-engine aircraft
$planeB = $repo->find(1);
$planeA->leftEngine->setInactive();
$planeB->rightEngine->setInactive();
assert($planeA->hasPower() == true);
assert($planeB->hasPower() == true);
// No conflicts because neither individual entity was touched twice
$repo->save($planeA);
$repo->save($planeB);
assert($repo->find(1)->hasPower() == false); // Oh no, we're gonna crash
@DHager I might be missing something, but I think in your example you first save $planeA and after that $planeB, which overwrites $planeA. So at the end you have this:
assert($repo->find(1)->hasPower() == true); // It still has the left engine from $planeB
It would only go wrong in your example if it updated just the changed engine state, which is not the case since you save the entire object.
This seems like an interesting idea! The identity behavior change may not be a problem... If I understood it well,
$someRepository->find(123) !== $someRepository->find(123);
but
$someRepository->find(123) == $someRepository->find(123);
Right? Two different instances, but with same values?
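Right, that's plain PHP object comparison semantics; a self-contained sketch with a hypothetical User entity standing in for whatever find() hydrates:

```php
<?php

// == compares class and property values; === compares object instances.
final class User
{
    public $id;
    public $name;

    public function __construct($id, $name)
    {
        $this->id   = $id;
        $this->name = $name;
    }
}

$a = new User(123, 'Alice'); // first $someRepository->find(123)
$b = new User(123, 'Alice'); // second find(123): fresh hydration, same data

var_dump($a == $b);  // true: same class, equal property values
var_dump($a === $b); // false: two distinct instances
```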
@DHager Good point. However, @xtreamwayz should be right, given that "everything is treated as dirty". So cascade persist would simply overwrite the left engine when persisting B.
@Jean85 My gut feeling tells me that this might not be the case due to proxies (lazy loading etc.).
@Ocramius thanks for pointing out the advantages, makes really much sense in certain areas. Migrations could potentially also benefit from it.
@Akii shouldn't proxies be triggered by a comparison like that? At least, the danger is to recursively hydrate a lot of related entities...
@Ocramius Say you could have multiple units of work, wrapped in something like a "Transaction" object; then, if it's possible to replace the change set computation strategy, this might be easily possible without changing a lot of code.
@beberlei another thing I just checked is that we actually may need multiple UnitOfWork
instances anyway. In a "semi-detached" graph, the following example doesn't really work:
class Company {
public $employees = [];
}
class Employee {
public $company;
}
SELECT c FROM Company c
public function testInternalReferencesAreConsistent()
{
foreach ($tdg->select('SELECT c FROM Company c') as $company) {
foreach ($company->employees as $employee) {
self::assertSame($company, $employee->company);
}
}
}
These sorts of internal references need to be consistent, of course (also across multiple hops, which aren't hydrated in the first hydration step), and to do so we'd need a "detached UnitOfWork
" inside any proxy or collection inside the graph spawned by the initial query SELECT c FROM Company c
.
Note that one unit of work per subgraph doesn't actually complicate things, and may even optimize update operations as well...
It's like I'm really using Doctrine 1.
Can you elaborate? Doctrine 1 is an active-record implementation: this has nothing to do with that.
I would also :heart: to have such an implementation that does not rely on the UoW.
Basically, when doing performance-sensitive, highly concurrent stuff, I prefer to revert to plain SQL queries to ensure that my entities are read and written exactly how and when I need it.
Also, in some batch scripts, I do not want to load a lot of objects into memory, yet I potentially want to perform a large job in a single transaction, so the UoW gets in the way, as the identity map keeps a reference to all the loaded objects, and prevents the garbage collector from doing its job. Yes, you can flush and clear the EntityManager along the way, but it's just a workaround.
I would love to be able to work with objects only (no more SQL queries in my code), and have a lower level API for reading/writing objects, while purposely losing the advantages that come with the unit of work / identity map.
Is this still in the works? Would it be a thing for Doctrine 3.0?
This is not currently being worked on: I have no throughput for OSS these months, as I'm mostly working to save some $$$ first.
I'd like to start playing with this idea someday.
Just one question: I'm perfectly happy with the fact that there is no "global" identity map (i.e. $repo->find(123) !== $repo->find(123)
). I'm wondering, however, what should happen when loading an entity that holds two references to the same object (in the entity itself or in an eager-loaded relationship): should the two references point to the same instance, or would that not even be guaranteed?
A similar question comes to mind: when you load an object with criteria including existing entities, should the loader re-use these objects instead of loading them again and creating new ones?
Example: say we have an object whose identity is a couple of objects:
class Follow
{
/** @var User */
private $follower;
/** @var User */
private $followee;
}
If I do something like:
$repo->find([
'follower' => $user1,
'followee' => $user2
]);
Would $follower === $user1
and $followee === $user2
? Or would it load new objects, just using the existing objects' identities?
What I had in mind is that the loader/repo/whatever could have a transient identity map that only exists for the current loading operation.
This way, when an object is referenced twice in an object's graph, they're guaranteed to be the same, for this operation only.
And when you search objects by providing other objects, as in the example above, these object instances are included in the loaded object (the loader would add them to a blank identity map before starting to load the object graph).
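A minimal sketch of such a transient, per-operation identity map; this is an assumption about the design being discussed, not existing Doctrine code:

```php
<?php

// Lives only for the duration of one loading operation, then is discarded,
// so nothing is retained across repository calls.
final class TransientIdentityMap
{
    /** @var object[] keyed by "Class#id" */
    private $map = [];

    public function getOrCreate($class, $id, callable $hydrate)
    {
        $key = $class . '#' . $id;

        if (!isset($this->map[$key])) {
            $this->map[$key] = $hydrate();
        }

        return $this->map[$key];
    }
}

$map = new TransientIdentityMap();

// Seeding with the instances passed as criteria, as suggested above, would be:
// $map->getOrCreate(User::class, $user1->getId(), function () use ($user1) { return $user1; });

// Within one load, a doubly-referenced object hydrates only once:
$a = $map->getOrCreate('User', 1, function () { return new stdClass(); });
$b = $map->getOrCreate('User', 1, function () { return new stdClass(); });
var_dump($a === $b); // true
```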
What do you think?
@BenMorel I think the identity within the set of loaded records, as well as eventual lazy-loading within it, should be preserved. The current simplistic solution is to have one UnitOfWork
per loaded set, but that's very inefficient, so I didn't explore further until I got a better idea.
@Ocramius Thanks for your reply, we're on the same page here.
Another thing I had in mind is: where do we store the state of the entity?
For entities that have a GeneratedValue identity, it's easy: if the value is null, then the entity is NEW.
However, for entities having a natural identifier, it's tougher: we don't know if we should INSERT
or UPDATE
without hitting the DB first (so, we potentially have to hit the DB twice per save()
call).
We can't store the state of the entity in the repo, as this would defeat the whole purpose of this feature (leaking memory); and we can hardly store the state in the entity itself (we'd be polluting the entity with dynamically created fields, which may hurt serialization and maybe other things).
Same thing for dirty checks: because we have nowhere to store the original values, we cannot compare the current values before writing to the DB, so we'd have to write all the fields every time; I'm not sure whether this is really a problem, but this is something I don't like for sure.
Do you have any ideas on this matter?
You can avoid hitting the DB twice by performing a combined INSERT & UPDATE using the ON DUPLICATE KEY UPDATE syntax, with all fields set in the UPDATE clause. That way only one operation would be needed, even without knowing the state in the database.
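As a sketch of what that could look like through a DBAL connection (MySQL-only syntax; the table and column names are made up for illustration):

```php
<?php

// One round-trip: INSERT the row, or overwrite every field if a key collides.
// The caller never needs to know whether the entity was NEW or already stored.
$conn->executeUpdate(
    'INSERT INTO users (id, name, email)
     VALUES (:id, :name, :email)
     ON DUPLICATE KEY UPDATE name = VALUES(name), email = VALUES(email)',
    ['id' => $id, 'name' => $name, 'email' => $email]
);
```

Note that this is not portable as-is, and on MySQL it cannot distinguish a primary-key collision from a collision on some other unique key.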
(Slightly off-topic but in the same direction: two years ago I experimented with changing Doctrine's behavior to better handle batch-processing, transactions & rollbacks without changing the current flush semantics. It works by having one UOW per transaction level, containing the state of entities at the start of the transaction. That way it is easy to revert / clean the UOW at the end of each batch iteration; it also encourages working with transactions. Maybe you want to take a look at that.)
Thanks for your thoughts, @addiks; your experiment looks really interesting!
Regarding ON DUPLICATE KEY UPDATE
, I thought about this, but there are two caveats:
I am not sure I understand; does this mean identity-map pattern will be gone?
That's a big game changer. IMP allowed me to change my entities wherever I wanted in the code. To split complex situations, I make tagged services which will sometimes work on the same entity. And if they change it, I must be sure all changes are saved to the DB, not just the last one executed.
I know it is very rare and maybe my explanation is bad, but I have a working example. The site is crazy complex, and I can use logic only from the Many2One side, which means I need to trigger queries.
i.e. I can't use $entity->getAssociation()
but only $assocRepository->findForEntity($entity)
.
If interested, I can describe the problem. But it really is crazy situation :smiley:
PS: I am not talking about optimization like if I search by ID, Doctrine will return one from memory w/o hitting database; it is amazing, but I can live w/o that. I am only interested in having identical objects all the time.
Will it be possible to keep IMP, even at the cost of performance?
The idea is precisely to kill IMP for unrelated aggregate operations.
Separate aggregates should be handled separately, in separate transactions and UnitOfWork instances.
But is it possible to keep IMP and add this as a new feature? From my POV, a few ms faster is not as important as having cleanly separated code.
If of any relevance, I have a hobby project that persists a 100-million-row table, using only entities (no plain SQL or similar). Top speed (5-column entity) was around 9000/second, done in batches of 4000, with indexes turned off till the script finishes.
Which I say is pretty impressive; I would never have thought it would be possible in PHP. Is it that important to have faster writes with this new idea but lose the flexibility of IMP, when it already is fast?
As I mentioned above, this is precisely about removing IMP. Each fetched record is independent from the others, by design, in order to keep state isolated between separate retrieved aggregates.
This API is not about batch operations, but about isolation of object graphs.
@zmitic As I understand it, this new API would not replace the current Doctrine behaviour with Unit Of Work and Identity Map, it would just be another, optional way to use Doctrine for particular use cases.
@BenMorel I thought that at first but later comments confused me. If it is extra, that would be great, as long as UOW and IMP can be used as now.
* I'm not sure whether this is portable: do all database platforms have a similar mechanism?
I just stumbled over the answer to this; in other DBMSs this is called UPSERT or MERGE:
https://en.wikipedia.org/wiki/Merge_(SQL)
It looks like pretty much every important DBMS has this concept or something similar, but it is non-standard and every DBMS uses its own syntax: UPSERT / MERGE INTO / INSERT ... ON DUPLICATE KEY UPDATE / INSERT ... ON CONFLICT / INSERT OR REPLACE / WHEN MATCHED / ...
I feel like I don't quite understand the vision behind this issue. The very-similar-sounding thing I wish existed would simply be control over switching which UoW/IdentityMap is currently being used, rather than having a very different API that bypasses the global implicit ones.
That way the underlying problem of "how much isolation is needed and between which parts" gets moved "up" the code-stack to a level of abstraction where the developer has enough domain-knowledge to decide.
This isn't a great API, but to illustrate the idea:
// Previous normal behavior
$entity = $someRepository->find(123);
$entity->doSomething();
$entityManager->flush();
// Rough cut of UoW-switching approach
$temporaryContext = $entityManager->createContext(); // Blank UoW/IdMap/etc.
$originalContext = $entityManager->switchContext($temporaryContext);
$entity = $someRepository->find(123);
$entity->doSomething();
$entityManager->flush();
$entityManager->switchContext($originalContext);
// Memory gets reclaimed when $temporaryContext goes out of scope
@addiks That's what I've seen too, it looks like some kind of UPSERT is available in pretty much every RDBMS. My other concern still stands, though: there is no way, at least in MySQL, to know whether it's the PRIMARY KEY that was a duplicate, or another unique key.
For example, attempting to upsert a record with a primary key and a unique email field, doing ON DUPLICATE KEY UPDATE
might update a record with the same email but a different primary key, which is probably not what we want here.
As far as I can see, there is no way around this in MySQL, so the only portable way seems to be to UPDATE
and check the number of matched rows, if zero then INSERT
.
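A self-contained sketch of that portable fallback, using PDO with SQLite so it runs anywhere; in the ORM this would go through DBAL, and the two statements would need a transaction around them to close the race window between the UPDATE and the INSERT (names here are illustrative):

```php
<?php

$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)');

function save(PDO $pdo, $id, $name)
{
    // Try to UPDATE first; rowCount() tells us whether a row matched.
    $stmt = $pdo->prepare('UPDATE users SET name = ? WHERE id = ?');
    $stmt->execute([$name, $id]);

    if ($stmt->rowCount() === 0) {
        // No existing row: the entity is NEW, so INSERT it.
        $pdo->prepare('INSERT INTO users (id, name) VALUES (?, ?)')
            ->execute([$id, $name]);
    }
}

save($pdo, 1, 'Alice'); // no row matched: INSERT
save($pdo, 1, 'Bob');   // row matched: UPDATE
```

One caveat: on MySQL, an UPDATE that changes no values reports zero affected rows by default, so the connection would need PDO::MYSQL_ATTR_FOUND_ROWS for this check to be reliable there.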
@DHager The idea here was precisely to avoid the overhead of a UnitOfWork / IdentityMap for simple use cases when you want to maximize throughput, and don't need the guarantee that entities with the same identity will be the same object. This would be reserved for basic CRUD operations.
After playing a bit with the idea though, I noticed that as soon as your entities start having relationships to each other, this starts being kind of a mess, and the hopes of simplification that this idea brought quickly vanish:
For instance, you can no longer rely on entity instances being identical, so you end up adding getId() to your code: if ($a->getId() === $b->getId())
I couldn't see how much I was tied to the current Doctrine behaviour, even for the simplest use cases, until I tried to get rid of it. Lesson learned; the approach suggested here is therefore probably only good for entities with next to no relationships (only scalars), which is rarely what we have.
I think that your approach is a good one, this is pretty much what I have in mind now as well: a disposable UnitOfWork.
We can already achieve almost the same thing using EntityManager::clear()
, or creating multiple EntityManager
instances. I'm not sure what overhead this carries though, when doing this in batches.
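For reference, the flush-and-clear batching workaround mentioned here usually looks something like this; the batch size and the row-to-entity factory are illustrative assumptions:

```php
<?php

$batchSize = 500;
$i = 0;

foreach ($rows as $row) {
    $em->persist(createEntityFromRow($row)); // hypothetical factory

    if ((++$i % $batchSize) === 0) {
        $em->flush(); // write the pending changes
        $em->clear(); // detach everything so the identity map can be collected
    }
}

$em->flush(); // remaining entities from the last partial batch
$em->clear();
```

The overhead is mostly the change-set computation on each flush(); clear() itself is cheap, but it detaches every managed entity, which is exactly why it is a blunt workaround.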
I'm not really worried about performance that much, but rather about isolation of state between aggregates: each loaded object gets its own object graph.
@Ocramius
As I mentioned above, this is precisely about removing IMP. Each fetched record is independent from the others, by design, in order to keep state isolated between separate retrieved aggregates.
This API is not about batch operations, but about isolation of object graphs.
I had the same utopia regarding aggregates, but the thing is, an entity inside your aggregate might still reference another aggregate root, so you still need an identity map in case several of your entities inside your aggregate point to the same external aggregate root (plus you'll need an identity map inside your aggregate, too).
There is probably room for optimization, however, by deliberately saving a single entity, and not cascading save to other aggregates. This is in line with your last comment as well: providing isolation of state between aggregates.
But getting rid of the identity map, I think that this is not feasible.
@addiks That's what I've seen too, it looks like some kind of UPSERT is available in pretty much every RDBMS. My other concern still stands, though: there is no way, at least in MySQL, to know whether it's the PRIMARY KEY that was a duplicate, or another unique key.
I just fiddled around with it and I may have found a way, but it is not pretty:
INSERT INTO new_table
(id, name, email)
VALUES
(456, 'John Jimmy Doe', 'john.doe@example.com')
ON DUPLICATE KEY
UPDATE name=IF(id != VALUES(id), name, VALUES(name));
Complete demonstration SQL: https://gist.github.com/addiks/b3f507a45dc12e9f6b09b7b42740c823
I just fiddled around with it and I may have found a way, but it is not pretty
Clever, but you still have to issue a second query if the duplicate occurs on a unique key (in which case your statement is a no-op and you want to insert another row).
@Ocramius would this be somewhat similar to Stateless Sessions in Hibernate: http://docs.jboss.org/hibernate/orm/5.4/userguide/html_single/Hibernate_User_Guide.html#_statelesssession
Yes, seems similar
Awesome, ya, this type of feature would be incredibly helpful when it comes to batch processing, and also logging entities to the DB. Doctrine's biggest weakness, imho, is not being able to gracefully handle DB exceptions: the UoW has to get cleared, which makes any type of batch processing a nightmare to recover from in those scenarios.
This is a rough PoC, but I think something like this would suffice. I'd imagine a lot of this could be standardized more or less. I got this code from looking at how the BasicEntityPersister works. This example is for storing a mapping entity which maps IDs from a host system to a different system, for products and their variants.
final class MappedHostProductPersister
{
private $em;
public function __construct(EntityManagerInterface $em) {
$this->em = $em;
}
/** @param MappedHostProduct[] $mappedProducts */
public function save(array $mappedProducts) {
$conn = $this->em->getConnection();
$mhpClassMeta = $this->em->getClassMetadata(MappedHostProduct::class);
$mhpvClassMeta = $this->em->getClassMetadata(MappedHostProductVariant::class);
$conn->transactional(function(Connection $conn) use ($mappedProducts, $mhpClassMeta, $mhpvClassMeta) {
foreach ($mappedProducts as $mhp) {
[$data, $types] = $this->mappedHostProductToInsertData($mhpClassMeta, $mhp);
$conn->insert($mhpClassMeta->getTableName(), $data, $types);
$mhpClassMeta->setFieldValue($mhp, 'id', (int) $conn->lastInsertId());
foreach ($mhp->getMappedVariants() as $mappedVariant) {
[$data, $types] = $this->mappedHostProductVariantToInsertData($mhpvClassMeta, $mappedVariant);
$conn->insert($mhpvClassMeta->getTableName(), $data, $types);
$mhpvClassMeta->setFieldValue($mappedVariant, 'id', (int) $conn->lastInsertId());
}
}
});
}
private function mappedHostProductToInsertData(ClassMetadata $classMetadata, MappedHostProduct $mappedHostProduct) {
$fieldsToSave = ['hostProductId', 'hostMetadata', 'createdAt'];
$columns = $classMetadata->getColumnNames($fieldsToSave);
$columns[] = $classMetadata->getAssociationMapping('product')['joinColumns'][0]['name'];
return [
array_combine($columns, [
$mappedHostProduct->getHostProductId(),
$mappedHostProduct->getHostMetadata(),
$mappedHostProduct->getCreatedAt(),
$mappedHostProduct->getProductId(),
]),
array_combine($columns, array_merge(
f\arrayMap([$classMetadata, 'getTypeOfField'], $fieldsToSave),
[null]
))
];
}
private function mappedHostProductVariantToInsertData(ClassMetadata $classMetadata, MappedHostProductVariant $mappedVariant) {
$fieldsToSave = ['hostVariantId', 'hostMetadata', 'createdAt'];
$columns = $classMetadata->getColumnNames($fieldsToSave);
$columns[] = $classMetadata->getAssociationMapping('productVariant')['joinColumns'][0]['name'];
$columns[] = $classMetadata->getAssociationMapping('mappedHostProduct')['joinColumns'][0]['name'];
return [
array_combine($columns, [
$mappedVariant->getHostVariantId(),
$mappedVariant->getHostMetadata(),
$mappedVariant->getCreatedAt(),
$mappedVariant->getProductVariant()->getId(),
$mappedVariant->getMappedHostProduct()->getId()
]),
array_combine($columns, array_merge(
f\arrayMap([$classMetadata, 'getTypeOfField'], $fieldsToSave),
[null, null]
))
];
}
}
Abandon all hope ye who read: I'm not working on this, like... ever.
While I was at SymfonyCon this year, I discussed something interesting with @lisachenko: providing a TableDataGateway-alike interface for the ORM.

The current doctrine2 approach

While this approach is very practical, it also exposes some serious memory/transactional/scope limitations, as the $entityManager leaks through different service layers.

The proposed TableDataGateway approach

The difference here is that the repository does not push the entity into any UnitOfWork: there are no flush() semantics, and calling $someRepository->save($entity); triggers an actual SQL query.

Compatibility

From what I can see, everything currently existing in the ORM could work exactly as it is working right now: just the hydration process would be simplified. DQL just produces new objects, rather than searching the UnitOfWork for existing instances. Saving a graph produces INSERT/UPDATE queries for all of the already initialized sub-graph (we may examine this case further - maybe one unit of work per sub-graph?).

Invariant changes

$someRepository->find(123) !== $someRepository->find(123);

Advantages

This sort of API has huge advantages when dealing with operations in long-running processes: for instance, there is no need to persist() an object before saving it.

Cleaning up persisters/hydrators
In addition to all the above, this gives us the opportunity to examine the scopes that the UnitOfWork has leaked into, and to check whether we can get it out of there, and at what cost.