doctrine / orm

Doctrine Object Relational Mapper (ORM)
https://www.doctrine-project.org/projects/orm.html
MIT License
9.93k stars 2.51k forks source link

TableDataGateway interface for fetching/saving entities without UnitOfWork interactions #5550

Closed Ocramius closed 1 year ago

Ocramius commented 8 years ago

While I was at SymfonyCon this year, I discussed something interesting with @lisachenko: providing a TableDataGateway-alike interface for the ORM.

The current doctrine2 approach

$entity = $someRepository->find(123);
$entity->doSomething();
$entityManager->flush();

While this approach is very practical, it also exposes some serious memory/transactional/scope limitations, as the $entityManager leaks through different service layers.

The proposed TableDataGateway approach

$entity = $someRepository->find(123);
$entity->doSomething();
$someRepository->save($entity);

The difference here is that the repository does not push the entity into any UnitOfWork: there are no flush() semantics, and calling $someRepository->save($entity); triggers an actual SQL query.

Compatibility

From what I can see, everything currently existing in the ORM could work exactly as it is working right now: just the hydration process would be simplified.

This sort of API has huge advantages when dealing with operations in:

In addition to all the above, this gives us the opportunity of examining scopes where the UnitOfWork leaked into, and to check whether we can get it out of there, and at what cost.

Ocramius commented 8 years ago

I actually just realized that having a separate UnitOfWork for each fetched resultset would provide this functionality magically and quickly. This also means that we can't use array for resultsets anymore, but that should be quite ok.

TomasVotruba commented 8 years ago

If I understand this correctly, you'd like to add save() method that would persist & flush just the passed entity?

If so, I like it.

Ocramius commented 8 years ago

@TomasVotruba pretty much, except that there is no flush(), not even internally.

With these semantics, $someRepository->find(123) !== $someRepository->find(123);

Akii commented 8 years ago

This implies some kind of version mechanism. Doesn't look nice from what I know thus far.

Ocramius commented 8 years ago

@Akii can you elaborate a bit? There are no "versions", every time you call the repository, a new query is executed...

Akii commented 8 years ago

@Ocramius my idea of an repository is aligned to DDD. So when I ask it to give me an aggregate with ID x I assume it to be the same each time I do, regardless of whatever underlying storage is used.

Ocramius commented 8 years ago

@Akii if that's the case, then you'd still use the UnitOfWork backed API.

Akii commented 8 years ago

As for my versioning concern: $a = $repo->find(1); $b = $repo->find(1);

$repo->save($a); $repo->save($b);

This will then just commit $b? (as in write a, then overwrite with b)

(I thought this was some kind of replacement for the UnitOfWork, if that's still there I don't have any objections of course)

Ocramius commented 8 years ago

This will then just commit $b? (as in write a, then overwrite with b)

Unless you use a locking mechanism (as described in http://doctrine-orm.readthedocs.org/projects/doctrine-orm/en/latest/reference/transactions-and-concurrency.html ), then yes, it will silently UPDATE

Akii commented 8 years ago

Okay, the advantages are still not quite clear to me though (no troll intended, speaking of which.. you didn't today!).

I can see that you might have less complexity but that's solved already. And okay you can diff things, but I'm sure I'm missing the most points.

alextech commented 8 years ago

Not necessarily bad, but wouldn't it require knowledge of all objects ever changed? With UnitOfWork, I write module which may internally do some entity manipulations. Coworker writes their own module that use my as foundation with some more entity mutations. To avoid DB overload we agree to just do flush() at the end of the request. Will that still be possible?

Ocramius commented 8 years ago

@Akii updated OP

@alextech yes - this separate API would be completely separate and opt-in.

alextech commented 8 years ago

In that case not bad. I like the "long running processes" part especially. It is more natural to do $repository->save($obj) where $obj came from message bus subscription, for example, where in the queue subscriber there is not really a concept of full session object graph that needs committing/flushing.

DHager commented 8 years ago

@Akii Since you mention DDD, one wrinkle I've been concerned with is that versions are per-row/entity, rather than per-aggregate-root. This means that without extra workarounds, you can have collisions between subtrees. For example, consider a Plane entity that has two Engine entities, and all of them have versions.

$planeA = $repo->find(1); // Two-engine aircraft
$planeB = $repo->find(1);

$planeA->leftEngine->setInactive();
$planeB->rightEngine->setInactive();

assert($planeA->hasPower() == true);
assert($planeB->hasPower() == true);

// No conflicts because neither individual entity was touched twice
$repo->save($planeA); 
$repo->save($planeB); 

assert($repo->find(1)->hasPower() == false); // Oh no, we're gonna crash
geerteltink commented 8 years ago

@DHager I might be missing something, but I think in your example you first save $planeA and after that $planeB, which overwrites $planeA. So at the end you have this:

assert($repo->find(1)->hasPower() == true); // It still has the left engine from $planeB

It would go wrong in your example if it would only update the changed engine state, which is not the case since you save the entire object.

Jean85 commented 8 years ago

This seems an interesting idea! The identity behavior change may not be a problem... If I understood it well,

$someRepository->find(123) !== $someRepository->find(123);

but

 $someRepository->find(123) == $someRepository->find(123);    

Right? Two different instances, but with same values?

Akii commented 8 years ago

@DHager Good point. However @xtreamwayz should be right as it says "everything is treated dirty". So cascade persist would simply overwrite left engine when persisting B.

@Jean85 My gut feeling tells me that this might not be the case due to proxies (lazy loading etc.).

@Ocramius thanks for pointing out the advantages, makes really much sense in certain areas. Migrations could potentially also benefit from it.

Jean85 commented 8 years ago

@Akii shouldn't proxies be triggered by a comparison like that? At least, the danger is to recursively hydrate a lot of related entities..

beberlei commented 8 years ago

@Ocramius Say you could have multiple unit of works, wrapped in something like a "Transaction" object, then if its possible to replace the change set computation strategy, then this might be easily possible without changing "alot" of code.

Ocramius commented 8 years ago

@beberlei another thing I just checked is that we actually may need multiple UnitOfWork instances anyway. In a "semi-detached" graph, following example doesn't really work:

class Company {
    public $employees = [];
}
class Employee {
    public $company;
}
SELECT c FROM Company c
public function testInternalReferencesAreConsistent()
{
    foreach ($tdg->select('SELECT c FROM Company c') as $company) {
        foreach ($company->employees as $employee) {
            self::assertSame($company, $employee->company);
        }
    }
}

This sort of internal references need to be consistent, of course (also across multiple hops, which aren't hydrated in the first hydration step), and to do so we'd need a "detached UnitOfWork" inside any proxy or collection inside the graph spawned by the initial query SELECT c FROM Company c.

Note that one unit of work per subgraph doesn't actually complicate things, and may actually optimize update operations as well...

Bilge commented 8 years ago

It's like I'm really using Doctrine 1.

Ocramius commented 8 years ago

Can you elaborate? Doctrine 1 is an active-record implementation: this has nothing to do with that.

BenMorel commented 5 years ago

I would also :heart: to have such an implementation that does not rely on the UoW.

Basically, when doing performance sensitive, hightly concurrent stuff, I prefer to revert to plain SQL queries to ensure that my entities are read and written exactly how and when I need it.

Also, in some batch scripts, I do not want to load a lot of objects into memory, yet I potentially want to perform a large job in a single transaction, so the UoW gets in the way, as the identity map keeps a reference to all the loaded objets, and prevents the garbage collector from doing its job. Yes, you can flush and clear the EntityManager along the way, but it's just a workaround.

I would love to be able to work with objects only (no more SQL queries in my code), and have a lower level API for reading/writing objects, while purposely losing the advantages that come with the unit of work / identity map.

Is this still in the works? Would it be a thing for Doctrine 3.0?

Ocramius commented 5 years ago

This is not currently being worked on: I have no throughput for OSS these months, as I'm mostly working to save some $$$ first.

On Thu, 15 Nov 2018, 14:42 Benjamin Morel <notifications@github.com wrote:

I would also ❤️ to have such an implementation that does not rely on the UoW.

Basically, when doing performance sensitive, hightly concurrent stuff, I prefer to revert to plain SQL queries to ensure that my entities are read and written exactly how and when I need it.

Also, in some batch scripts, I do not want to load a lot of objects into memory, yet I potentially want to perform a large job in a single transaction, so the UoW gets in the way, as the identity map keeps a reference to all the loaded objets, and prevents the garbage collector from doing its job. Yes, you can flush and clear the EntityManager along the way, but it's just a workaround.

I would love to be able to work with objects only (no more SQL queries in my code), and have a lower level API for reading/writing objects, while purposefully losing the advantages that come with the unit of work / identity map.

Is this still in the works? Would it be a thing for Doctrine 3.0?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/doctrine/doctrine2/issues/5550#issuecomment-439043750, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJakK-UgJBb1_xonWsq3H9Upl-apmcUks5uvW80gaJpZM4Gz5Cp .

BenMorel commented 5 years ago

I'd like to start playing with this idea someday.

Just one question: I'm perfectly happy with the fact that there is no "global" identity map (i.e. $repo->find(123) !== $repo->find(123)). I'm wondering, however, what should happen when loading an entity that has twice a reference to another object (in the entity itself or in an eager-loaded relationship), should the two objects be the same instance or would that not even be guaranteed?

A similar question comes to mind, when you load an object with criteria including existing entities, should the loader re-use these objects instead of loading them again and create new ones?

Example: say we have an object whose identity is a couple of objects:

class Follow
{
    /** @var User */
    private $follower;

    /** @var User */
    private $followee;
}

If I do something like:

$repo->find([
    'follower' => $user1,
    'followee' => $user2
]);

Would $follower === $user1 and $followee === $user2? Or would it load new objects, just using the existing objects' identities?

What I had in mind is that the loader/repo/whatever could have a transient identity map that only exists for the current loading operation.

This way, when an object is referenced twice in an objet's graph, they're guaranteed to be the same, for this operation only.

And when you search objects by providing other objects, as in the example above, these object instances are included in the loaded object (the loader would add them to a blank identity map before starting to load the object graph).

What do you think?

Ocramius commented 5 years ago

@BenMorel I think the identity within the set of loaded records, as well as eventual lazy-loading within it, should be preserved. The current simplistic solution is to have one UnitOfWork per loaded set, but that's very inefficient, so I didn't explore further until I got a better idea.

BenMorel commented 5 years ago

@Ocramius Thanks for your reply, we're on the same page here.

Another thing I had in mind, is where do we store the state of the entity?

For entities that have a GeneratedValue identity, it's easy: if the value is null, then the entity is NEW.

However, for entities having a natural identifier, it's tougher: we don't know if we should INSERT or UPDATE without hitting the DB first (so, we potentially have to hit the DB twice per save() call).

We can't store the state of the entity in the repo, as this would defeat the whole purpose of this feature (leaking memory); and we can hardly store the state in the entity itself (we'd be polluting the entity with dynamically created fields, which may hurt serialization and maybe other things).

Same thing for dirty checks: because we have nowhere to store the original values, we cannot compare the current values before writing to the DB, so we'd have to write all the fields every time; I'm not sure whether this is really a problem, but this is something I don't like for sure.

Do you have any ideas on this matter?

addiks commented 5 years ago

You can prevent hitting the DB twice by performing a combined INSERT & UPDATE using the ON DUPLICATE KEY UPDATE syntax having all fields set in the UPDATE clause. That way only one operation would be needed even without knowing the state in the database.

(Slightly off-topic but kind of same direction: Two years ago i experimented with changing doctrine's behavior to better handle batch-processing, transactions & rollbacks without changing the current flush semantics. It works by having one UOW per transaction-level containing the state of entities at the start of the transaction. That way it is easy to revert / clean the UOW at the end of each batch-iteration, also it encourages working with transactions. Maybe you want to take a look at that.)

BenMorel commented 5 years ago

Thanks for your thoughts, @addiks; your experiment looks really interesting!

Regarding ON DUPLICATE KEY UPDATE, I thought about this, but there are two caveats:

zmitic commented 5 years ago

I am not sure I understand; does this mean identity-map pattern will be gone?

That's a big game changer. IMP allowed me to change my entities wherever I wanted in the code. To split complex situations, I make tagged services where sometimes they will work on same entity. And if they change it, I must be sure all changes are saved to DB, not just the last one executed.

I know it is very rare and maybe my explanation is bad but I have working example. Site is crazy complex, and I can use logic only from Many2One side which means I need to trigger queries.

i.e. I can't use $entity->getAssociation() but only $assocRepository->findForEntity($entity).

If interested, I can describe the problem. But it really is crazy situation :smiley:

PS: I am not talking about optimization like if I search by ID, Doctrine will return one from memory w/o hitting database; it is amazing, but I can live w/o that. I am only interested in having identical objects all the time.


Will it be possible to keep IMP, even at the cost of performance?

Ocramius commented 5 years ago

The idea is precisely to kill IMP for unrelated aggregate operations.

Separate aggregates should be handled separately, in separate transactions and UnitOfWork instances.

On Thu, 3 Jan 2019, 05:36 Zeljko Mitic <notifications@github.com wrote:

I am not sure I understand; does this mean identity-map pattern will be gone?

That's a big game changer. IMP allowed me to change my entities wherever I wanted in the code. To split complex situations, I make tagged services where sometimes they will work on same entity. And if they change it, I must be sure all changes are saved to DB, not just the last one executed.

I know it is very rare and maybe my explanation is bad but I have working example. Site is crazy complex, and I can use logic only from Many2One side which means I need to trigger queries.

i.e. I can't use $entity->getAssociation() but only $assocRepository->findForEntity($entity).

If interested, I can describe the problem. But it really is crazy situation 😃

PS: I am not talking about optimization like if I search by ID, Doctrine will return one from memory w/o hitting database; it is amazing, but I can live w/o that. I am only interested in having identical objects all the time.

Will it be possible to keep IMP, even at the cost of performance?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/doctrine/doctrine2/issues/5550#issuecomment-451054354, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJakDcZfopXwa74s5tbGLzanCymuilGks5u_Yi1gaJpZM4Gz5Cp .

zmitic commented 5 years ago

But is it possible to keep IMP and add this as new feature? From my POV, few ms faster is not as important as having cleanly separated code.

If of any relevance, I have hobby project that persists 100 millions rows table, using only entities (no plain SQL or similar). Top speed (5 column entity) was around 9000/second, done in batches of 4000, indexes turned off till script finishes.

Which I say is pretty impressive, I would never think it would be possible in PHP. Is it that important to have faster writes with this new idea but loose flexibility of IMP, when it already is fast?

Ocramius commented 5 years ago

As I mentioned above, this is precisely about removing IMP. Each fetched record is independent from the others, by design, in order to keep state isolated between separate retrieved aggregates.

This API is not about batch operations, but about isolation of object graphs.

BenMorel commented 5 years ago

@zmitic As I understand it, this new API would not replace the current Doctrine behaviour with Unit Of Work and Identity Map, it would just be another, optional way to use Doctrine for particular use cases.

zmitic commented 5 years ago

@BenMorel I thought that at first but later comments confused me. If it is extra, that would be great, as long as UOW and IMP can be used as now.

addiks commented 5 years ago
* I'm not sure whether this is portable: do all database platforms have a similar mechanism?

I just stumbled over the answer to this, in other DBMS's this is called UPSERT or MERGE:

https://en.wikipedia.org/wiki/Merge_(SQL)

It looks like pretty much every important DBMS has this concept or something similar, but it is non-standard and every DBMS uses it's own Syntax: UPSERT / MERGE INTO / INSERT ON DUPLICATE UPDATE / INSERT ON CONFLICT / INSERT OR REPLACE / WHEN MATCHED / ...

DHager commented 5 years ago

I feel like I don't quite understand the vision behind this issue. The very-similar-sounding thing I wish existed would simply be control over switching which UoW/IdentityMap is currently being used, rather than having a very different API that bypasses the global implicit ones.

That way the underlying problem of "how much isolation is needed and between which parts" gets moved "up" the code-stack to a level of abstraction where the developer has enough domain-knowledge to decide.

This isn't a great API, but to illustrate the idea:

// Previous normal behavior
$entity = $someRepository->find(123);
$entity->doSomething();
$entityManager->flush();

// Rough cut of UoW-switching approach
$temporaryContext = $entityManager->createContext(); // Blank UoW/IdMap/etc.
$originalContext = $entityManager->switchContext(temporaryContext);
$entity = $someRepository->find(123);
$entity->doSomething();
$entityManager->flush();
$entityManager->switchContext($originalContext);

// Memory gets reclaimed when $temporaryContext goes out of scope
BenMorel commented 5 years ago

@addiks That's what I've seen too, it looks like some kind of UPSERT is available in pretty much every RDBMS. My other concern still stands, though: there is no way, at least in MySQL, to know whether it's the PRIMARY KEY that was a duplicate, or another unique key.

For example, attempting to upsert a record with a primary key and a unique email field, doing ON DUPLICATE KEY UPDATE might update a record with the same email but a different primary key, which is probably not what we want here.

As far as I can see, there is no way around this in MySQL, so the only portable way seems to be to UPDATE and check the number of matched rows, if zero then INSERT.

BenMorel commented 5 years ago

@DHager The idea here was precisely to avoid the overhead of a UnitOfWork / IdentityMap for simple use cases when you want to maximize throughput, and don't need the guarantee that entities with the same identity will be the same object. This would be reserved for basic CRUD operations.

After playing a bit with the idea though, I noticed that as soon as your entities start having relationships to each other, this starts being kind of a mess, and the hopes of simplification that this idea brought quickly vanish:

I couldn't see how much I was tied to the current Doctrine behaviour, even for the simplest use cases, until I tried to get rid of it. Lesson learned; the approach suggested here is therefore probably only good for entities with next to no relationships (only scalars), which is rarely what we have.

I think that your approach is a good one, this is pretty much what I have in mind now as well: a disposable UnitOfWork.

We can already achieve almost the same thing using EntityManager::clear(), or creating multiple EntityManager instances. I'm not sure what overhead this carries though, when doing this in batches.

Ocramius commented 5 years ago

I'm not really worried much about performance that much, but rather isolation of state between aggregates: each loaded object gets its own object graph.

BenMorel commented 5 years ago

@Ocramius

As I mentioned above, this is precisely about removing IMP. Each fetched record is independent from the others, by design, in order to keep state isolated between separate retrieved aggregates.

This API is not about batch operations, but about isolation of object graphs.

I had the same utopia regarding aggregates, but the thing is, an entity inside your aggregate might still reference another aggregate root, so you still need an identity map in case several of your entities inside your aggregate point to the same external aggregate root (plus you'll need an identity map inside your aggregate, too).

There is probably room for optimization, however, by deliberately saving a single entity, and not cascading save to other aggregates. This is in line with your last comment as well: providing isolation of state between aggregates.

But getting rid of the identity map, I think that this is not feasible.

addiks commented 5 years ago

@addiks That's what I've seen too, it looks like some kind of UPSERT is available in pretty much every RDBMS. My other concern still stands, though: there is no way, at least in MySQL, to know whether it's the PRIMARY KEY that was a duplicate, or another unique key.

I just fiddled around with it and i may have found a way, but it is not pretty:

INSERT INTO new_table
(id, name, email)
VALUES
(456, 'John Jimmy Doe', 'john.doe@example.com')
ON DUPLICATE KEY 
UPDATE name=IF(id != VALUES(id), name, VALUES(name));

Complete demonstration SQL: https://gist.github.com/addiks/b3f507a45dc12e9f6b09b7b42740c823

BenMorel commented 5 years ago

I just fiddled around with it and i may have found a way, but it is not pretty

Clever, but you still have to issue a second query if the duplicate occurs on a unique key (in which case your statement is a no-op and you want to insert another row).

ragboyjr commented 5 years ago

@Ocramius would this be somewhat similar to Stateless Sessions in Hibernate: http://docs.jboss.org/hibernate/orm/5.4/userguide/html_single/Hibernate_User_Guide.html#_statelesssession

Ocramius commented 5 years ago

Yes, seems similar

On Sun, 7 Apr 2019, 09:05 RJ Garcia, notifications@github.com wrote:

@Ocramius https://github.com/Ocramius would this be somewhat similar to Stateless Sessions in Hibernate: http://docs.jboss.org/hibernate/orm/5.4/userguide/html_single/Hibernate_User_Guide.html#_statelesssession

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/doctrine/orm/issues/5550#issuecomment-480565028, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJakGbSI78jusjendHoJk2sPeuRnsojks5veZjRgaJpZM4Gz5Cp .

ragboyjr commented 5 years ago

Awesome, ya, this type of feature would be incredibly helpful when it comes to batch processing, and also logging entities to the db. Doctrine's biggest crutch imho is not being able to gracefully handle db exceptions and the UoW has to get cleared which makes any type of batch processing a nightmare to recover from in those scenarios.

ragboyjr commented 5 years ago

This is a rough PoC, but I think something like this would suffice. I'd imagine a lot of this could be standardized more/less. I got this code from looking at how the BasicEntityPersister works. This example is for storing a mapping entity which maps ids from a host system to a different system for products and their variants.

final class MappedHostProductPersister
{
    private $em;

    public function __construct(EntityManagerInterface $em) {
        $this->em = $em;
    }

    /** @param MappedHostProduct[] $mappedProducts */
    public function save(array $mappedProducts) {
        $conn = $this->em->getConnection();
        $mhpClassMeta = $this->em->getClassMetadata(MappedHostProduct::class);
        $mhpvClassMeta = $this->em->getClassMetadata(MappedHostProductVariant::class);

        $conn->transactional(function(Connection $conn) use ($mappedProducts, $mhpClassMeta, $mhpvClassMeta) {
            foreach ($mappedProducts as $mhp) {
                [$data, $types] = $this->mappedHostProductToInsertData($mhpClassMeta, $mhp);
                $conn->insert($mhpClassMeta->getTableName(), $data, $types);
                $mhpClassMeta->setFieldValue($mhp, 'id', (int) $conn->lastInsertId());
                foreach ($mhp->getMappedVariants() as $mappedVariant) {
                    [$data, $types] = $this->mappedHostProductVariantToInsertData($mhpvClassMeta, $mappedVariant);
                    $conn->insert($mhpvClassMeta->getTableName(), $data, $types);
                    $mhpvClassMeta->setFieldValue($mappedVariant, 'id', (int) $conn->lastInsertId());
                }
            }
        });
    }

    private function mappedHostProductToInsertData(ClassMetadata $classMetadata, MappedHostProduct $mappedHostProduct) {
        $fieldsToSave = ['hostProductId', 'hostMetadata', 'createdAt'];
        $columns = $classMetadata->getColumnNames($fieldsToSave);
        $columns[] = $classMetadata->getAssociationMapping('product')['joinColumns'][0]['name'];

        return [
            array_combine($columns, [
                $mappedHostProduct->getHostProductId(),
                $mappedHostProduct->getHostMetadata(),
                $mappedHostProduct->getCreatedAt(),
                $mappedHostProduct->getProductId(),
            ]),
            array_combine($columns, array_merge(
                f\arrayMap([$classMetadata, 'getTypeOfField'], $fieldsToSave),
                [null]
            ))
        ];
    }

    private function mappedHostProductVariantToInsertData(ClassMetadata $classMetadata, MappedHostProductVariant $mappedVariant) {
        $fieldsToSave = ['hostVariantId', 'hostMetadata', 'createdAt'];
        $columns = $classMetadata->getColumnNames($fieldsToSave);
        $columns[] = $classMetadata->getAssociationMapping('productVariant')['joinColumns'][0]['name'];
        $columns[] = $classMetadata->getAssociationMapping('mappedHostProduct')['joinColumns'][0]['name'];

        return [
            array_combine($columns, [
                $mappedVariant->getHostVariantId(),
                $mappedVariant->getHostMetadata(),
                $mappedVariant->getCreatedAt(),
                $mappedVariant->getProductVariant()->getId(),
                $mappedVariant->getMappedHostProduct()->getId()
            ]),
            array_combine($columns, array_merge(
                f\arrayMap([$classMetadata, 'getTypeOfField'], $fieldsToSave),
                [null, null]
            ))
        ];
    }
}
Ocramius commented 1 year ago

Abandon all hope ye who read: I'm not working on this, like... ever.