Support more complex embedded arrays

jyemin commented 2 months ago

Currently the implementation supports simple embedded arrays, e.g.

@Entity
class Movie {
    @Id
    Integer id;
    List<String> tags;
}

maps to a document like

{
   _id : 1, 
   tags : ["comedy", "drama"]
}

But MongoDB arrays can contain arbitrarily nested structure, and it is important that any Hibernate integration can take advantage of it. For example, a model like this:

@Entity
class Movie {
    @Id
    Integer id;
    List<TagsByAuthor> tagsByAuthor;
}

@Embeddable
class TagsByAuthor {
  String author;
  List<String> tags;
}

should map to a document like:

{
   _id : 1, 
   tagsByAuthor : [
      { 
         author: "firstAuthor",
         tags: ["comedy", "drama"]
      },
       { 
         author: "secondAuthor",
         tags: ["funny", "sad"]
      }
   ]
}

This will have implications for querying, as an application should be able to perform queries like:

find all movies which have a given tag
find all movies which have a given tag associated with a given author (this will require use of $elemMatch in MQL)

and also implications for inserting/updating/deleting items in the array (which will require use of $push and $pull, etc).

NathanQingyangXu commented 2 months ago

Thanks for the issue created. I'll try to see how to solve it in most elegant way.

NathanQingyangXu commented 2 months ago

Saw the issue is to be sovled in next v6.6 (currently still in development). Will keep eyes on v6.6 release to tap into it.

jyemin commented 2 months ago

Here's a reproducer: https://github.com/jyemin/JPA-MongoDB-mapping/tree/26

Note that there are two commits. In the second I update hibernate-orm to 6.6.0.CR1 to demonstrate that the behavior remains the same.

Also note that I use the ElementCollection annotation in the reproducer in order to get the test to complete and fail the assertion. However, I do like the current behavior of List<String> where no annotation is required and the array is embedded by default, and it would be nice to have the same behavior for a List of Embeddable (or a List of List of Embeddable).

jyemin commented 2 months ago

Related Hibernate issue is https://hibernate.atlassian.net/browse/HHH-15862.

The fix for it required a change to SqlAstWalker.

jyemin commented 2 months ago

Interesting: in the release announcement for 6.6, the example uses the Struct annotation on the Embeddable:

@Entity
class MyEntity {
        List<MyEmbeddable> embeddableAggregateList;
}

@Struct
@Embeddable
class MyEmbeddable { ... }

jyemin commented 2 months ago

Still using 6.6.0.CR1, I tried adding Struct annotation (and removing ElementCollection), and I get a more interesting exception:

Dialect does not support aggregateComponentAssignmentExpression: org.hibernate.dialect.aggregate.AggregateSupportImpl
java.lang.UnsupportedOperationException: Dialect does not support aggregateComponentAssignmentExpression: org.hibernate.dialect.aggregate.AggregateSupportImpl
    at org.hibernate.dialect.aggregate.AggregateSupportImpl.aggregateComponentAssignmentExpression(AggregateSupportImpl.java:40)
    at org.hibernate.boot.model.internal.AggregateComponentSecondPass.doSecondPass(AggregateComponentSecondPass.java:146)
    at org.hibernate.boot.internal.InFlightMetadataCollectorImpl.processSecondPasses(InFlightMetadataCollectorImpl.java:1842)
    at org.hibernate.boot.internal.InFlightMetadataCollectorImpl.processSecondPasses(InFlightMetadataCollectorImpl.java:1805)
    at org.hibernate.boot.model.process.spi.MetadataBuildingProcess.complete(MetadataBuildingProcess.java:334)
    at org.hibernate.boot.model.process.spi.MetadataBuildingProcess.build(MetadataBuildingProcess.java:129)
    at org.hibernate.boot.internal.MetadataBuilderImpl.build(MetadataBuilderImpl.java:449)
    at org.hibernate.boot.internal.MetadataBuilderImpl.build(MetadataBuilderImpl.java:101)
    at org.hibernate.cfg.Configuration.buildSessionFactory(Configuration.java:949)
    at org.hibernate.cfg.Configuration.buildSessionFactory(Configuration.java:999)

NathanQingyangXu commented 2 months ago

Hehe, I will take a look. Yeah, I agree no @ElementCollection usage is better so we can maintain JPA backward compatibility. That @Struct annotation could be customized to be added automatically in existing hooking point. I believe all these issues would be solved sooner or later.

On Fri, Aug 2, 2024, 11:11 a.m. Jeff Yemin @.***> wrote:

Still using 6.6.0.CR1, I tried adding Struct annotation (and removing ElementCollection), and I get a more interesting exception:

Dialect does not support aggregateComponentAssignmentExpression: org.hibernate.dialect.aggregate.AggregateSupportImpl java.lang.UnsupportedOperationException: Dialect does not support aggregateComponentAssignmentExpression: org.hibernate.dialect.aggregate.AggregateSupportImpl at org.hibernate.dialect.aggregate.AggregateSupportImpl.aggregateComponentAssignmentExpression(AggregateSupportImpl.java:40) at org.hibernate.boot.model.internal.AggregateComponentSecondPass.doSecondPass(AggregateComponentSecondPass.java:146) at org.hibernate.boot.internal.InFlightMetadataCollectorImpl.processSecondPasses(InFlightMetadataCollectorImpl.java:1842) at org.hibernate.boot.internal.InFlightMetadataCollectorImpl.processSecondPasses(InFlightMetadataCollectorImpl.java:1805) at org.hibernate.boot.model.process.spi.MetadataBuildingProcess.complete(MetadataBuildingProcess.java:334) at org.hibernate.boot.model.process.spi.MetadataBuildingProcess.build(MetadataBuildingProcess.java:129) at org.hibernate.boot.internal.MetadataBuilderImpl.build(MetadataBuilderImpl.java:449) at org.hibernate.boot.internal.MetadataBuilderImpl.build(MetadataBuilderImpl.java:101) at org.hibernate.cfg.Configuration.buildSessionFactory(Configuration.java:949) at org.hibernate.cfg.Configuration.buildSessionFactory(Configuration.java:999)

— Reply to this email directly, view it on GitHub https://github.com/NathanQingyangXu/JPA-MongoDB-mapping/issues/26#issuecomment-2265617255, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6UYAXHFO36S5372XOAJ5TZPOOTDAVCNFSM6AAAAABL3O2C4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENRVGYYTOMRVGU . You are receiving this because you commented.Message ID: @.***>

NathanQingyangXu commented 2 months ago

Still using 6.6.0.CR1, I tried adding Struct annotation (and removing ElementCollection), and I get a more interesting exception:

Dialect does not support aggregateComponentAssignmentExpression: org.hibernate.dialect.aggregate.AggregateSupportImpl
java.lang.UnsupportedOperationException: Dialect does not support aggregateComponentAssignmentExpression: org.hibernate.dialect.aggregate.AggregateSupportImpl
  at org.hibernate.dialect.aggregate.AggregateSupportImpl.aggregateComponentAssignmentExpression(AggregateSupportImpl.java:40)
  at org.hibernate.boot.model.internal.AggregateComponentSecondPass.doSecondPass(AggregateComponentSecondPass.java:146)
  at org.hibernate.boot.internal.InFlightMetadataCollectorImpl.processSecondPasses(InFlightMetadataCollectorImpl.java:1842)
  at org.hibernate.boot.internal.InFlightMetadataCollectorImpl.processSecondPasses(InFlightMetadataCollectorImpl.java:1805)
  at org.hibernate.boot.model.process.spi.MetadataBuildingProcess.complete(MetadataBuildingProcess.java:334)
  at org.hibernate.boot.model.process.spi.MetadataBuildingProcess.build(MetadataBuildingProcess.java:129)
  at org.hibernate.boot.internal.MetadataBuilderImpl.build(MetadataBuilderImpl.java:449)
  at org.hibernate.boot.internal.MetadataBuilderImpl.build(MetadataBuilderImpl.java:101)
  at org.hibernate.cfg.Configuration.buildSessionFactory(Configuration.java:949)
  at org.hibernate.cfg.Configuration.buildSessionFactory(Configuration.java:999)

This is because this struct based feature is only supported by very few vendors (currently PostgreSQL and Oracle), so if dialect doesn't support it, exception is supposed to be thrown. This new dialect feature is new thing so naturally our mongodb dialect will return default value or false without overriding the method.

Seems the Struct is for UDT in SQL, which requires beforehand UDT spec just as table spec (e.g. https://www.postgresql.org/docs/current/rowtypes.html). We might well simulate this feature in mongo dialect (no UDT spec is needed, naturally) and insert the @Struct automatically if needed during metamodel contribution spi (as what I've done for changing the @Id column name to _id), but it seems there is always another flexible way to store JSON directly, as show in the PR as well, but I guess this strict struct modeling has the benefit to support struct component column joining, so it seems it is more promising to mongo dialect.

NathanQingyangXu commented 2 months ago

we can rest assured that we don't need to worry about query capability (like $elemMatch) too much for regardless of the storage mechanism (JSON or struct simulation), when it comes to query, it is easy to render. Especially, HQL or SQM supports both in and elements() predicates (e.g. see https://docs.jboss.org/hibernate/orm/current/userguide/html_single/Hibernate_User_Guide.html#hql-elements-indices) so it should be straightforward to translate correct Mongo Bson command.

NathanQingyangXu commented 2 months ago

Here's a reproducer: https://github.com/jyemin/JPA-MongoDB-mapping/tree/26

Note that there are two commits. In the second I update hibernate-orm to 6.6.0.CR1 to demonstrate that the behavior remains the same.

Also note that I use the ElementCollection annotation in the reproducer in order to get the test to complete and fail the assertion. However, I do like the current behavior of List<String> where no annotation is required and the array is embedded by default, and it would be nice to have the same behavior for a List of Embeddable (or a List of List of Embeddable).

I fixed the persisting issue at my branch: https://github.com/NathanQingyangXu/JPA-MongoDB-mapping/tree/issue-26. There is still some minor issue if you inspect the result, but I am sure it is solvable spending a little bit more time.

Actually we could rest assured we should be 100% safe here for Mongo's array usage is very similar to Postgres's Array type (https://www.postgresql.org/docs/current/arrays.html), so if in v6.6 they are trying to fill the gap (I need to copy lots of classes with Postgres prefixes into mongo) for Postgres, it should be little difficulty to simulate the feature in Mongo (with some minor difference of syntax sugars). As I mentioned, no need to worry about processing (fetch or mutation) for SQM supports these atomic semantic predicate and it is an easy task to translate the AST tree into Mongo's JSON command.

NathanQingyangXu commented 2 months ago

I investigated and found Beikov supported the following two features in both v6.2 and v6.6:

SQL array support (v6.2)
SQL UDT (User Defined Type) support (v6.6)

To support your requirement, naturally we need both. Currently Beikov's implementation is purely driven by PostgreSQL (which is the cloest SQL to Mongo, including its GIS, upinsert, array, UDT, etc.). We need to customize his logic flow to integrate with our virtual JDBC layer (which only accepts valid JSON command). I think we need some JSON codec or serialization feature in lieu of the current PG's array type PreparedStatement parameter value.

I think basic CRUD should be fine, though with some doubts on query and mutation. Will explore further. Tons of thanks go to Beikov.

jyemin commented 2 months ago

From what I can glean, the main difference between Postgres and MongoDB when it comes to UDTs is that in Postgres the UDT itself defines both the names and the order of its fields, so the field names don't have to be included in the serialized form; they just have to be passed to the JDBC driver in expected order. But with MongoDB both the field names and their values have to be passed into the JDBC driver so that they can be serialized together with their values in a BSON document.

It's a bit odd how a table and a UDT are different in this regard (i.e. SQL insert statements include the column names), but that's how it is.

NathanQingyangXu commented 2 months ago

Exactly, that is the key difference. I've managed to implement Mongo way to include field name with the value, and I've made the persistence work today. For now I am more concerned with the operators on Mongo array. I saw Beikov implements some basic ones like "contains", "append", "prepend", etc. But currently there is no "$elemMatch" counterpart (something like "array_filter"), though it is not difficult to implement it from scratch by including it into Hibernate's grammer. Ideally it is added in upstream, not in our side.

On Sat, Aug 3, 2024, 11:18 a.m. Jeff Yemin @.***> wrote:

From what I can glean, the main difference between Postgres and MongoDB when it comes to UDTs is that in Postgres the UDT itself defines both the names and the order of its fields, so the field names don't have to be included in the serialized form; they just have to be passed to the JDBC driver in expected order. But with MongoDB both the field names and their values have to be passed into the JDBC driver so that they can be serialized together with their values in a BSON document.

It's a bit odd how a table and a UDT are different in this regard (i.e. SQL insert statements include the column names), but that's how it is.

— Reply to this email directly, view it on GitHub https://github.com/NathanQingyangXu/JPA-MongoDB-mapping/issues/26#issuecomment-2266790888, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6UYAXRJLCOBAHZZM3APMDZPTYFNAVCNFSM6AAAAABL3O2C4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENRWG44TAOBYHA . You are receiving this because you commented.Message ID: @.***>

jyemin commented 2 months ago

Would a subquery with exists have the desired intent? Something like:

from UserProfile up
where exists (
    select a from up.addresses a
    where a.state = :state and a.city = :city
)

NathanQingyangXu commented 2 months ago

It is more subtle than that and I only know more after I explored a little bit yesterday. Subquery is a basic building block in SQM and could show up anywhere an expression is allowed. In the above example, it is the sole expression of exists predicate (we could easily render any SQM predicate in Mongo), so it could end up with the "" placeholder in the following Mongo counterpart:

db.survey.find(
{ results: { $elemMatch: { $exists: *<subquery>* } } }
)

but the key part is how to render $elemMatch, not its predicate parameter. The basic semantic is

filter in any array element satisfying the predicate

so it is not an atomic operation so for the first iteration of array feature introduction it is not included.

Beikove started introducing array features in Hibernate since v6.2 and he especially introduced the following common array functions:

array_contains: whether first array expression contains one scalar value as the second parameter
array_includes: the difference with the above is its second parameter is another array; returns whether second parameter is a subset of the first parameter
array_prepend: insert scalar to the beginning of array
array_append: insert scalar to the end of array

But even for the above functions, there are big differences between them and MQL counterparts. Take array_append as an example. It is still a literal function returning a new array, but Mongo's $push will change the array field directly.

I think it boils down to some key difference between ORM and Spring Data Mongo (which supports perfectly all the native Mongo commands including the above all). ORM tries to make data persistence details hidden from Java developers and they could work on POJO objects as if everything happens in memory (but paradoxically the necessary session details are mind-blowing and super complex). So preferentially ORM develoepr will do the following steps ending up with array appending:

fetch entity from db
modify its array field in normal Java way, e.g. to allow for appending, collection field instead of array field type might have been chosen
flush session (implicitly or explicitly)

Then ORM will promise the POJO status in memory will be synced up automatically ending up with the changes made for POJO. The above array function is used for edge cases when complicated query (select query and mutation query) is required. So missing the counterparts of some MQL atomics won't matter too much.

As a digression, the whole basic session idea is based on Martin Fowler's Unit of Work concept (https://martinfowler.com/eaaCatalog/unitOfWork.html) which was mentioned at the very beginning of Hibernate's User Guide (e.g. https://docs.jboss.org/hibernate/orm/current/userguide/html_single/Hibernate_User_Guide.html#getting-started). It has many implications and CRUD developers or tools usually are fuzzy about its complexity. One of the direct implication is when at the end of Unit of Work, the db updating might conflict with other session's updating so Optimistic Locking mechanism is a must to guard against it ( https://docs.jboss.org/hibernate/orm/current/userguide/html_single/Hibernate_User_Guide.html#locking-optimistic ).

Such a programming paradigm switch is the number one benefit of the library we are trying to develop. That being said, support of $elemMatch is still desirable and I am not sure whether it is in Beikov's plan list.

On Sat, Aug 3, 2024 at 2:58 PM Jeff Yemin @.***> wrote:

Would a subquery with exists have the desired intent? Something like:

from UserProfile up where exists ( select a from up.addresses a where a.state = :state and a.city = :city )

— Reply to this email directly, view it on GitHub https://github.com/NathanQingyangXu/JPA-MongoDB-mapping/issues/26#issuecomment-2267101757, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6UYAQH76NHWB46WUE35BDZPUR6XAVCNFSM6AAAAABL3O2C4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENRXGEYDCNZVG4 . You are receiving this because you commented.Message ID: @.***>

jyemin commented 1 month ago

From what I can glean, the main difference between Postgres and MongoDB when it comes to UDTs is that in Postgres the UDT itself defines both the names and the order of its fields, so the field names don't have to be included in the serialized form; they just have to be passed to the JDBC driver in expected order. But with MongoDB both the field names and their values have to be passed into the JDBC driver so that they can be serialized together with their values in a BSON document.

I would really like to know if there's going to be anyway to work around this. It seems baked not only into Postgres but into the API of java.sql.Struct itself, which exposes an array of attributes without any attribute names.

NathanQingyangXu commented 1 month ago

I think we are not restricted by them and I am sure we can work it out. Tomorrow I will switch gear to start fully exploring it.

On Mon, Aug 26, 2024, 6:16 p.m. Jeff Yemin @.***> wrote:

From what I can glean, the main difference between Postgres and MongoDB when it comes to UDTs is that in Postgres the UDT itself defines both the names and the order of its fields, so the field names don't have to be included in the serialized form; they just have to be passed to the JDBC driver in expected order. But with MongoDB both the field names and their values have to be passed into the JDBC driver so that they can be serialized together with their values in a BSON document.

I would really like to know if there's going to be anyway to work around this. It seems baked not only into Postgres but into the API of java.sql.Struct itself, which exposes an array of attributes without any attribute names.

— Reply to this email directly, view it on GitHub https://github.com/NathanQingyangXu/jpa-mongodb-mapping/issues/26#issuecomment-2311199832, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6UYAUZG2P2YAHYPVBEMQLZTOSLDAVCNFSM6AAAAABL3O2C4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJRGE4TSOBTGI . You are receiving this because you commented.Message ID: @.***>

NathanQingyangXu commented 1 month ago

I created a DRAFT PR at https://github.com/NathanQingyangXu/jpa-mongodb-mapping/pull/40. It seems straightforward to implement and boils down to a matter of time.

Even for now, the testing result has been quite close to what we want. E.g. for the testing case, the following log statements were issued:

2024-08-27 10:48:19 [Test worker] DEBUG org.hibernate.SQL - { insert: "Movie", documents: [ { tagsByAuthor: {$undefined: true}, title: {$undefined: true}, _id: {$undefined: true} } ], comment: "insert for org.hibernate.omm.array.StructTests$Movie" }
2024-08-27 10:48:19 [Test worker] TRACE org.hibernate.orm.jdbc.bind - binding parameter (1:ARRAY) <- [[TagsByAuthor{author='Nathan', tags=[comedy, drama]}]]
2024-08-27 10:48:19 [Test worker] TRACE org.hibernate.orm.jdbc.bind - binding parameter (2:VARCHAR) <- [Forrest Gump]
2024-08-27 10:48:19 [Test worker] TRACE org.hibernate.orm.jdbc.bind - binding parameter (3:CHAR) <- [a0e21099-bad5-405b-bf95-4caa71c29913]

As you can see, currently the serialization is based on toString() output of the STRUCT Embeddable class. Maybe JSON JDBC type is more appropriate and we need to do our due dillgence to finish the JSON seralization logic (I assume straightforward).

jyemin commented 1 month ago

Nice!

For a production quality implementation, I'd be worried about using a different representation for structs (e.g JSON) than for top-level columns of tables or of top-level arrays. Looking forward to exploring the design space a bit further.

NathanQingyangXu commented 1 month ago

Yeah, we need good product design. Some JPA feature has no intuitive counterpart in MQL and vice versa, e.g.

On Tue, Aug 27, 2024, 12:53 p.m. Jeff Yemin @.***> wrote:

Nice!

For a production quality implementation, I'd be worried about using a different representation for structs (e.g JSON) than for top-level columns of tables or of top-level arrays. Looking forward to exploring the design space a bit further.

— Reply to this email directly, view it on GitHub https://github.com/NathanQingyangXu/jpa-mongodb-mapping/issues/26#issuecomment-2313070311, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6UYAQKLWCGGQMCYFSWLE3ZTSVIFAVCNFSM6AAAAABL3O2C4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJTGA3TAMZRGE . You are receiving this because you commented.Message ID: @.***>

NathanQingyangXu commented 1 month ago

Yeah, I saw JSON is not even a valid SQL type as STRUCT in java.sql.Types

On Tue, Aug 27, 2024 at 12:53 PM Jeff Yemin @.***> wrote:

Nice!

For a production quality implementation, I'd be worried about using a different representation for structs (e.g JSON) than for top-level columns of tables or of top-level arrays. Looking forward to exploring the design space a bit further.

— Reply to this email directly, view it on GitHub https://github.com/NathanQingyangXu/jpa-mongodb-mapping/issues/26#issuecomment-2313070311, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB6UYAQKLWCGGQMCYFSWLE3ZTSVIFAVCNFSM6AAAAABL3O2C4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJTGA3TAMZRGE . You are receiving this because you commented.Message ID: @.***>

NathanQingyangXu commented 1 month ago

I think I've implemented the feature as per POC critieria at https://github.com/NathanQingyangXu/jpa-mongodb-mapping/pull/40

It implementes the following features:

array + struct ad-hoc combination
observe common JPA annotations like \@Column

NathanQingyangXu / jpa-mongodb-mapping

Support more complex embedded arrays #26