Data/Schema scoping - Githubissues

valadas commented 3 years ago

Currently, all schemas and the data they contain are global for the whole instance.

I believe the solution only makes sense if we can at least be able to scope it by portal. A PortalId field here would do it for the data but not for the schema. One site may want a Contact table that has 3 fields and no relations but the other site also want's a Contact table but with 10 fields and 3 relations. Those schemas would be incompatible.

For other uses cases, the instance owner may want to have a schema/data that is global for all portals and would hate this feature.

One solution to implement this without breaking the global option would be to add an optional _portalId to the table names.

So Contacts would be global for the instance, but Contacts_1 and Contacts_2 would be scoped per portal.

This leaves us with just the naming conflict between the global and portal specific ones... I am not sure what is the best way to solve this one but I think that since it's all WebAPI driven, the APIS could have a portalId in their request, if not provided they would serve the global one but if provided would serve the portal specific one.

X3Technology commented 3 years ago

100% agree with your proposed approach for table naming. The user needs to be able to specify whether the ContentType is scoped by portal or global. It's easy enough to handle this requirement within the database table naming, but I am concerned about the impact to the WebAPI endpoints .

Currently, the URL to get all Contacts looks like "/api/content/contacts".

If the request is coming from DNN, there is a portal context and it is easy to grab the PortalId and then point at the correct table. using the schema proposed above.

But, what if the request is coming from outside DNN (thinking headless CMS here)? Then, there would be no portalId in the request context.

I think the only solution would to append a portalId querystring parameter to the URL, something like "/api/content/contacts/portalid/1", which I don't love, but I'm not sure there is better option.

I am also wondering how the authentication layer might impact this. For example, if a user authenticates and is only allowed access to PortalId=3, then it would simple to point his request of "/api/content/contacts/" to the 'Contacts3' table. But what if he authenticates with a superuser account and has access to all the Contacts tables. Which one to point to then?

valadas commented 3 years ago

But, what if the request is coming from outside DNN (thinking headless CMS here)? Then, there would be no portalId in the request context.

It's still possible if the proper portal alias is used I believe... But I think I would prefer to be explicit because one may need the global one and someone else the portal specific one.

For authentication we definitely need to make that proper, I guess a host can access anything but and admin or other user would only be able to access the global data or the portal specific data just for that portal but not specific to other portals. And to make it even more complex, it may need to be different for reading and writing :) We certainly will have some fun in that are with the permission grid 😄

X3Technology commented 3 years ago

Yes I think you are right, If the portal alias is in the URL, we should be able to get the PortalId in the API request context.

Regarding the global table concept, I am also wondering about how to solve the use case where one portal's data could be shared with other portals.

In fact, maybe its not even necessary to have "global" tables. Maybe it's as simple as just making one portal's version of the table accessible to everyone.

So, a user could set everything up under portal 0, then use the permissions system to make it accessible up to all portals. Basically, using the Roles defined in the Roles table (including the global -1, -2, -3 roles). Thus, the user would be able to select specific roles from each portal and assign the proper VIEW, EDIT, etc. permission from the Permissions table. Do we need new StructuredContent API specific permissions?

As for the API route formatting, here is my thought: '/api/content/contacts' would default to the portalId determined by the context of the request (derived from portal alias).

However, all API calls would have an optional 'portalId' parameter that would override the context portalId and thus point to the table specific to the querystring portalId

So: '/api/content/contacts/portalId/3' would point to Portal 3's version of the Contacts table (necessary permissions still need to apply)

I think that's the cleanest solution and still allows us to have individual portal versions as well as shared (global) versions.

valadas commented 3 years ago

" Regarding the global table concept, I am also wondering about how to solve the use case where one portal's data could be shared with other portals.

I think we could handle this using portal groups, there is a field called "EffectivePortalId" if I recall correctly which give the root portal for that group no matter which portal in the group you came from, and then that could be used to scope multi-portal data and permissions

WillStrohl commented 3 years ago

Am I reading this correctly... When you add a new data type for a new instance of structured content, there's a new data table created in the database?

david-poindexter commented 3 years ago

Am I reading this correctly... When you add a new data type for a new instance of structured content, there's a new data table created in the database?

@hismightiness yes, when a new Content Type is created, a corresponding custom table is created to house its data.

WillStrohl commented 3 years ago

Thanks for the clarification, David. I'm not a fan of that way of doing this, but whatever the community decides is what we'll go with. :)

X3Technology commented 3 years ago

Out of curiosity, what are your objections?

Sent via the Samsung Galaxy S20 5G, an AT&T 5G smartphone Get Outlook for Androidhttps://aka.ms/AAb9ysg

From: Will Strohl @.> Sent: Monday, September 20, 2021 6:52:04 PM To: DNNCommunity/Dnn.StructuredContent @.> Cc: Jay Mathis @.>; Comment @.> Subject: Re: [DNNCommunity/Dnn.StructuredContent] Data/Schema scoping (#2)

Thanks for the clarification, David. I'm not a fan of that way of doing this, but whatever the community decides is what we'll go with. :)

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/DNNCommunity/Dnn.StructuredContent/issues/2#issuecomment-923411801, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAMDENKBTCF2ZVRQFTPZ3ADUC63JJANCNFSM5EHXSBUQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

WillStrohl commented 3 years ago

This could lead to a bloating of schema, increases the chances of orphaned data, and also presents a challenge for some of our clients that have to audit and report any schema change to their security team - regardless of how it happened.

X3Technology commented 3 years ago

Ok, so I read three objections: 1) Bloated Schema 2) Orphaned Data 3) Schema Documentation

I must assume that since you are opposed to native tables for the custom data, you would be in favor of an EAV module similar to the core UserProfile table schema or some other structured content solutions like 2Sexy (or others).

1) Bloated Schema. I don't see how the superuser (or permissioned user) having the ability to create custom tables is really is any different than the same user having the ability to install multiple custom third party modules which will install their own custom tables. If the organization doesn't want people to "bloat the schema" by creating custom tables, they can 100% turn off that ability to anyone that doesn't have the proper authority.

Further, at the end of the day, the data has to go somewhere. If it is in a custom table like this proposes or shoved into a generic EAV model, the data has to live somewhere. When you have native tables, at least the table's name gives some context to the kind of data it contains. A generic EAV table could contain anything and everything.

2) Orphaned Data. The whole premise of a relational database is to capture how things, you know, relate, to each other. Enforcing referential integrity at the database level (which this solution does) ensures that orphaned records cannot happen. Even if you tried, you couldn't create an orphaned record, because the database will prevent the delete from ever happening. If you stuff all this same data into an EAV model, the referential integrity has to be enforced at the business logic layer rather than the database level (where it belongs). It is entirely possible that a user could go directly into an EAV table and manually delete a parent record without remembering/knowing to also manually delete it's child records. The argument of orphaned data actually supports the native table model, not the EAV model.

3) Scheme Documentation. I don't have any clients with these stringent requirements, but this feels like a business requirement not a technical one. My guess is the vast, vast majority of our users don't have these requirements and we shouldn't make broad technical architecture decisions based on the business requirement of a small minority of our users. Further, there is nothing preventing the admin user from documenting the tables created in this system just as they would if they installed a custom third party module.

As food for thought, let me offer these benefits that you may not have thought about.

1) Database Optimization / Performance Using native database tables, each table can be optimized to exactly the type of data that is holds. In an EAV schema, each piece of data must be stored in a generic string field (probably and nvarchar(MAX)) and then magically converted into the proper data type in the business logic layer at run time. This has several problems. First, the conversion process back and forth from string to whatever datatype the field is supposed to be is inherently not as optimized as using native datatypes. Second, it's entirely possible that someone could manually manipulate the database (import/export?) and accidentally put in a value that the business layer won't know how to handle (like putting a date into what is supposed to be a boolean field). Third, at the storage level, putting a bit value (or a integer value or a date value) into an string field is equally non-optimal. Booleans belong in a bit field, integers belong in a byte field, etc., etc. Columns have datatypes for good reason and part of that is so the data can be optimally stored and indexed for query performance.

As an aside, I actually started this project using an EAV model. I built the entire thing to work using an EAV model; it works great. I then tested it by creating a data object that had 20 fields of various data types and loaded it with 1 million records of fake data. I then modeled the same object using native tables that could be optimized for the data it contained. I then tested the performance of querying data from both models and the performance results were not even close. The EAV model had to do 20 inner joins in order to construct the results. It was night and day. The discrete table is nothing more "Select * from XXX". No joins, no data conversions. It's not even close. One end goal for this system is to be the starting point for a headless CMS. We need a system that is optimized at the lowest possible level for performance, not a house built on sand. EAV defeats the entire point of using a relational database in the first place.

2) Ease of Use Have you ever tried to build a report that exports the UserProfile table? Or a raw SQL query that gets data out of an EAV model solution? It's not easy to build (or maintain). In fact, I don't even use the UserProfile table on any of my projects anymore. If a client needs the UserProfile functionality, I build a custom table to model exactly what they want. It's easier to understand and maintain and use.

3) Portability What if the client wants to import some existing data into the data model? Or export it to some other system? Or build a custom third party module on top of the data? Good luck if your data is in an EAV table schema... The EAV model is only useful within the context of the solution implementing it. Relative to custom tables, it is extremely difficult to use the data anywhere other than the solution it belongs too.

Now, don't get me wrong. EAV absolutely has its place, but it isn't in a solution that's primary purpose is to manage potentially large volumes of data. I fully intend to build a FormBuilder that will reuse most of the concepts found in this solution, and I will use the EAV model for that solution. However, the primary use case for a forms module is much different than a structured content solution. First, the expected data population of a forms module is much less than that found in structured content scenarios. So, performance is not nearly the concern for a forms generator solution as the primary database I/O is inserting singleton records and the occasional data dump for a report. A general use structured content solution needs to be able to handle high-traffic read operations over the entire dataset and return results as quickly as possible. Further, in a forms module scenario, native tables just seem like overkill. Second, importing/exporting or using the data outside the context of the module is also not as much of a priority. In that use case, the EAV model is absolutely a valid choice. But for what we proposing and how it potentially could be used, the EAV model is just not the right choice.

WillStrohl commented 3 years ago

Great discussion, Jay. Thanks for taking the time to write out your thoughts.

RE: Bloated Schema
When installing an extension that brings new schema with it, it's a known set of schema changes. With most clients, they would have already identified these dependencies and know they're going to happen in production before they occur, and it's usually not an overwhelming number of tables.

I have two clients I already know could never use this design because a new data type or something would trigger a security event because a new table was created by a content editor. Now, the site has to undergo an audit related to the content changes. Installation and content updates are two different levels of risk that are directly related in this scenario. In fact, if this is the pattern that's rolled out, I already know I'll need to uninstall this as part of each upgrade to prevent potential security events later.

RE: Ease of Use
I agree. Your points would be a trade-off if that approach were taken.

RE: Portability
I disagree with this point entirely as a con. We have a module developer feature called IPortable that literally uses this pattern to make content and any other serializable data portable. It's simply a matter to have a well-defined schema and logic pattern.

Your points about scale are good ones as well. There are ways to address this as well.

By the way, I'm not saying this needs to change. I'm just adding dialog based on our own clients.

DNNCommunity / Dnn.StructuredContent

Data/Schema scoping #2