backdrop / backdrop-issues

Issue tracker for Backdrop core.

[DX] Revert field storage back to Drupal 6 style normalized tables #56

Open timonweb opened 10 years ago

timonweb commented 10 years ago

The biggest OMG when you first look at Drupal 7's db is: "why are there so many tables?". There have been lots of debates about the Drupal db schema, but if Backdrop's target is small to mid-sized projects and better performance, there should be a movement towards getting rid of the current "table per field" approach.

There can be something similar to the Drupal 6 CCK approach: a table per bundle, and if a field is multivalued it lives in a separate table. There was a project called "PBS" (https://drupal.org/project/pbs) that tried to accomplish that goal, but it didn't work out for some reason.
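
For illustration, a rough sketch of the two layouts being compared. Table and column names are hypothetical, and the column sets are simplified (the real D7 tables also carry revision, language, and deleted columns):

-- Hypothetical D6/CCK-style layout: single-value fields live inline in a
-- per-content-type table; only the multi-value field is split out.
CREATE TABLE content_type_news (
  nid INT UNSIGNED NOT NULL,
  vid INT UNSIGNED NOT NULL,
  field_subtitle_value VARCHAR(255),  -- single-value field as a plain column
  PRIMARY KEY (vid)
);

CREATE TABLE content_field_images (
  vid INT UNSIGNED NOT NULL,
  nid INT UNSIGNED NOT NULL,
  delta INT UNSIGNED NOT NULL,        -- position within the multi-value field
  field_images_fid INT UNSIGNED,
  PRIMARY KEY (vid, delta)
);

-- D7-style layout: every field gets its own table, even single-value ones.
CREATE TABLE field_data_field_subtitle (
  entity_type VARCHAR(128) NOT NULL,
  bundle VARCHAR(128) NOT NULL,
  entity_id INT UNSIGNED NOT NULL,
  delta INT UNSIGNED NOT NULL,
  field_subtitle_value VARCHAR(255),
  PRIMARY KEY (entity_type, entity_id, delta)
);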

So what do you think about such a feature?

rudiedirkx commented 10 years ago

That would mean completely reverting the Entity principle. I'm pretty sure nobody wants that.

timonweb commented 10 years ago

No no. Drupal 7 has a storage backend API that is swappable, so this is a matter of writing a custom storage backend, something similar to what https://drupal.org/project/mongodb did. They rewrote the field storage backend and it plays nicely with the Entity API.

alexweber commented 10 years ago

@timonweb I think the bigger question here is: do we want to allow fields to be shared amongst entities? In D7 it was possible and got dropped in D8... how Fields are used will play a fundamental part in how we architect the db structure to accommodate them...

rudiedirkx commented 10 years ago

In D7 it was possible and got dropped in D8...

Eh!? I need to check up on D8... How does D8 store its field values?

davereid commented 10 years ago

So... in other words, D6 CCK is the goal? I believe D8 will store its field values in even more tables if you were used to sharing fields across different entity types.

timonweb commented 10 years ago

Maybe the title is a bit confusing, but I just wanted to float the idea of storing fields in one common table per content type/bundle until a field becomes multi-value. Something like Drupal 6 CCK did.

jcisio commented 10 years ago

It won't simplify things, it will make them more complicated. What is the problem with too many tables? JOINs are not expensive if they are done correctly.

rudiedirkx commented 10 years ago

D6's data storage was very strange: 1 table per content type with all fields, except for multi-value fields, which live partly in a different table. It might be faster, but it is strange. No forceful reuse of field settings though. D8's is useless, because it still reuses fields within an entity type.

rudiedirkx commented 10 years ago

We can't lose fieldability, so what would a page comment table look like? comment_page_data? And how would you create a View of News and Event nodes? UNION?

quicksketch commented 10 years ago

The D7 database structure was one of those areas that I thought would be damaging to Drupal 7's release. The D6 structure was a perfectly normalized database structure: one table for single-value fields per content type and a separate table for any multi-value or shared fields. It's the kind of structure you'd use if you were designing a database from scratch, without a system in front of it.

It was simplified for a reason though: the code to move data between different tables was pretty expensive and slow. It didn't put your site into offline mode (but should have), because it would pretty much lock your entire site while thousands of rows of data were being moved around. So in the end, the D6 structure was dropped because it was dangerous, hard to implement, and required the end-user to know about database practices in order to build an efficient structure. Though the end result of "everything is just inefficient" that we ended up with in D7 wasn't really the best outcome either... :(

Although I'm not recommending we necessarily take this approach... one method that could at least stem the proliferation of tables and possibly be more efficient (in some situations) is to use one (or two, actually) tables per field type, rather than per field definition. So you'd have one table for text, one for links, one for ints, one for files, one for images, etc. It keeps the number of tables down, but makes bigger tables overall. It's an idea, but I'm not sure it'd be better for performance.
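
A rough sketch of what that could look like; the table and column names below are hypothetical, not an actual proposal:

-- One shared table per field *type* rather than per field definition:
-- every text field on every entity lands in the same table, discriminated
-- by field_name. Fewer tables overall, but each table grows much larger.
CREATE TABLE field_type_text (
  entity_type VARCHAR(128) NOT NULL,
  bundle VARCHAR(128) NOT NULL,
  entity_id INT UNSIGNED NOT NULL,
  field_name VARCHAR(32) NOT NULL,  -- which text field this row belongs to
  delta INT UNSIGNED NOT NULL,
  value TEXT,
  PRIMARY KEY (entity_type, entity_id, field_name, delta)
);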

damienvancouver commented 10 years ago

The straightforward schema of D6 allowed for easy troubleshooting and data conversion by direct SQL queries. The ability to do this easily was totally lost in D7 with the one-table-per-field approach. While joins are just as fast, writing any kind of ad-hoc SQL is now a hellish process.

It would be really, really nice to be able to do a SELECT * query for an entity's fields in ad-hoc SQL without having to write something monstrous like this (to pluck my most recent example from my notes file):

/* I just want to select some fields from the supporter entity type for a data conversion query... */
FROM citcon_supporter s
  LEFT OUTER JOIN field_data_field_citcon_email e ON s.sid = e.entity_id
  LEFT OUTER JOIN field_data_field_email e2 ON s.sid = e2.entity_id
  LEFT OUTER JOIN field_data_field_dsf_email e3 ON s.sid = e3.entity_id
  LEFT OUTER JOIN field_data_field_citcon_opt_in o ON s.sid = o.entity_id
  LEFT OUTER JOIN field_data_field_citcon_first_name f ON s.sid = f.entity_id
  LEFT OUTER JOIN field_data_field_citcon_last_name l ON s.sid = l.entity_id
  LEFT OUTER JOIN field_data_field_last_name_as a ON s.sid = a.entity_id
  LEFT OUTER JOIN field_data_citcon_plugin_data_process p ON s.nid = p.entity_id
  LEFT OUTER JOIN node cn ON s.nid = cn.nid
  LEFT OUTER JOIN tfc_dsf_action_old.combined_supporter_language_lookup ll ON cn.language = ll.Langcode
WHERE (o.entity_type IS NULL OR o.entity_type = 'citcon_supporter')
  AND (e.entity_type IS NULL OR e.entity_type = 'citcon_supporter')
  AND (e2.entity_type IS NULL OR e2.entity_type = 'citcon_supporter')
  AND (e3.entity_type IS NULL OR e3.entity_type = 'citcon_supporter')
  AND (f.entity_type IS NULL OR f.entity_type = 'citcon_supporter')
  AND (a.entity_type IS NULL OR a.entity_type = 'citcon_supporter')
  AND (l.entity_type IS NULL OR l.entity_type = 'citcon_supporter')
  AND (p.entity_type IS NULL OR (p.entity_type = 'node' AND p.bundle = 'citcon_campaign'))

Whatever schema is decided upon, it would be great if it was usable by humans as well as procedural middleware.

sun commented 10 years ago

one method that could at least stem the proliferation of tables and possibly be more efficient (in some situations) is to use one (or two, actually) tables per field type, rather than per field definition. So you'd have one table for text, one for links, one for ints, one for files, one for images, etc. It keeps the number of tables down, but makes bigger tables overall.

That is - more or less - the database design of Profile module, which does not scale. But if scalability is not a goal, then that is certainly a simple and viable storage approach.

alexweber commented 10 years ago

@rudiedirkx In D8 fields can only be shared amongst bundles of the same entity, I remember reading that somewhere :)

mkalkbrenner commented 10 years ago

I remember a D6 project where we had to track who modified a "field" to which value. Instead of implementing a lot of code based on node revisions, we simply created some database triggers that logged value changes to a log table. This solution was lightweight and performance was very good, because we were able to turn off Drupal revisions completely. The downside was that if someone turned a CCK field into a shared one, the triggers broke.
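
A minimal sketch of that kind of trigger, assuming a D6-style content_type_article table with an inline field_subtitle_value column; the log table and all names are illustrative only:

-- Hypothetical audit table for field value changes.
CREATE TABLE field_value_log (
  nid INT UNSIGNED NOT NULL,
  column_name VARCHAR(64) NOT NULL,
  old_value TEXT,
  new_value TEXT,
  changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER log_article_subtitle
AFTER UPDATE ON content_type_article
FOR EACH ROW
BEGIN
  -- <=> is MySQL's NULL-safe comparison, so NULL transitions are logged too.
  IF NOT (NEW.field_subtitle_value <=> OLD.field_subtitle_value) THEN
    INSERT INTO field_value_log (nid, column_name, old_value, new_value)
    VALUES (NEW.nid, 'field_subtitle_value',
            OLD.field_subtitle_value, NEW.field_subtitle_value);
  END IF;
END//
DELIMITER ;

As described above, this only works while the field's column stays in content_type_article; the moment CCK moves a shared field into its own table, the trigger breaks.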

The same thing happens if someone writes scripts or modules that work with the database directly instead of using APIs. This can happen if you have to do things at an early stage of the Drupal bootstrap where not all APIs are available. CCK's dynamic table adjustments were the reason why we removed CCK support from ThemeKey 6.x instead of fixing the reported bugs. (In fact, we had to wait until EntityFieldQuery became available later in the Drupal 7 life cycle to provide this feature again for D7.)

So I vote against a "Drupal 6 CCK" design of shared tables. The schema has to be predictable at runtime without too much investigation when you can't use all of the APIs.

jenlampton commented 10 years ago

I'd also be in favor of returning to the Drupal 6 / CCK style field storage. When new people look at the D7 database structure it does seem to be a huge WTF, whereas D6 just made sense.

I think if we stick to loading an entire entity at a time (or multiple entities, like D7 does) and also have entity caching working on load (see https://github.com/backdrop/backdrop-issues/issues/74), then there's potential here for a real performance gain.

I vote for a move back towards sanity, and if we get a performance gain too, great! :)

jcisio commented 10 years ago

CCK-style field storage is even more WTF when mixing mono- and multi-value fields. Entity cache is a separate issue, and I don't see how it is related to the database schema.

mkalkbrenner commented 10 years ago

From a contrib developer's point of view, it would be great if D8 and Backdrop shared the same database schema!

But if we decide to go back to an "all fields of an entity in one table" approach, we must avoid splitting shared fields into a separate table, to avoid real-time copying of data if a user decides to share a field. If we need tables for shared fields, we should use database views instead. Creating and modifying db views is much faster than creating/dropping tables and copying data around!

jenlampton commented 10 years ago

Say you have a node with 10 fields on it, 5 are shared, and 5 are stand-alone.

In Drupal 6, loading 1 of these nodes would have taken 6 field queries.
In Drupal 6, loading 10 of these nodes would have taken 60 field queries.
In Drupal 6, loading 100 of these nodes would have taken 600 field queries.

In Drupal 7, loading 1 of these nodes would have taken 10 field queries.
In Drupal 7, loading 10 of these nodes would have taken 10 field queries.
In Drupal 7, loading 100 of these nodes would have taken 10 field queries.

You can see that the initial load of a node is a lot slower in D7, because the new schema always requires more queries. But loading multiple nodes got much faster, because we added node_load_multiple() - a super smart decision. Now, let's say we go back to a D6-CCK-style schema but keep node_load_multiple():

Proposed, loading 1 of these nodes would have taken 6 field queries.
Proposed, loading 10 of these nodes would have taken 6 field queries.
Proposed, loading 100 of these nodes would have taken 6 field queries.

It looks like a win-win to me, especially if we can eliminate the WTF for people checking out the schema.
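
The reason the D7 counts stay flat is that field storage fetches each field for all requested entities in one query, roughly along these lines (a simplified sketch of the kind of query D7's SQL field storage issues; column list abbreviated):

-- One query per field, regardless of how many nodes are being loaded:
SELECT entity_id, delta, field_subtitle_value
FROM field_data_field_subtitle
WHERE entity_type = 'node'
  AND entity_id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

Under the proposed D6-style schema with the same multi-load pattern, there would be one such query per table rather than per field, which is where the flat 6-query figure comes from.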

@mkalkbrenner I don't think we're going to try to keep the Backdrop schema in line with Drupal 8. It is a possibility, however, that the field storage wouldn't change from how it is in Drupal 7. Would that be more sane to you (as a contrib module developer) than reverting back to how it was in Drupal 6?

My main focus here is on DX, and our primary target markets are 1) existing contrib developers and 2) people who are new to both Drupal & Backdrop. I opened this issue because of my experience with people in group 2 - liking CCK field storage (because it's perfectly normalized). It's important to hear from group 1 too, though!

@jcisio caching is related to the database schema (only tangentially) because bad database architecture can have a huge impact on performance. As can caching.

tedbow commented 10 years ago

@jenlampton I am in group 1. I would say keeping the system the same as in Drupal 7 would be saner for me.

I think explaining the system as it is in D7 is simpler, because you don't have to explain that fields are stored differently if they have 1 vs. many values. For me, the WTF for people checking out the schema would come from looking at the CCK method and seeing 2 different ways to store fields.

Are there any advantages to the D7 method when changing a field from single to multi-value after it has a lot of data rows? I just did a little test to make sure this is possible in CCK (it's been a while). It moves the existing data from the single table-per-content-type to the table-per-field method. Seems like this could be costly when you have a content type with thousands of rows and then you want to make one (or more) fields multi-value.

mkalkbrenner commented 10 years ago

It is a possibility, however, that the field storage wouldn't change from how it is in Drupal 7. Would that be more sane to you (as a contrib module developer) than reverting back to how it was in Drupal 6?

@jenlampton the question is not what structure is easier to understand for newbies or contrib maintainers. @tedbow describes the problem with the D6 CCK structure exactly:

Seems like this could be costly when you have a content type with thousands of rows and then you want to make one (or more) fields multi-value.

That needs to be avoided!

Basically I see two possibilities to have an understandable db structure without performance issues when modifying fields:

  1. Store all fields of an entity in one table per entity, dedicated and shared fields. (modified D6 CCK)
  2. Store each field in a separate table. (mostly D7)

@jenlampton thanks for these numbers:

In Drupal 6, loading 1 of these nodes would have taken 6 field queries.
In Drupal 6, loading 10 of these nodes would have taken 60 field queries.
In Drupal 7, loading 1 of these nodes would have taken 10 field queries.
In Drupal 7, loading 10 of these nodes would have taken 10 field queries.
Proposed, loading 1 of these nodes would have taken 6 field queries.
Proposed, loading 10 of these nodes would have taken 6 field queries.

With my proposed solution 1 we will have:

Proposed, loading 1 of these nodes will take 1 field query.
Proposed, loading 10 of these nodes will take 1 field query.

For solution 2, we can leverage database views (algorithm=merge) to achieve that too:

Proposed, loading 1 of these nodes will take 1 field query.
Proposed, loading 10 of these nodes will take 1 field query.

One more thing we must be aware of is field translation. This concept didn't exist in D6. If we add the language as part of the primary key to any table based on a D6 layout, we will end up with a lot of redundant data for non-translated/language-neutral fields.

So my suggestion is solution 2: mostly stay with the D7 approach and verify whether we can simplify our code by using db views without a loss of performance. (The view could be extended to include the node table as well.)
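
A sketch of the kind of MERGE view that could present the D7-style per-field tables as one wide table per bundle. Field and view names are hypothetical, and whether the optimizer handles such views well would need benchmarking:

-- The view stores nothing; ALGORITHM = MERGE folds it into each query,
-- so reads still hit the underlying per-field tables directly.
CREATE ALGORITHM = MERGE VIEW node_article_wide AS
SELECT n.nid, n.title,
       s.field_subtitle_value,
       b.body_value
FROM node n
LEFT JOIN field_data_field_subtitle s
  ON s.entity_type = 'node' AND s.entity_id = n.nid
LEFT JOIN field_data_body b
  ON b.entity_type = 'node' AND b.entity_id = n.nid
WHERE n.type = 'article';

-- Loading the fields for a set of nodes then becomes a single query:
SELECT * FROM node_article_wide WHERE nid IN (1, 2, 3);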

rudiedirkx commented 10 years ago

Not to dis anyone, but I think D8's method is the weirdest. It does reuse fields, but only within an entity type... So config is fixed where you need it most dynamic (nodes!), and it still creates different tables for a same-config field_image in Node, User, Comment, etc.!

I'm good with the D6 method and the D7 method, but both require a very good, very simple field-to-db API, because direct field queries are very dangerous either way (especially in D6, obviously).

bojanz commented 10 years ago

Doing contrib work in D7 quickly shows the advantages of D8's no-field-sharing system.

D7 forces you to prefix your fields with the name of the entity type, because fields are global. So when my commerce_license entity needs to have a "status" field, the field is called "commerce_license_status", and I need to access it using $license->commerce_license_status. That gets old really fast, and in some cases (like commerce_license_billing) it forces you to shorten field names just to get under the field name length limit.

In D8 it is simple. A field belongs to its entity type. I get $license->status, and I'm good to go.

Besides, a "status" field that's on the node and the user at the same time has completely different semantics in most cases. Sharing fields across entity types makes very little sense from a data modeling perspective.

This is also a by-product of the unification of entity storage, a huge rewrite and step forward that barely happened after a lot of effort by many contributors. So let's not dis the things we don't understand.

I am also unsure why that is even mentioned here, when the field sharing limitations have very little to do with the actual way of database storage. The community was free to revive PBS for D7, nobody except dawehner and me tried (and we gave up when we realized how little benefit there is to it). That's telling you something.

mikemccaffrey commented 10 years ago

Personally, I think it would be great if we could eliminate field sharing altogether. Recreating an independent field for a different content/entity type takes a few extra minutes, but it eliminates any questions of whether a change to the field will affect all or just one of the instances.

jcisio commented 10 years ago

Field sharing could be useful for doing some queries across entity types/bundles (like the Taxonomy module keeps a taxonomy_index table to make aggregate queries across multiple node types easier). But I think in the end it is not worth an ugly structure. If we want to do something special (query across entity types), then denormalizing and/or caching the data is a good alternative.

maxpearl commented 10 years ago

I feel mixed. W/o field sharing, sometimes views across content types, especially with contextual filters, can get really ugly. And consistent image fields make theming easier (or so I'm told; I'm not a themer).

That said, if the architecture is cleaner, I'd be fine with getting rid of it.

And I should have just said +1 to @jcisio's comment. :-)

rudiedirkx commented 10 years ago

@bojanz If you want every entity to have a 'status' field, don't make it a Field. Make it a base table column. I'm thinking about Nodes, because those are the most used. It would be better to have no field sharing there, because every node type might have a slightly different field_owner or field_images.

I still think D7's the most sensible and workable. Think EntityFieldQuery. I could live with any version if the abstraction layer is good enough (not extensive and super abstract, but good).

ericfg commented 10 years ago

jcisio: Joins are very expensive when compared to doing complex queries on wide tables. You cannot index for joins, whereas you can create special multi-field indexes if everything is stored in one table as D6 did. I have worked on Drupal projects that required a massive increase in hosting costs when moving from D6 to D7, because all fields are separate tables (and therefore the custom indexing we had done to increase performance had to be replaced by throwing more processors and memory at the problem).
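
For example, on a D6-style wide table a single composite index can satisfy a filtered and sorted listing, which has no equivalent when the filter column and the sort column live in two separate field tables. Names below are hypothetical:

-- Possible with D6-style storage: one index spanning two fields.
CREATE INDEX idx_event_city_date
  ON content_type_event (field_city_value, field_date_value);

-- MySQL can resolve both the WHERE and the ORDER BY from that one index:
SELECT nid FROM content_type_event
WHERE field_city_value = 'Vancouver'
ORDER BY field_date_value DESC
LIMIT 20;

-- With D7-style storage the filter and the sort hit two different tables
-- (field_data_field_city, field_data_field_date), so no single index can
-- serve the combined condition; the join itself must do the work.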

jcisio commented 10 years ago

@ericfg Only if you query a lot of data. To make JOINs less expensive, you should do a simple query to get a list of (not too many) IDs first, then do JOINs with those IDs as a condition, like entity_load_multiple() does, which is efficient. Never do JOINs across huge tables only to get a few rows of result.
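
In SQL terms, the pattern looks roughly like this (hypothetical field tables, assuming an index on the value column):

-- Phase 1: a cheap query that only resolves matching entity IDs.
SELECT entity_id
FROM field_data_field_city
WHERE field_city_value = 'Vancouver'
LIMIT 50;

-- Phase 2: assemble the full rows for that small ID list, the way
-- entity_load_multiple() does with its own per-field queries.
SELECT n.nid, n.title, d.field_date_value
FROM node n
LEFT JOIN field_data_field_date d
  ON d.entity_type = 'node' AND d.entity_id = n.nid
WHERE n.nid IN (12, 34, 56);  -- the IDs returned by phase 1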

ericfg commented 10 years ago

If you have a large set of WHERE clauses, you have an expensive query if you have to do it all with joins, no matter how you break it into separate chunks as you describe. However, we're talking about using Drupal, not writing our own queries. In D6 it was possible to index for large datasets and reduce query time in a way that is 100% impossible in D7 due to the schema change.

jcisio commented 10 years ago

Hmm, I don't mean separating into chunks. If you have a large set of WHERE clauses, ORDER, etc., just fetch the entity IDs in the first query and do a second complex JOIN query with those IDs (I repeat what I've just said above). It is NOT writing manual queries.

ericfg commented 10 years ago

In a relational database such as MySQL you cannot index for joins; that is not up for debate. Whether there are strategies that can in some cases mitigate the impact of that reality is not what I was intending to discuss here, but thanks for the discussion -- it is interesting. A well-designed schema allows for creative indexing as the system gets more complex. There could be valid arguments for disregarding that, but I guess you can tell which side of this I fall on.

ericfg commented 10 years ago

rudiedirkx: I have to disagree with you; in my opinion there is nothing "strange" about having all fields in one table unless a field is intended to store multiple values. It seems to me that is consistent with the basic concepts of proper schema design. D6's schema was in line with what someone who knows databases might design (and therefore easier to debug and explore via the SQL console); D7's schema would get an F if submitted as a project in nearly any database design class.

rudiedirkx commented 10 years ago

@ericfg You're right of course. Nobody would design a schema with 1 table per field. I was so used to D7's model, I got carried away. D6's is the most logical. How would you list News and Event nodes in 1 View though, if they come from 2 different tables with some common fields?

ericfg commented 10 years ago

Off the top of my head, would a UNION query do the job? I will think about a more detailed answer, both from a Drupal 6 style schema perspective and a generic "what if we were designing this from scratch" perspective, and comment again soon.
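
Under a D6-style schema, that listing could indeed be expressed as a UNION over the per-type tables, along these lines (hypothetical shared column names):

-- Combine News and Event nodes on their common fields; each branch can
-- still use its own table's indexes before the results are merged.
SELECT n.nid, n.title, t.field_teaser_value
FROM node n JOIN content_type_news t ON t.vid = n.vid
UNION ALL
SELECT n.nid, n.title, e.field_teaser_value
FROM node n JOIN content_type_event e ON e.vid = n.vid
ORDER BY nid DESC;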

mkalkbrenner commented 10 years ago

The advantage of storing every field in a separate table and loading them separately (not using joins) is that you can potentially store a field in a different storage, like MongoDB or the file system. Personally I can think of some use cases for that, but I don't know if this scenario is targeted by Backdrop. BTW, this discussion is a good example of why neither Drupal nor Backdrop can provide a good "one-size-fits-all" solution.

quicksketch commented 10 years ago

The advantage of storing every field in a separate table and loading them separately (not using joins) is that you can potentially store a field in a different storage, like MongoDB or the file system.

The database structure is irrelevant to the ability to use an alternate storage system. In fact there was even a project that made Field module use the D6-style storage in D7: https://drupal.org/project/pbs. No matter which field storage you use, the entire object is loaded all at once on node_load(), but then if you need to do a search or view, you need to use an Entity Field Query (EFQ), rather than querying the database with SQL.

That said, I think having alternate storage backends for fields is probably squarely in the category of 1% use-cases. The https://drupal.org/project/mongodb module (which provides MongoDB storage for fields) is used on about 300 out of almost a million Drupal installs (roughly 0.03%). If removing this abstraction gets the 99% a significant performance gain, it's a good candidate for removal.

Off the top of my head, would a UNION query do the job?

I'm pretty sure this was benchmarked in the D7 cycle, where it was found that joining the tables to make a single query instead of one-per-table actually ended up being slower than the separate queries.

mkalkbrenner commented 10 years ago

The advantage of storing every field in a separate table and loading them separately (not using joins) is that you can potentially store a field in a different storage, like MongoDB or the file system.

The database structure is irrelevant to the ability to use an alternate storage system.

But not if you use a single table for all fields. That prevents storing a single field in a different storage.

But as already mentioned above, I agree with removing this kind of edge-case feature.

rlnorthcutt commented 10 years ago

Since we are looking at simplicity, ease of use, efficiency, and building for the 80%, perhaps we should do away with shared fields entirely. A single entity has a single table. Since they are in separate tables, fields could still have the same name, which could mimic shared fields and possibly make querying across multiple entities possible.

Also, as a colleague just mentioned to me, as long as we have a decent cache of current entity values (like entity cache?), then the performance issues can drop off quickly. At that point, we just need to update the cache when the entity is updated, and make sure that the DB schema is simple/sane.

The only other thing might be to have a cache for entity view modes (teaser, full, etc.), but that could be in place of, or an extension of, entity cache. You save an entity, it updates the entity table, regenerates the cache for that specific entity, and then you are good to go. Hooks can alter entities either before they are saved or before they are loaded from cache, whichever you need.

quicksketch commented 10 years ago

Thanks for the extra feedback here. I think the discussion around options for simplifying are great, even if we're not explicitly talking about reverting to the D6 style.

A single entity has a single table.

I'm not sure this would be the case, we still have to deal with multiple value fields, which would require a separate table.

quicksketch commented 9 years ago

This isn't going to happen in 1.x, so moving to 2.x.

jlfranklin commented 7 years ago

Before I start, I want to say I'm just making some mental notes here. I'm not advocating any specific suggestions.

When I got into Drupal in the D6 era, I already had a strong background in OOP. I had an expectation that content types would be like OOP objects. I thought I'd be able to build a "Person" content type as a base for "customer" and "staff" (for example). The shared field model was an interesting twist, and while it offended some of my pure OOP sensibilities, I had to admit it was pretty flexible. Whatever we move to, I'd like to preserve this flexibility.

Since then, my biggest problem building sites has been that I have two ways of working with nodes. The first option is calling node_load() and node_save(), which requires every module that might care to run its Node API hooks, triggering unnecessary SQL queries.

The other option is to run my own SQL calls, leaving out all the contributed modules, including some that might be relevant. Sometimes this means I might not have the latest values because an external process updates the node or user (think: LDAP.) Sometimes this means selecting nodes that a node access module would filter out.

I've wanted it both ways: call node_load(), but tell Drupal to skip the stuff I don't need. Excluding hooks that don't need to be called would improve DB performance, but at the expense of a much more complicated node system. Off the cuff, I'm thinking of some kind of hook_node_call_me_when() that returns field names or types of operation or something that lets node_load() mask out certain calls. Of course, any calls that are masked out need to be noted in the node object so node_save() doesn't delete or zero out data that was never loaded, and so it can be loaded later when another call to node_load() doesn't mask out the field.

mikemccaffrey commented 7 years ago

I'd like to propose that we close this issue and adopt a policy that database schema changes to existing tables should only happen in module update hooks. Normalization might look more tidy, but it seems crazy to be regularly changing the structure of common tables like node in a production database that is under load. The table changes would be a blocking operation that could take down the entire site for minutes, and if something goes amiss in the middle of the operation it might be impossible to recover.

ghost commented 4 years ago

I am no kind of a database/(My)SQL expert, but I have some questions that may help contribute to this discussion:

  1. If we forget about Drupal 6, 7 and 8, and forget about Backdrop, and we pretend we're building a brand-new, modern CMS from scratch, what database schema would we use? I wonder if trying to fit into current, or old, ways of doing things makes us miss opportunities that might otherwise be available...
  2. How does WordPress do it? Can we learn anything from them in this regard? We like to look to WordPress to see how they do things as a popular CMS, though in this situation it may not be as useful as I believe WordPress' popularity comes from their (non-technical) users and not necessarily their developers...

damienvancouver commented 4 years ago

How does WordPress do it? Can we learn anything from them in this regard? We like to look to WordPress to see how they do things as a popular CMS, though in this situation it may not be as useful as I believe WordPress' popularity comes from their (non-technical) users and not necessarily their developers...

Chiming in on the WordPress stuff... Before Drupal's corporate takeover by venture capitalists, I made my living coding Drupal sites. Now I make my living optimizing WordPress and WooCommerce sites that are hamstrung by a couple of critical flaws in WordPress's oversimplified DB schema.

We definitely must NOT follow WordPress's example. WP was developed as a simple blogging platform and its DB schema never evolved beyond that, for some confusing reason I don't get.

So in WP there are basically three tables doing everything; all the data is jumbled into these:

wp_posts: analogous to node - post ID, title, author, body, etc.
wp_postmeta: holds "metadata" for a post; this can be anything, and it is where most of the good stuff lives, including field storage when defined with "advanced custom fields" or by PHP code.
wp_options: analogous to variable, but used for so much other stuff (scheduling cron runs, for example).

The problem is that all data on the entire site is just crammed into these tables, and furthermore is not indexable at all by the database. For a simple blog site, sure. For a simple brochure site with some advanced custom fields defined, OK... still not a problem.

For wp_postmeta (where almost anything one might want to search lives in the meta_value column), the postmeta values are simply key/value pairs. Because meta_value is defined as a SQL LONGTEXT type instead of a VARCHAR type, SQL is unable to index it. Which means none of your data is indexed. Which means every query looking for data ("find me the orders belonging to the customer named Alice Bob") has to scan the ENTIRE wp_postmeta table.

Now recall that ALL the site data is jammed into this table. So maybe you are only looking for a certain post type that is really only 1% of your data, meaning you would only NEED to scan 10,000 rows of your 1,000,000-row postmeta table to get the result you want! Too bad: because they are all jumbled together, enjoy your 30-second wait for each page view...

So WP is a huge lesson in what NOT to do. The takeaways there are threefold:

  1. Do not put all post types in one table. That's dumb when you need to search and know you only want one post type but are forced to search the entire table. Even with good indexing you cannot avoid searching the entire table in some cases (for example, a wildcard search with a % at the start, e.g. WHERE meta_value LIKE '%jane%' to search for "jane"). Sometimes this can't be avoided, and when it can't, you don't need to be pointlessly searching all the rows in the site.

  2. Never have searchable data that can't be indexed. If, as in WP's case, it must have a LONGTEXT value, have two fields: meta_value_indexable VARCHAR(255) and meta_value_full LONGTEXT (see the sketch after this list). That way searches could at least happen on the first 255 chars, using indexes, and then the performance wouldn't suck (as long as you don't want that leading wildcard, as mentioned in the last point).

  3. Split metadata into different tables. The same argument applies to the individual metadata info: it should be in a table owned only by the content type you are searching.
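
A sketch of the reworked meta table from point 2, with hypothetical names; the index prefix lengths are kept short so the key fits common InnoDB limits:

-- Hypothetical meta table: a short indexable copy of the value alongside
-- the full LONGTEXT payload.
CREATE TABLE post_meta (
  meta_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  post_id BIGINT UNSIGNED NOT NULL,
  meta_key VARCHAR(255) NOT NULL,
  meta_value_indexable VARCHAR(255),  -- first 255 chars, searchable via index
  meta_value_full LONGTEXT,           -- complete value, fetched by key
  PRIMARY KEY (meta_id),
  KEY idx_key_value (meta_key(64), meta_value_indexable(191))
);

-- This lookup can use idx_key_value instead of scanning every row:
SELECT post_id FROM post_meta
WHERE meta_key = 'billing_last_name'
  AND meta_value_indexable = 'Bob';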

Drupal 6 did the best job of this, IMO. It nailed all 3 of the above points properly, and it was also possible to query it with SQL, if you knew how.

Drupal 7 went way too far down the rabbit hole with abstraction, leading to a confusing schema with no real ability to query it in real time with SQL. For a giant WooCommerce store with thousands of orders, or an intranet site with thousands of individual objects... huge problem!
And all apparently just to avoid adjusting the DB schema (tacking on new data fields) as the site grows. Which is stupid IMO, because adjusting a SQL schema is not a big deal and should not be avoided out of superstition/ignorance. SQL is awesome; people need to use it more instead of thinking a wheel reinvention is the answer (it is not).

So anyway... let's learn from WP, not by copying what they did, but by not making those dumb-ass mistakes. We'd be taking major steps back from D6 to do that though, so it's not very likely!

sun commented 4 years ago

The WordPress database table schema (post + postmeta) is very simple, yet flexible. It is easy to learn and query, even manually. In essence, it is a document store.

In fact, WordPress would be predestined to use a NoSQL backend. Interestingly, most websites (regardless of WordPress or Drupal) are using one of the many available document stores to power their on-site search anyway, because they yield faster results. If there weren't so many plugins issuing custom SQL queries (instead of using its query builder), I would have tried to run WordPress with a document store already. For the same reasons, I would recommend exactly that for Backdrop, or at least a compatible storage method when considering changing it.

An often overlooked but noteworthy detail of WordPress's meta table is that there can be multiple values (rows) per key. This enables multi-value support in many cases where the data would otherwise have to be serialized. The meta storage exists and works identically for all core entities (terms, users, comments).

The performance problem has been resolved with recent versions of MySQL/MariaDB, as they support adding a partial index on longtext columns.
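
If I read that correctly, it refers to prefix (partial) indexes, which index only the first N characters of a long column. A hypothetical example against WordPress's actual wp_postmeta table (the index name and prefix lengths are illustrative):

-- Index the first 191 characters of the LONGTEXT meta_value; equality and
-- prefix LIKE searches can then use the index instead of a full table scan.
ALTER TABLE wp_postmeta
  ADD INDEX idx_meta_key_value (meta_key(64), meta_value(191));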


However, if I interpret the recent change of the issue title correctly, then all of these considerations are off-topic here, as this issue is specifically about reverting to the special table schema and behavior of CCK in D6; i.e., fields are added as columns to the table of the content type, unless they are (a) multi-value or (b) translatable, in which case they get their own table containing only that field.

I heard there were some problems when trying to change the field configuration/schema after the fact, requiring the column to be automatically migrated into its own table, but I never experienced them myself. At minimum, this schema/behavior was (also) easy and somewhat predictable. The query builder and Views had to be aware of the field definitions at all times, though; those dependencies made their maintenance quite complex. Skipping the query builder would lead to custom database queries that suddenly break the site upon changing configuration.

klonos commented 4 years ago

Thanks @BWPanda for "stirring" this issue, and thank you @damienvancouver and @sun for your insightful comments 👍