cldf / csvw

CSV on the web
Apache License 2.0
36 stars 5 forks source link

Confusion about association table context #38

Closed Anaphory closed 4 years ago

Anaphory commented 4 years ago

In my pull request #37, I came to wonder again why csvw uses the context column approach, instead of defining separate association tables for every different context (consider EntryTable_SenseTable_Meaning instead of EntryTable_SenseTable with constant context "Meaning" for a use case close to the one I'm actually working on in CLDF). As you know, I have struggled with context before, did I get an explanation which I have just failed to internalize or have I never asked explicitly?

What actually are the use cases where one M:N relationship table that reflects two different relationships between different tables, differentiated by a context column, is better than having two different M:N relationship tables, with the context in the table name? (Or any other solution?)

I always have the prior that you, @xrotwang, have thought out the consequences of design decisions. So I expect there is a good reason so that I expect to become convinced that the current implementation is correct, but for the other case this is an important question: If it does make sense to change the behavior towards providing per-context association tables, do you think it would nonetheless be problematic to do so because the current API is now established?

xrotwang commented 4 years ago

The simple reason for this design choice (i.e. adding a context column to association tables) is that it allows to unify the handling of Source columns with that of "ordinary" list-valued foreign key columns. Source columns need context, because that's where the reference context goes (i.e. the stuff in square brackets in refs like kaping2019[pp.10-14]). Once having commited to context, it turned out that it can also be used handle the case of multiple list-valued foreign keys between the same two tables. There certainly are use cases of multiple relations between the same two tables, e.g. a loan relation will have two foreign keys to FormTable - one for source and one for target. So cases of multiple many-to-many relations between two tables don't seem to be a long shot, even though I don't recall one from the back of my head.

Anaphory commented 4 years ago

Sure – I fully understand what the context does for sources and I approve of that. I just still think that the semantics of the context column for an EntryTable_SenseTable is vastly different from an EntryTable_SourceTable.

I'm not saying two different relations between the same two tables is inconceivable, quite the opposite. We do have an example in CLDF, actually: The BorrowingTable has two relations to the FormTable, namely source form and target form. I can easily imagine people (ab)using it to describe borrowing events, which would have multiple targets, and unsure sources, described by listing the possible options. (This may not be wise, but that's a different question.) I wonder why it is better for the two different relationships to share the same association table, instead of two tables.

Let's use this borrowing table example. Say, I want to know all forms that were borrowed from Latin, and the year of that borrowing event.

If we solve this with context, we need (excuse my bad manual SQL, I hope I get my point across despite the mistakes)

SELECT FormTable.cldf_id, FormTable.cldf_languageReference, FormTable.cldf_value, BorrowingTable.date
  FROM FormTable, FormTable_BorrowingTable, BorrowingTable, FormTable_BorrowingTable as FormSourceTable_BorrowingTable, FormTable as FormSourceTable
  WHERE FormTable.cldf_id = FormTable_BorrowingTable.FormTable_cldf_id
  AND FormTable_BorrowingTable.context = "cldf_targetFormReference"
  AND FormTable_BorrowingTable.BorrowingTable_cldf_id = BorrowingTable.cldf_id
  AND FormSourceTable_BorrowingTable.BorrowingTable_cldf_id = BorrowingTable.cldf_id
  AND FormSourceTable_BorrowingTable.contexf = "cldf_sourceFormReference"
  AND FormSourceTable_BorrowingTable.FormTable_cldf_id = FormSourceTable.cldf_id
  AND FormSourceTable.languageReference = "lati1261"; 

With two tables (FormTable_BorrowingTable_cldf_sourceFormReference and FormTable_BorrowingTable_cldf_targetFormReference) that becomes

SELECT FormTable.cldf_id, FormTable.cldf_languageReference, FormTable.cldf_value, BorrowingTable.date
  FROM FormTable
  INNER JOIN FormTable_BorrowingTable_cldf_targetFormReference
    ON FormTable_BorrowingTable_cldf_targetFormReference.FormTable_cldf_id =
      FormTable.cldf_id
  INNER JOIN BorrowingTable
    ON FormTable_BorrowingTable_cldf_targetFormReference.BorrowingTable_cldf_id =
      BorrowingTable.cldf_id
  INNER JOIN FormTable_BorrowingTable_cldf_sourceFormReference
    ON FormTable_BorrowingTable_cldf_sourceFormReference.BorrowingTable_cldf_id =
      BorrowingTable.cldf_id
  INNER JOIN FormTable AS FormSourceTable
    ON FormTable_BorrowingTable_cldf_sourceFormReference.FormTable_cldf_id =
      FormSourceTable.cldf_id
  WHERE FormSourceTable.languageReference = "lati1261";

Isn't the second form less space-consuming on disk (no need to store a string for every pair), more space-consuming in the query (explicit INNER JOIN instead of taking the cross product and then thinning it down), and a query of the type databases should be optimized for and therefore faster?

xrotwang commented 4 years ago

Can't you simply add a "and context = ..." to the on clause of the join in your second statement and have this run on the current schema? Overall, I don't see this as convincing enough to introduce a backwards incompatible change.

Gereon Kaiping notifications@github.com schrieb am Di., 26. Mai 2020, 20:34:

Sure – I fully understand what the context does for sources and I approve of that. I just still think that the semantics of the context column for an EntryTable_SenseTable is vastly different from an EntryTable_SourceTable.

I'm not saying two different relations between the same two tables is inconceivable, quite the opposite. We do have an example in CLDF, actually: The BorrowingTable has two relations to the FormTable, namely source form and target form. I can easily imagine people (ab)using it to describe borrowing events, which would have multiple targets, and unsure sources, described by listing the possible options. (This may not be wise, but that's a different question.) I wonder why it is better for the two different relationships to share the same association table, instead of two tables.

Let's use this borrowing table example. Say, I want to know all forms that were borrowed from Latin, and the year of that borrowing event.

If we solve this with context, we need (excuse my bad manual SQL, I hope I get my point across despite the mistakes)

SELECT FormTable.cldf_id, FormTable.cldf_languageReference, FormTable.cldf_value, BorrowingTable.date

FROM FormTable, FormTable_BorrowingTable, BorrowingTable, FormTable_BorrowingTable as FormSourceTable_BorrowingTable, FormTable as FormSourceTable

WHERE FormTable.cldf_id = FormTable_BorrowingTable.FormTable_cldf_id

AND FormTable_BorrowingTable.context = "cldf_targetFormReference"

AND FormTable_BorrowingTable.BorrowingTable_cldf_id = BorrowingTable.cldf_id

AND FormSourceTable_BorrowingTable.BorrowingTable_cldf_id = BorrowingTable.cldf_id

AND FormSourceTable_BorrowingTable.contexf = "cldf_sourceFormReference"

AND FormSourceTable_BorrowingTable.FormTable_cldf_id = FormSourceTable.cldf_id

AND FormSourceTable.languageReference = "lati1261";

With two tables (FormTable_BorrowingTable_cldf_sourceFormReference and FormTable_BorrowingTable_cldf_targetFormReference) that becomes

SELECT FormTable.cldf_id, FormTable.cldf_languageReference, FormTable.cldf_value, BorrowingTable.date

FROM FormTable

INNER JOIN FormTable_BorrowingTable_cldf_targetFormReference

ON FormTable_BorrowingTable_cldf_targetFormReference.FormTable_cldf_id =

  FormTable.cldf_id

INNER JOIN BorrowingTable

ON FormTable_BorrowingTable_cldf_targetFormReference.BorrowingTable_cldf_id =

  BorrowingTable.cldf_id

INNER JOIN FormTable_BorrowingTable_cldf_sourceFormReference

ON FormTable_BorrowingTable_cldf_sourceFormReference.BorrowingTable_cldf_id =

  BorrowingTable.cldf_id

INNER JOIN FormTable AS FormSourceTable

ON FormTable_BorrowingTable_cldf_sourceFormReference.FormTable_cldf_id =

  FormSourceTable.cldf_id

WHERE FormSourceTable.languageReference = "lati1261";

Isn't the second form less space-consuming on disk (no need to store a string for every pair), more space-consuming in the query (explicit INNER JOIN instead of taking the cross product and then thinning it down), and a query of the type databases should be optimized for and therefore faster?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/cldf/csvw/issues/38#issuecomment-634201598, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUOKDN72EZUFBRPQXDHT3RTQDR5ANCNFSM4NKP2SLA .

xrotwang commented 4 years ago

Regarding space/performance considerations: we are dealing with data here that comes in via csv files of language data. I really don't see space on disk as being anywhere close to being a limiting factor here. Regarding speed, I'd like to see some profiling before committing to changes. Intuition isn't a good guide here, from my experience.

Robert Forkel xrotwang@googlemail.com schrieb am Di., 26. Mai 2020, 21:21:

Can't you simply add a "and context = ..." to the on clause of the join in your second statement and have this run on the current schema? Overall, I don't see this as convincing enough to introduce a backwards incompatible change.

Gereon Kaiping notifications@github.com schrieb am Di., 26. Mai 2020, 20:34:

Sure – I fully understand what the context does for sources and I approve of that. I just still think that the semantics of the context column for an EntryTable_SenseTable is vastly different from an EntryTable_SourceTable.

I'm not saying two different relations between the same two tables is inconceivable, quite the opposite. We do have an example in CLDF, actually: The BorrowingTable has two relations to the FormTable, namely source form and target form. I can easily imagine people (ab)using it to describe borrowing events, which would have multiple targets, and unsure sources, described by listing the possible options. (This may not be wise, but that's a different question.) I wonder why it is better for the two different relationships to share the same association table, instead of two tables.

Let's use this borrowing table example. Say, I want to know all forms that were borrowed from Latin, and the year of that borrowing event.

If we solve this with context, we need (excuse my bad manual SQL, I hope I get my point across despite the mistakes)

SELECT FormTable.cldf_id, FormTable.cldf_languageReference, FormTable.cldf_value, BorrowingTable.date

FROM FormTable, FormTable_BorrowingTable, BorrowingTable, FormTable_BorrowingTable as FormSourceTable_BorrowingTable, FormTable as FormSourceTable

WHERE FormTable.cldf_id = FormTable_BorrowingTable.FormTable_cldf_id

AND FormTable_BorrowingTable.context = "cldf_targetFormReference"

AND FormTable_BorrowingTable.BorrowingTable_cldf_id = BorrowingTable.cldf_id

AND FormSourceTable_BorrowingTable.BorrowingTable_cldf_id = BorrowingTable.cldf_id

AND FormSourceTable_BorrowingTable.contexf = "cldf_sourceFormReference"

AND FormSourceTable_BorrowingTable.FormTable_cldf_id = FormSourceTable.cldf_id

AND FormSourceTable.languageReference = "lati1261";

With two tables (FormTable_BorrowingTable_cldf_sourceFormReference and FormTable_BorrowingTable_cldf_targetFormReference) that becomes

SELECT FormTable.cldf_id, FormTable.cldf_languageReference, FormTable.cldf_value, BorrowingTable.date

FROM FormTable

INNER JOIN FormTable_BorrowingTable_cldf_targetFormReference

ON FormTable_BorrowingTable_cldf_targetFormReference.FormTable_cldf_id =

  FormTable.cldf_id

INNER JOIN BorrowingTable

ON FormTable_BorrowingTable_cldf_targetFormReference.BorrowingTable_cldf_id =

  BorrowingTable.cldf_id

INNER JOIN FormTable_BorrowingTable_cldf_sourceFormReference

ON FormTable_BorrowingTable_cldf_sourceFormReference.BorrowingTable_cldf_id =

  BorrowingTable.cldf_id

INNER JOIN FormTable AS FormSourceTable

ON FormTable_BorrowingTable_cldf_sourceFormReference.FormTable_cldf_id =

  FormSourceTable.cldf_id

WHERE FormSourceTable.languageReference = "lati1261";

Isn't the second form less space-consuming on disk (no need to store a string for every pair), more space-consuming in the query (explicit INNER JOIN instead of taking the cross product and then thinning it down), and a query of the type databases should be optimized for and therefore faster?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/cldf/csvw/issues/38#issuecomment-634201598, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUOKDN72EZUFBRPQXDHT3RTQDR5ANCNFSM4NKP2SLA .

Anaphory commented 4 years ago

Okay, that is a very sensible objection. I'll do some testing which option works and whether there is a difference that might make this worth considering. Although I'm still not very good at SQL, so if you know someone better to draft benchmarks, that may be the better way.

But to me it still seems like the inertia of having adopted the current solution is the major factor in favor of keeping it. Thinking ‘what if we had a clean slate now’, I still think something like Table1_Table2_ColumnName might have been better than Table1_Table2 with context="ColumnName". Do you not see the two types (source vs. non-source) of using context as vastly different, like I currently do? Do you consider the current way we do it the straight solution instead of a hack to make things work while re-using exisfing structures, and can you explain that reasoning to me?

xrotwang commented 4 years ago

Well, the way I see relational databases, different types of data go into different tables. Your proposal would go against this. We'd have different tables with exactly the same type of rows: two foreign keys to the same two tables.

But yes, the "inertia" is certainly there. And I think that's not a bad thing. I see a maintainer as a guardian of the expectations of existing users at least as much as a visionary in search of the next version 😀

Gereon Kaiping notifications@github.com schrieb am Di., 26. Mai 2020, 22:03:

Okay, that is a very sensible objection. I'll do some testing which option works and whether there is a difference in speed that might make this worth considering. Although I'm still not very good at SQL, so if you know someone better to draft benchmarks, that may be the better way.

But to me it still seems like the inertia of having adopted the current solution is the major factor in favor of keeping it. Thinking ‘what if we had a clean slate now’, I still think something like Table1_Table2_ColumnName might have been better than Table1_Table2 with context="ColumnName". Do you not see the two types (source vs. non-source) of using context as vastly different, like I currently do? Do you consider the current way we do it the straight solution instead of a hack to make things work while re-using exisfing structures, and can you explain that reasoning to me?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/cldf/csvw/issues/38#issuecomment-634249521, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUOKGN3OKF77L2DIG5L2LRTQOCNANCNFSM4NKP2SLA .

xflr6 commented 4 years ago

I have not run the query, but AFAIU the following is equivalent to the first query above (as in general inner joins with conditions are equivalent to cartesian products with conditions in the whereclause).

SELECT
  FormTable.cldf_id,
  FormTable.cldf_languageReference,
  FormTable.cldf_value,
  BorrowingTable.date
FROM FormTable
JOIN FormTable_BorrowingTable
 ON FormTable.cldf_id = FormTable_BorrowingTable.FormTable_cldf_id
 AND FormTable_BorrowingTable.context = "cldf_targetFormReference"
JOIN BorrowingTable
  ON FormTable_BorrowingTable.BorrowingTable_cldf_id = BorrowingTable.cldf_id
JOIN FormTable_BorrowingTable as FormSourceTable_BorrowingTable
  ON FormSourceTable_BorrowingTable.BorrowingTable_cldf_id = BorrowingTable.cldf_id
  AND FormSourceTable_BorrowingTable.context = "cldf_sourceFormReference"
JOIN FormTable as FormSourceTable
  ON FormSourceTable_BorrowingTable.FormTable_cldf_id = FormSourceTable.cldf_id
  AND FormSourceTable.languageReference = "lati1261"

Hope that helps.