Scille / parsec-cloud

Open source Dropbox-like file sharing with full client encryption !
https://parsec.cloud
Other
260 stars 40 forks source link

[Server v3] Sort, cleanup and uniformise SQL queries #7122

Open vxgmichel opened 2 months ago

vxgmichel commented 2 months ago

A cleanup of the SQL queries could be useful, in particular:

Some queries are also duplicated in different component modules. Instead, one component should probably expose a method to be used by the other components.

Another improvement would be to replace all the occurrences of such pattern:

FROM realm_topic
WHERE realm = { q_realm_internal_id(organization_id="$organization_id", realm_id="$realm_id") }

by:

FROM realm_topic
JOIN realm ON realm_topic.realm = realm._id
JOIN organization ON realm.organization = organization._id
WHERE realm.realm_id = $realm_id
AND organization.organization_id = $organization_id

At the moment both approaches are used, so it's a bit confusing to navigate between queries.

Another suggestion: internal ids for organization, user and device could be extracted during the checks common to all authenticated commands (see #7119 ). The same could also performed for the internal realm id for the realm commands. This would simplify a lot of SQL queries like the example above. There was already an attempt at implementing this here: https://github.com/Scille/parsec-cloud/blob/11efa76470e7e173f1f60153f05370566e67eb92/server/parsec/components/auth.py#L46-L67

mmmarcos commented 2 weeks ago
  • check that consistent naming is used

Just to be sure, can you give an example of what you consider to be inconsistent naming?

  • check that selected columns are actually used by the code

I guess this requires manually scanning query by query. This may not be easy to detect and even if done, without an automatic check, we cannot be sure it we'll stay that way. So I'll say either:

  • see if some queries overlap

Some queries are also duplicated in different component modules. Instead, one component should probably expose a method to be used by the other components.

This seem to be the case. Maybe having queries in a single place (instead of having them in each component can help with this), what do you think?

Another improvement would be to replace all the occurrences of such pattern:

FROM realm_topic
WHERE realm = { q_realm_internal_id(organization_id="$organization_id", realm_id="$realm_id") }

by:

FROM realm_topic
JOIN realm ON realm_topic.realm = realm._id
JOIN organization ON realm.organization = organization._id
WHERE realm.realm_id = $realm_id
AND organization.organization_id = $organization_id

At the moment both approaches are used, so it's a bit confusing to navigate between queries.

I'm not yet familiar with the q_ queries, what is the actual query that its executed in the first approach?

Another suggestion: internal ids for organization, user and device could be extracted during the checks common to all authenticated commands (see #7119). The same could also performed for the internal realm id for the realm commands. This would simplify a lot of SQL queries like the example above. There was already an attempt at implementing this here:

https://github.com/Scille/parsec-cloud/blob/11efa76470e7e173f1f60153f05370566e67eb92/server/parsec/components/auth.py#L46-L67

You mean the organization, user and device are identified during authentication, which is performed before any other operation, and cannot change later during the actual operation. So you are proposing to store that information somewhere instead of performing SQL queries each time? BTW, from the code example, it seems that this information is already stored in the [Anonymous|Invited|Authenticated]AuthInfo. Am I missing something?

vxgmichel commented 2 weeks ago

Just to be sure, can you give an example of what you consider to be inconsistent naming?

For instance some queries are prefixed with _ and some are not:

q_freeze_user = Q(
[...]
_q_revoke_user = Q(
[...]

implement some sort of automatic check to detect unused columns (not sure how)

That sounds pretty hard to do.

This seem to be the case. Maybe having queries in a single place (instead of having them in each component can help with this), what do you think?

That was was the case in v2, but I don't think it was too convenient.

In the end you as a developer want your queries to be close to the methods using them, as you often go back and forth between the two. My suggestion instead is to split the queries by components and make sure that every component is responsible for a specific set of tables. Then it might expose methods to other components in order to access those specific tables.

At least that's my intuition, what do you think?

I'm not yet familiar with the q_ queries, what is the actual query that its executed in the first approach?

You can see what's generated by the Q micro framework using the sql attribute:

_q_insert_block = Q(
    f"""
INSERT INTO block (organization, block_id, realm, author, size, created_on, key_index)
VALUES (
    { q_organization_internal_id("$organization_id") },
    $block_id,
    { q_realm_internal_id(organization_id="$organization_id", realm_id="$realm_id") },
    { q_device_internal_id(organization_id="$organization_id", device_id="$author") },
    $size,
    $created_on,
    $key_index
)
"""
)
breakpoint()
[...]

(Pdb++) print(_q_insert_block.sql)

INSERT INTO block (organization, block_id, realm, author, size, created_on, key_index)
VALUES (
    (SELECT _id FROM organization WHERE organization.organization_id = $1),
    $2,
    (SELECT _id FROM realm WHERE realm.organization = (SELECT _id FROM organization WHERE organization.organization_id = $1) AND realm.realm_id = $3 ),
    (SELECT _id FROM device WHERE device.organization = (SELECT _id FROM organization WHERE organization.organization_id = $1) AND device.device_id = $4 ),
    $5,
    $6,
    $7
)

You mean the organization, user and device are identified during authentication, which is performed before any other operation, and cannot change later during the actual operation.

Yes. Also note the authentification would become part of the same postgre transaction if https://github.com/Scille/parsec-cloud/issues/7392 is implemented, so it would make even more sense.

So you are proposing to store that information somewhere instead of performing SQL queries each time?

Exactly.

BTW, from the code example, it seems that this information is already stored in the [Anonymous|Invited|Authenticated]AuthInfo. Am I missing something?

At the moment, those *_internal_id fields are neither set or used. They were introduced by @touilleMan when he updated the parsec server to v3 with the idea that they would be useful in the future.