archesproject / arches-docs

official repo for Arches documentation
https://arches.readthedocs.io
9 stars 21 forks source link

Say "nodegroup_id" instead of "nodegroupid" #345

Closed kfogel closed 1 year ago

kfogel commented 1 year ago

This edits various places in the documentation where (I believe) we should recommend "nodegroup_id" instead of "nodegroupid" in JSON. When our team did a batch import of images guided by the documentation published at https://arches.readthedocs.io/en/7.3/import-export, the existing instructions to use "nodegroupid" in JSON didn't work; when we switched our JSON to use "nodegroup_id", it worked.

Still, I'm not 100% sure this change is correct and would definitely like some review. Once that review is done, if the change looks correct then I'll just amend this commit message before we merge.

Here are some more details:

In the Arches source code tree, both "nodegroupid" and "nodegroup_id" each appear plenty of times, though the latter more often:

Meanwhile, in the Arches documentation, both terms also appear -- though, curiously, in exactly reversed proportion!

As far as I can tell, what's going on is that within the database itself (e.g., in tables like 'cards', 'nodes', 'edit_log', 'tiles', and 'nodegroup') the column name is always "nodegroupid". However, in places where nodegroup IDs are to be accepted or expressed from/to textual sources (e.g., in JSON input or output), then "nodegroup_id" is generally used and, somewhere along the way, gets converted over.

(But again, I'm not 100% sure that this pattern is real. It's just an educated guess, and I'd appreciate more knowledgeable heads evaluating these statements.)

So for example, arches/db/dml/db_data.sql in the arches source code:

  INSERT INTO functions(functionid, modulename, classname,
                        functiontype, name, description,
                        defaultconfig, component)
      VALUES ('60000000-0000-0000-0000-000000000001',
              'primary_descriptors.py',
              'PrimaryDescriptorsFunction',
              'primarydescriptors',
              'Define Resource Descriptors',
              'Configure the name, description, and map popup of a resource',
              '{"module": "arches.app.functions.primary_descriptors",
                "class_name":"PrimaryDescriptorsFunction",
                "descriptor_types":
                 {"name": {"nodegroup_id": "", "string_template": ""},
                "description": {"nodegroup_id": "", "string_template":""},
                "map_popup": {"nodegroup_id": "", "string_template":""}} }',
              'views/components/functions/primary-descriptors');

(Line breaks added to fit commit message width constraints.)

Besides that conversion-from-textual-source pattern, "nodegroup_id" is also sometimes used in the Arches source code for variable names; e.g., search for it in tests/utils/label_based_graph_test.

But even then, there are still some things that puzzle me, given the above broad patterns (which, again, are highly tentative and I'm not sure I've correctly deduced them). For example, there's this line in arches/management/commands/load_jsonld.py:

  "nodegroupid": tile.nodegroup_id,

There's similar stuff in arches/app/media/js/viewmodels/tile.js, arches/management/commands/etl_template.py, and tests/importer/jsonld_import_tests.py, and other files.

Anyway, in this documentation change, I've left "nodegroupid" untouched in these files...

...because in those contexts it appears to be correct.

Standard checks:

kfogel commented 1 year ago

Hi, @ekansa! I think you might be the right person to review this? (Or if not, you probably know who would be...)

This PR is against master, but mainly I want to first just make sure the change is directionally correct. Once that's confirmed, I'm happy to do the somewhat fancier branch dance documented in the contribution guide to ensure that the eventual change ends up everywhere it should end up.

mradamcox commented 1 year ago

Hi @kfogel, first, I think you are correct about changing the JSON import schema--if nodegroup_id worked for you it sounds like there is a typo in the docs, and it actually makes sense to use nodegroup_id in that case, as you'll see...

There is indeed a very important difference between nodegroupid and nodegroup_id, but I understand your confusion and shared it for a while.

nodegroupid

This is an attribute stored as a UUID in the database. This is the style for many other ids, like nodeid, tileid, etc. In Arches, these ids are typically the primary key (.pk) for an instance of NodeGroup, Node, Tile, etc.

nodegroup_id

This is the id of a NodeGroup that is related to a given object, say a Tile, as a ForeignKey. As you see here: https://github.com/archesproject/arches/blob/master/arches/app/models/models.py#L1062, the TileModel doesn't have an attribute nodegroup_id, it just has nodegroup, which is a ForeignKey to the NodeGroup model. To acquire the id of the related NodeGroup through the Tile (as opposed to the NodeGroup object itself), you append _id, as in your example:

"nodegroupid": tile.nodegroup_id,

It wouldn't be possible to use tile.nodegroup here because that is a NodeGroup object (not just the UUID).

So, I have a feeling that the import JSON part of this PR can go through, but none of the other changes.

ekansa commented 1 year ago

Hello @kfogel and @mradamcox!

Thanks @kfogel for putting so much thought into this update, and to @mradamcox for providing more explanation about what's happening with nodegroupid versus nodegroup_id. I now see logic in the JSON expressions and the Django ORM. I think it'll be useful to explicitly document the rationale behind the *id versus *_id patterns.

@kfogel does the feedback from @mradamcox make sense to you? Do you need help in revising your branch before we more forward with a merge?

mradamcox commented 1 year ago

I've been reading a bit more up on this, and should add little more to my explanation above.

Where the TileModel defines a ForeignKey field called nodegroup, it stores that relationship in a database column called nodegroup_id. This is what you are getting when you see tile.nodegroup_id, the primary key id for the related NodeGroup, not the nodegroup itself. When you see tile.nodegroup, then Django uses the stored nodegroup_id to acquire the related NodeGroup and returns the instance itself.

I had been wondering, as I wrote the above comment, why one would use tile.nodegroup_id rather than tile.nodegroup.nodegroupid (brevity aside), but it's because the latter performs a second database lookup to get the nodegroup instance, where the former simply reads the stored nodegroup_id attribute of the tile.

ekansa commented 1 year ago

@mradamcox ! Thanks for the additional clarification. Yes, this also aligns with my own understanding about how Django works, and the tile.nodegroup.nodegroupid (the nodegroupid is the primary key for the nodegroup model: https://github.com/archesproject/arches/blob/stable/7.3.0/arches/app/models/models.py#L527) involves more overhead (a look up or a join, depending on the specifics) than simply working with the un-dereferenced foreign key: nodegroup_id

So basically, if I'm understanding this correctly, tile.nodegroup.nodegroupid and nodegroup_id are equivalent values but the former avoids extra database work of looking up and dereferencing nodegroup objects.

@kfogel do you want me to help make edits to your branch to update this PR?

kfogel commented 1 year ago

Wow -- thank you so much, @mradamcox for this very helpful feedback, and @ekansa for the clarifying questions / confirmations. I get what's going on with nodegroupid vs nodegroup_id now. I'll revise the branch and push to this PR in a moment. I won't rebase, as I think it's actually worth preserving this history; I'll just add a correcting commit, and will mention in the commit message that the correction is in response to this review.

It sounds like a second PR might be in order, summarizing the explanation above in the right place in the docs. I'll try to do that next.

kfogel commented 1 year ago

Okay, this PR is ready for review again, @mradamcox and @ekansa -- only the import/export doc changes remain.

kfogel commented 1 year ago

(By the way, I just now realized I should have named my branch master_nodegroupid_to_nodegroup_id; my bad. I'll use the master_-prefix convention in the next PR.)

kfogel commented 1 year ago

Thanks for the merge, @ekansa!