GSA / datagov-ckan-multi

Other
10 stars 6 forks source link

Discovery: CKAN organizations linked to publishers list #540

Closed anuveyatsu closed 3 years ago

anuveyatsu commented 3 years ago

User Story

Related to: issue 481

Acceptance Criteria

anuveyatsu commented 3 years ago

Analysis

Questions:

Option 1: Storing publishers in "group_extra" table

This option can sound natural and it'd also ease the way we retrieve data via API, eg, by using organization_show API.

When adding publishers to an org, we need to use "extras" parameter. Since extras must be "key: value" pairs and CKAN requires keys to be unique, we cannot have multiple same level publishers as below (only one publisher will be saved in the db):

[
  {"key": "publisher_1", "value": "first level publisher 1"},
  {"key": "publisher_1", "value": "first level publisher 2"}
]

We could store multiple same level publishers as a comma separated value. Disadvantage is that it is easy to make a mistake + if publisher name has a comma it must be escaped:

[
  {"key": "publisher_1", "value": "first level publisher 1, first level publisher 2, etc"}
]

Another option here would be to store publishers as stringified array which we can then parse in the frontend. It might sound hacky but it might still be an option:

[
  {"key": "publisher_1", "value": "[\"first level publisher 1\", \"first level publisher 2\"]"}
]

Some of the limitations of using extras:

Option 2: Storing publishers in a JSON or CSV file

Similar to what we already have but with the addition of related CKAN organization so that we can filter list of publishers:

[
  {
    "ckan_org": ["ckan_org_name", "another_ckan_org_name"],
    "publisher": "Department of Agriculture",
    "publisher_1": null,
    "publisher_2": null,
    "publisher_3": null,
    "publisher_4": null,
    "publisher_5": null
  },
  {
    "ckan_org": ["ckan_org_name"],
    "publisher": "Department of Agriculture",
    "publisher_1": "Agricultural Marketing Service",
    "publisher_2": null,
    "publisher_3": null,
    "publisher_4": null,
    "publisher_5": null
  },
  {
    "ckan_org": ["ckan_org_name"],
    "publisher": "Department of Energy",
    "publisher_1": "Office of Environmental Management",
    "publisher_2": null,
    "publisher_3": null,
    "publisher_4": null,
    "publisher_5": null
  }
]

Note that ckan_org property is a list so that a publisher can relate to multiple ckan orgs.

This option would require managing list of publishers using JSON file (or CSV/Excel so we can convert to JSON later). This option assumes the list isn't modified frequently.

Using JSON file to store/manage publishers is the cheapest option right now. However, every time a change is made, the app must be redeployed as list of publishers is compiled into the metadata app's code.

A workaround would be to store JSON file somewhere outside of the compiled code so that it can be accessed via HTTP hence no deployment is required after the changes in list of publishers. However, it increases chances of having human-error, eg, breaking JSON structure etc.

Option 3: Publisher object

Since the publishers have a tree structure + root node has a one-to-many relationship with CKAN orgs (eg, a publisher can be related to many orgs), we should consider creating a separate table in the db.

As any tree data can be stored in the relational database, and depending on how we need to retrieve the data we can optimize the schema and API. Here is the article about storing and retrieveing tree data using python/django which we can refer when implementing it in CKAN:

https://medium.com/@spybugg/storing-and-retrieving-tree-structures-in-relational-databases-using-python-django-7480f40c24b

Questions:

Option 4: Organization hierarchy

If each publisher could be an organization in CKAN, we could use the following extension to have hierarchy between them: https://github.com/ckan/ckanext-hierarchy

anuveyatsu commented 3 years ago

Hi @adborden could you take a look at my comment above?

adborden commented 3 years ago

I think we'd prefer to manage the publishers list as a CSV (google sheet) and then occasionally update it and push it into inventory.data.gov. I think really what we're after would be a combination of options 1 and 2 with a ckan command to update group_extras from a CSV file.

Questions:

  • If user is a member of multiple organizations, should she/he have all publishers or only for selected organization?

I think you would only see publishers for the selected organization. I think that makes the most sense.

  • I'd say filter by selected org, however, it complecates logic, eg, what if user wants to select publisher prior to selecting org? Should we display all available publishers? Should we display message saying "first select an org"?

We should be setting org selector to the first available value so that it is never empty. In most cases, it will be the only organization the user is a member. In the unlikely case a user can find themselves with an un-selected org, a message is fine asking them to select an org, first.

  • I guess most users of Inventory app only have access to 1 organization so it might not be even an issue.

Right, it's unlikely to come up so the simplest fix to avoid an error is preferred.

Option 1: Storing publishers in "group_extra" table

Yes, I'm leaning to this as the best option. I like that this also is exposing publisher data via the API. Is there a limit to how much data we can store in the extras?

Some of the limitations of using extras:

  • Managing publishers that have tree structure with this approach will be challenging. It is not clear how to define parent-child relationship between the publishers (nodes);

I was thinking of using a JSON encoded array of arrays. Essentially it's a CSV, but each row is converted to an array.

[
  {"key": "publisher", "value": "[[\"first level publisher 1\"],[\"first level publisher 1\", \"first level publisher 2\"]"}
]
  • List of publishers for each CKAN org must be done by a curator or similar which is tedious since "extras" needs to be defined per each org and then written to CKAN via API;

This is somewhat of an advantage because the API to update the publisher's list already exists. We don't have to create a UI or new API to manage this.

  • It's easy to make a human-error.

Yes.

Option 2: Storing publishers in a JSON or CSV file

This option is okay. It allows us to easily write validation logic into the build step to avoid human error. It also makes it easy to keep a single CSV of publishers and occasionally update the app.

Option 3: Publisher object

I think this is more than we want to invest in for a feature we're only sort-of confident about. Right now we're looking for a lower level of effort so we can see how folks will use publishers and then maybe identify what this solution should look like. This option requires UI, API, and database work.

Option 4: Organization hierarchy

This is interesting, but I don't want to tie the publishers to user management. Also, I imagine this is a headache to move orgs around if the hierarchy of publishers needs to change.

anuveyatsu commented 3 years ago

@adborden great, thanks for the comment. Let me confirm the flow:

Is there a limit to how much data we can store in the extras?

I don't think Postgres has any practical limit for single column value. I found some information about it here - https://wiki.postgresql.org/wiki/FAQ#What_is_the_maximum_size_for_a_row.2C_a_table.2C_and_a_database.3F

adborden commented 3 years ago

Here's a partial example for USDA which is represented in multiple organizations on inventory.data.gov.

organization,publisher,publisher_1,publisher_2,publisher_3,publisher_4,publisher_5
agricultural-marketing-service-department-of-agriculture,Department of Agriculture,Agricultural Marketing Service,,,,
ars-usda-gov,Department of Agriculture,Agricultural Research Service,,,,
aphis-usda-gov,Department of Agriculture,Animal and Plant Health Inspection Service,,,,
risk-management-agency-department-of-agriculture,Department of Agriculture,Departmental Management,,,,
usda-gov,Department of Agriculture,Office of Chief Information Officer,,,,
usda-gov,Department of Agriculture,Economic Research Service,,,,
usda-gov,Department of Agriculture,Farm Service Agency,,,,
usda-gov,Department of Agriculture,Food and Nutrition Service,,,,
usda-gov,Department of Agriculture,Food Safety and Inspection Service,,,,
usda-gov,Department of Agriculture,Foreign Agricultural Service,,,,
usda-gov,Department of Agriculture,National Agricultural Statistics Service,,,,
usda-gov,Department of Agriculture,National Institute of Food and Agriculture,,,,
usda-gov,Department of Agriculture,Natural Resources Conservation Service,Colorado State University,,,
usda-gov,Department of Agriculture,Rural Development,,,,
usda-gov,Department of Agriculture,GIPSA,Federal Grain Inspection Service,,,
usda-gov,Department of Agriculture,Natural Resources Conservation Service,,,,
usda-gov,Department of Agriculture,US Forest Service,,,,

For import, I imagine this could be a ckan command done by data.gov operators:

$ ckan --plugin=ckanext-dcatus_metadata publishers-import path-to-publishers.csv

And that ckan command would just loop over the rows to identify organizations, then for each organization, encode each publisher row into an array and save that array of arrays in group_extras of the organization.

# pseudo code
for organization in get_organizations_to_process(csv_file):
  publishers_tree = []
  for row in csv_file:
    if row['organization'] == organization.id:
      publishers_tree.append(row.values())

  organization.append_group_extra(key='publisher', value=json.dumps(publishers_tree))
  organization_update(organization)
anuveyatsu commented 3 years ago

Thanks @adborden I think this looks clear now. I believe next step is to create an implementation ticket with all tasks identified in this thread.