culturecreates / artsdata-pipeline-action

Action to manage the data pipeline for Artsdata
The Unlicense

Design a reusable Action #1

Open saumier opened 8 months ago

saumier commented 8 months ago

Design a reusable workflow that handles website crawling and posts to Artsdata Databus. This workflow should use a Ruby action to keep workflows DRY. It should be usable by anyone external to Culture Creates wanting to add JSON-LD to Artsdata.

Let's start with a document that describes the ideal solution. https://docs.google.com/document/d/1dXJtTytgLWl4uq9LUa-cdW6SPCfRRGBPwMyWB-3v9fE/edit

Github references: https://docs.github.com/en/actions/creating-actions/about-custom-actions

Epic User Story

I am a webmaster of a website with cultural events. I would like to get my events into Artsdata on a schedule that I can set. Based on feedback reports from Artsdata, I would like to fix issues either on my website or in the data pipeline.

  1. Schedule push using Workflow run command
  2. Generate a list of webpages: from a sitemap, or by building a simple crawler with some 'directives' to collect the URLs
  3. Use my publisher credentials for the Artsdata Databus
  4. Run a test validation to see issues with my data before loading into Artsdata.
  5. See my events in Artsdata and check reports on my data quality after they are loaded into Artsdata
  6. Link my places, performers and organizers to Artsdata URIs
  7. Set up my event types as a controlled vocabulary
  8. Map my event types to the Artsdata Event Type controlled vocabulary
  9. Save all versions of my website data dump in Github

Phase 1

Create a standalone Action in a new repo that will post to the Artsdata Databus, with versioned releases. Refactor all workflows that currently POST to the Artsdata Databus. https://github.com/culturecreates/artsdata-pipeline-action

Phase 2

Create a new version of the Action to offer different modes like 'fetch', 'push', and 'fetch-push'.
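As a sketch only, a hypothetical `mode` input could dispatch to the pipeline steps like this (the input name and dispatch logic are assumptions for discussion, not the action's decided API):

```ruby
# Illustrative sketch: map a hypothetical `mode` input to pipeline steps.
# The mode values mirror the ones proposed above.
def steps_for(mode)
  case mode
  when "fetch"      then [:fetch]            # crawl and commit only
  when "push"       then [:push]             # post existing data to the Databus
  when "fetch-push" then [:fetch, :push]     # full pipeline
  else raise ArgumentError, "unknown mode: #{mode}"
  end
end
```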

Phase 3

Create a new version of the Action to include OpenRefine CLI mode.

Ideas

Features:

saumier commented 7 months ago

@dev-aravind Reminder from our discussion today: try to find the URL to send to the Artsdata Databus that points to the specific version committed to GitHub. The Databus stores the downloadUrl of each published artifact version, and expects to receive the same data for a specific artifact version when using its downloadUrl. This approach treats data like versioned software.
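One way to get such a URL is to pin the raw GitHub URL to a commit SHA instead of a branch. A minimal sketch (hypothetical helper, not part of the action):

```ruby
# Build a downloadUrl pinned to a specific commit so the Databus can always
# re-fetch exactly the bytes that were published for that artifact version.
def versioned_download_url(owner, repo, commit_sha, path)
  "https://raw.githubusercontent.com/#{owner}/#{repo}/#{commit_sha}/#{path}"
end

url = versioned_download_url(
  "culturecreates", "artsdata-orion",
  "0123456789abcdef0123456789abcdef01234567", # e.g. from `git rev-parse HEAD`
  "output/levivier-events.jsonld"
)
```

Unlike a `main`-branch URL, this URL keeps returning the same content even after later commits change the file.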

dev-aravind commented 7 months ago

@saumier The first pass of the workflow is up. Please review it while I make further changes.

saumier commented 7 months ago

@dev-aravind In the Friday standup I think you said you were still working on this, and that you were trying to use this reusable workflow from the IPAA repo, but it was committing the file to this repo instead of the IPAA repo.

With this issue I am looking to centralize the workflows so we don't have to repeat them in each repo and for each website. The goal is to be DRY. This does not mean everything has to be in one large workflow; there can be parts of the workflow that call other parts. The design is important here. We currently have several repos, like artsdata-planet-ipaa, artsdata-planet-scenefrancophones, artsdata-planet-nac, artsdata-orion, and even the artsdata-planet-footlight that Suhail is working on, and all are doing similar steps such as posting to the Artsdata Databus. Having multiple repos is needed so we can collaborate with different organizations.

So I propose we discuss changing the approach and start with a single action for publishing to the Artsdata Databus that all repos can use. Let's discuss after the standup on Monday if you are available. Sound good?

dev-aravind commented 7 months ago

@saumier Yes, we can do that.

saumier commented 7 months ago

@dev-aravind I created a new repo called https://github.com/culturecreates/artsdata-pipeline-action/

I think this should be a JavaScript action. What do you think?

Take a look at these steps to create a JavaScript action.

https://docs.github.com/en/actions/creating-actions/creating-a-javascript-action

dev-aravind commented 7 months ago

@saumier We can proceed with a JavaScript action. I'll research this and update you. For now I'm just trying to create a workflow that can push into Artsdata.

dev-aravind commented 7 months ago

@saumier I've created a reusable workflow in the repository you created. I need the publisher URI to be accessible to the repository to test it out. Also, if we want to use it in other repositories, we will need to release a version of it.

saumier commented 7 months ago

@dev-aravind I shared my publisher URI secret with this repo. Please go ahead and publish the action so it can be used in other repos.

dev-aravind commented 7 months ago

@saumier Assigning this to you as we are now using the artsdata pipeline workflow in orion to import data.

saumier commented 7 months ago

@dev-aravind This is a great start, but I think we can make it even simpler.

Let's look at how we can simplify the Databus API parameters.

You currently have:

    - name: Action setup
      uses: culturecreates/artsdata-pipeline-action@v1.0.0
      with:
        artifact_name: levivier-ca
        page_url: https://levivier.ca/fr
        publisher_uri: "${{ secrets.PUBLISHER_URI_GREGORY }}"
        download_uri: https://raw.githubusercontent.com/culturecreates/artsdata-orion/main/output/levivier-events.jsonld
        download_file: levivier-events.jsonld
        group: artsdata-orion

The Databus has the following required parameters: [:publisher, :group, :artifact, :version, :downloadUrl, :downloadFile]

So the simplest I can think of is:

    - name: Action setup
      uses: culturecreates/artsdata-pipeline-action@v1.0.0
      with:
        artifact: levivier-ca
        publisher: "${{ secrets.PUBLISHER_URI_GREGORY }}"
        downloadUrl: https://raw.githubusercontent.com/culturecreates/artsdata-orion/main/output/levivier-events.jsonld

The Action would automatically complete the other required parameters if they are not provided:

 - group --> from the repo name
 - downloadFile --> from the file name at the end of the download URL
 - version --> already doing this, so no change

Optional parameters can be left blank unless they are provided:

 - comment --> optional comment on the artifact
 - shacl --> optional SHACL file to validate against
 - reportCallbackUrl --> optional webhook to report results back to
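The defaulting described above could be sketched in Ruby like this (illustrative only, not the action's actual implementation):

```ruby
require "uri"

# Fill in the required parameters the caller omitted: group falls back to
# the repository name, downloadFile to the last path segment of downloadUrl.
def fill_defaults(params, repo_name)
  params[:group]        ||= repo_name
  params[:downloadFile] ||= File.basename(URI(params[:downloadUrl]).path)
  params
end
```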

For the naming, let's keep the same names rather than making new ones. So let's standardize on the camel case used by the Artsdata Databus and keep the names as they are, including downloadUrl, which has an "l" (URL), not an "i" (URI), because the download must be a URL (uniform resource locator). A URI does not need to actually be "locatable", but the download does.

Check out the POST /databus documentation here https://documenter.getpostman.com/view/3157443/TVep7mv3

Let's discuss further before we make any more changes to your Action.

I want to write a user story that I can then demo.

dev-aravind commented 7 months ago

@saumier I've updated the workflow to reduce the number of arguments and also standardize them.

saumier commented 7 months ago

@dev-aravind I don't think the latest script accepts optional parameters, which need to be sent when they differ from the defaults, like the group, comment, or shacl parameters.

I have an analysis question for you: can the action replace the POST part of the workflows for all our current uses without changing the groups and artifacts? I will write more background and user stories so this can be designed to cover all our real needs.

dev-aravind commented 7 months ago

@saumier The workflow now also accepts optional parameters and sets some of them, like the file name and group, automatically if the user doesn't provide them.

saumier commented 7 months ago

@dev-aravind Let's proceed to the next iteration of the Action. Can you propose a way to add the ability to do the crawling for cases like scenefrancophones and vivier?

Without writing the code, can you think about the design? Document the inputs to the Action that would complete the following "fetch-and-commit-data" job that you wrote. Open a design doc if you like, or document it in this issue. I am fine either way.

  fetch-and-commit-data:
    runs-on: ubuntu-latest
    outputs:
      commit-hash: ${{ steps.get_commit_hash.outputs.commit-hash }}

    steps:
    - name: Checkout Repository
      uses: actions/checkout@v4

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        bundler-cache: true

    - name: Run Main Script
      run: |
        bundle exec ruby src/main.rb \
          https://reseauartactuel.org/liste_evenements/page/ \
          "h3.tribe-events-list-event-title a" \
          output/rcaaq-events.jsonld \
          true

    - name: Commit and Push Changes
      run: |
        git config --local user.email "actions@github.com"
        git config --local user.name "GitHub Actions"
        git pull
        git add "output/rcaaq-events.jsonld"
        git commit -m "Add data generated by the script"
        git push

    - name: Get commit hash
      id: get_commit_hash
      run: |
        commit_hash=$(git rev-parse HEAD)
        echo "commit-hash=$commit_hash" >> $GITHUB_OUTPUT
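For the design discussion, one hypothetical way to name the script's positional arguments as action inputs (all input names below are assumptions for illustration, not a decided API):

```ruby
# Hypothetical named inputs for the hard-coded arguments in the job above.
# The real names belong in the design doc.
inputs = {
  "page-url"     => "https://reseauartactuel.org/liste_evenements/page/",
  "css-selector" => "h3.tribe-events-list-event-title a", # event link selector
  "output-file"  => "output/rcaaq-events.jsonld",
  "is-paginated" => "true"
}

# The action could rebuild the positional argv for src/main.rb from them:
argv = %w[page-url css-selector output-file is-paginated].map { |k| inputs[k] }
```
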

dev-aravind commented 7 months ago

@saumier I've made a document for version 2 of the Artsdata Pipeline Action. You can find it here.

saumier commented 7 months ago

Excellent. I added some user stories. Please revise your design to support them.

dev-aravind commented 7 months ago

@saumier The updated workflow is up now. Please check it out and let me know. Doc

saumier commented 7 months ago

@dev Looking good, but I need a bit more time to review and add more details to the user story.

saumier commented 2 weeks ago

@dev-aravind Let's do phase 2. I think the timing is really good and we have enough examples to make this very useful. Thx.

saumier commented 3 hours ago

@dev-aravind Can you please explore Docker container actions (https://docs.github.com/en/actions/sharing-automations/creating-actions/creating-a-docker-container-action) so we can reuse the Ruby code? Thx.

Take a look at the Artsdata Pipeline Action v2 because there are a couple of comments assigned to you.