thanks for the write-up @trieloff
as for your Approach 1, Bullet point 3, i think there is a range of serverless implementations that could be used, here is an example for firebase: https://codeburst.io/graphql-on-cloud-functions-for-firebase-153fe7b02ea5
thanks for writing up the document! i will take a look once i am relieved from finals :> @trieloff @davidnuescheler
Also we could look into https://www.apollographql.com/
Apollo is great, but you have to run it yourself, so we might have to find a way to build an Apollo-to-OpenWhisk adapter.
i was trying to set up a simple graphQL (apollo) + firebase service and it seems manageable with hardcoded data in firebase. here is the github repo https://github.com/craeyu/firebase-apollo-demo and i will send the credentials to my firebase on slack with you two. (put it inside the functions folder)
now i am thinking about looking into git hooks and bots for parsing briefly - feel free to let me know your concerns and suggestions. thank you!
In the research phase, can we also take a look at https://developer.github.com/v4/ (GitHub GraphQL API) ... ? or maybe someone already knows the limitations...
As far as I understand, it is possible to stack GraphQL APIs, so that we can have the Helix GraphQL API that calls the GitHub GraphQL API for anything file-, branch-, commit-, or user-related and uses the Firebase backend for anything that looks inside the files.
The same would be possible with AppSync, we'd just have to write resolver functions that call the GitHub GraphQL API.
I am trying to understand a bit more about how graphql works, and I am afraid I am having a bit of a hard time finding graphql examples of more complex queries. Could someone provide examples for this?
E.g. what would the following query look like:
/^[Hh]ackaton:.*Brussels/
hi @koraa! in that case, you would need to write specific queries and resolvers for them. here are some query examples that are more complicated: https://graphql.org/learn/queries/ for example, if i only want the title of the markdowns that were authored by "boris@example.com" the query can be
{
  markdowns(author: "boris@example.com") {
    title
  }
}
and similarly if i want the same query items but also the markdowns' content i can do
{
  markdowns(author: "boris@example.com") {
    title
    content
  }
}
and for regex support we can leverage gatsby, here are some examples of gatsby + graphQL: https://www.gatsbyjs.org/docs/graphql-reference/. for the second point, we would need to figure out the schema for our markdowns: what fields should one markdown contain, do we want to support timestamps for the last modifications, etc. once we have that field in the db and schema we can run queries on it.
right now we do not have a clear opinion on how the schema is going to look yet - so depending on the need we might need to tweak things as we go. :)
Hiho @craeyu :)
Thanks for your explanation!
So if I understand you correctly, we would have to build custom functions for each query we run? So GraphQL is more like a very nice & fancy way of describing rest apis, not a replacement for SQL or something like the mongodb query language?
I was hoping there would be some generic/built-in support for query operators (like in SQL, or the $regex, $gt, $lt, ... operators); this would give us the advantage of being able to support a lot of complex queries out of the box, including those our customers come up with on their own. In the classical graphql development model we would have to manually write an endpoint/graphql function for each type of query we would want to support. In the code of the endpoint we would manually write our DB query in SQL (or whatever DB we would be using)? Did I get that right?
Maybe we could write a generic GraphQL –> DB query language translation layer with the full feature set. Is graphql powerful enough to build a complex querying grammar with it? I haven't found support for json or key/value mappings, so the only thing I can think of is using plain strings:
{
  markdowns(author: "{\"$regex\": \"^boris\"}") {
    title
    content
  }
}
Although in order to support more complex queries like "or" we would probably have to take the entire query as a string? (I've saved myself the trouble of quoting properly this time).
{
  markdowns(query: '{$or: [{author: {$regex: "^boris"}}, {modified_at: {$gt: "1 Jan 2019 0:00 CEST"}}]}') {
    title
    content
  }
}
So this would basically end up being a "mongodb query lang over GraphQL" protocol. We would create a couple of graphql functions (find, findOne, ...) for each collection in our db; the functions would just convert the payload to bson and pass it along to mongodb. Of course we would miss a lot of functionality very important for good performance (creating indexes, cursors for iterating over long query results, batch operations to name just a few), we would introduce some extra latency & complexity, and we would have to think about how our middleware would handle authentication.
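To make the idea concrete, here is a rough sketch of such a "pass the filter straight to MongoDB" resolver, using apollo-server and the mongodb driver. The collection/field names, the `find` query field and the way credentials are passed are illustrative assumptions, not the actual Helix setup.

```js
// Sketch of the "mongodb query lang over GraphQL" idea described above.
const { ApolloServer, gql } = require('apollo-server');
const { MongoClient } = require('mongodb');

const typeDefs = gql`
  type Query {
    # the filter is passed as a JSON string and handed to MongoDB as-is
    find(collection: String!, query: String!): [String]
  }
`;

const resolvers = {
  Query: {
    find: async (_root, { collection, query }, { db }) => {
      // parse the stringified filter and pass it straight to the driver
      const filter = JSON.parse(query);
      const docs = await db.collection(collection).find(filter).toArray();
      // return raw JSON strings; a real schema would expose typed fields
      return docs.map((d) => JSON.stringify(d));
    },
  },
};

async function start() {
  const client = await MongoClient.connect(process.env.MONGODB_URI);
  const server = new ApolloServer({
    typeDefs,
    resolvers,
    context: { db: client.db('helix') },
  });
  const { url } = await server.listen();
  console.log(`GraphQL endpoint ready at ${url}`);
}

start();
```

This illustrates the trade-off mentioned above: the GraphQL layer stays tiny, but indexes, cursors and auth all have to be handled around it.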
I've found something like this for postgres+graphql (https://github.com/graphile/postgraphile), but that one seems to implement quite a small subset of possible SQL queries…
@craeyu @trieloff What do you think about graphql as a db query lang? Is there a better way to do it :)
newest update - running graphQL with azure and postgres db. for simplicity reasons right now i am using an auto-incrementing integer as the primary key id
but if we decided to use a SQL instead of a NoSQL database, i think we should use a more meaningful key. @stefan-guggisberg suggested we can use the url of the md file. i plan to look into cosmos db today since it is NoSQL.
elloo! @koraa yup REST API and GraphQL serve the same purpose, just GraphQL has some small changes that potentially make the developer experience better in terms of building and consuming APIs. here are some of the differences and similarities: https://blog.apollographql.com/graphql-vs-rest-5d425123e34b
if we were to manually set up everything from scratch then at the end of the day we would need to write to and read from the db with queries. but we can leverage existing, well documented and supported gql libraries such as apollo and relay to avoid doing so ourselves.
we can build the regex with gatsby-node.js and then feed the whole regex to a simple gql query like this:
query MarkdownByAuthor($authorRegex: String!) {
  allMarkdowns(filter: {frontmatter: {author: {regex: $authorRegex}}}) {
    edges {
      node {
        frontmatter {
          title
          author
        }
      }
    }
  }
}
for the or query you are talking about, i am thinking we can just build a Query object that resolves on the results we get from both input variables.
gql still has good pagination support if performance is a concern: https://graphql.org/learn/pagination/
@koraa GraphQL is an ok query language, but that's not the point. It is popular and well-known and seems to be a good architectural fit [to be validated] for expressing selective reads over disparate data sets.
@craeyu this is looking good. A few suggestions:
markdowns sounds wrong to me, I'd prefer documents. Never heard someone use "Markdown" in plural.
There are two things I'd like to see in this branch of the exploration before we look at the second approach (AppSync + Parsing on demand):
- [ ] a query like "get all files in the preview branch that have been last modified by user emmasmith, then get the markdown document corresponding to the file and check if the title has more than three words"
- [ ] population of the database through the Helix Bot
@craeyu I spent some time exploring the bots, let me know if you need some pointers. To start with, I guess it's simpler for you to create your own github app, so it's easier to play around with. later we can integrate your code into the helix-bot.
@craeyu I have another use case for GraphQL queries that might be worth considering in your explorations: tag-based navigation of the content.
Taking https://sling.apache.org/tags/community.html as an example (because I built it, not because it's pretty ;-), the list (or cloud) of all tags found on the website is included in navigation pages, and each content page displays its tags as links that point to such navigation pages.
The tags are defined as Markdown front matter.
I suppose the navigation pages could be generated based on GraphQL queries. One potential problem is that with Helix Strains you can aggregate multiple sources of content in a single website; we'd need to take that into account.
https://github.com/adobe/helix-example-advanced/issues/5 has more details, if you think such a use case is useful for your project and demos.
i cooked up a simple github graphQL query that demonstrates some features github gql has but it doesn't have the level of granularity we desire yet (whether that is possible is still under investigation):
query getCommitFilesAndContent($owner: String!, $repoName: String!, $commitHash: String!, $targetBranch: String!, $targetFile: String!) {
  repository(owner: $owner, name: $repoName) {
    # look up a specific commit based on the commit hash
    commit: object(expression: $commitHash) {
      ... on Commit {
        oid
        author {
          date
          email
          name
        }
      }
    }
    # look up all the files in a directory
    files: object(expression: $targetBranch) {
      ... on Tree {
        entries {
          name
        }
      }
    }
    # plain GQL does not support string and variable concatenation and interpolation
    content: object(expression: $targetFile) {
      ... on Blob {
        text
      }
    }
  }
}
together with these variables:
{
  "owner": "craeyu",
  "repoName": "craeyu.github.io",
  "commitHash": "58050cb996440859dfe9950a8a8f381124331745",
  "targetBranch": "constance/githubgql:",
  "targetFile": "constance/githubgql:index.htm"
}
returns the following result:
{
  "data": {
    "repository": {
      "commit": {
        "oid": "58050cb996440859dfe9950a8a8f381124331745",
        "author": {
          "date": "2019-07-10T17:14:42.000-07:00",
          "email": "faithwyu@gmail.com",
          "name": "constance yu"
        }
      },
      "files": {
        "entries": [
          { "name": "css" },
          { "name": "favicon.ico" },
          { "name": "fonts" },
          { "name": "images" },
          { "name": "index.htm" },
          { "name": "js" },
          ....
        ]
      },
      "content": {
        "text": "<!DOCTYPE html>\n<html lang=\"en\"><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n\n<meta http-equiv=\"X-UA-Compatible\" content=\"IE=Edge\">\n<title>Constance Yu</title>\n<link rel=\"stylesheet\" href=\"./css/magnific-popup.css\">\n<link rel=\"stylesheet\" href=\"./css/font-awesome.min.css\">\n\n<link href=\"./css/css\" rel=\"stylesheet\" type=\"text/css\">\n\n<!-- main css -->\n<link rel=\"stylesheet\" href=\"./css/style.css\">\n\n<!-- favicon -->\n<link rel=\"icon\" type=\"image/png\" href=\"./favicon.ico\">\n\n</head>
        ......
        </html>"
      }
    }
  }
}
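For reference, a rough sketch of how such a query could be issued from Node against the GitHub GraphQL v4 endpoint; the token comes from an environment variable and node-fetch is assumed, with the query shortened to the commit lookup:

```js
// Minimal sketch of running a query against the GitHub GraphQL v4 endpoint.
const fetch = require('node-fetch');

const query = `
  query getCommit($owner: String!, $repoName: String!, $commitHash: String!) {
    repository(owner: $owner, name: $repoName) {
      commit: object(expression: $commitHash) {
        ... on Commit { oid }
      }
    }
  }`;

async function run() {
  const res = await fetch('https://api.github.com/graphql', {
    method: 'POST',
    headers: {
      Authorization: `bearer ${process.env.GITHUB_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query,
      variables: {
        owner: 'craeyu',
        repoName: 'craeyu.github.io',
        commitHash: '58050cb996440859dfe9950a8a8f381124331745',
      },
    }),
  });
  console.log(JSON.stringify(await res.json(), null, 2));
}

run();
```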
here is another query that queries all the branches i was playing with (saw ferrum haha @koraa)
a few things to notice:
after debriefing with @davidnuescheler, for now we are trying to use the azure + postgreSQL model to perform a simple task of scanning a github repository to extract all its markdown files, parse them to get their titles (currently defined as # TITLE), and populate the postgreSQL database. for now it is a simple node.js application that is named helix-scanner.
originally we were also interested in querying the files by last modification time, but the immediate issue is that git does not keep track of file modification date directly, only commit timestamps. (extra readings for those who are interested)
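One way to approximate a "last modified" time, assuming a local clone of the repository, is to take the timestamp of the most recent commit that touched the file; a small sketch:

```js
// Sketch: approximate a file's "last modified" time by the timestamp of the
// most recent commit that touched it. Assumes the repository is cloned locally.
const { execFile } = require('child_process');

function lastModified(repoDir, filePath) {
  return new Promise((resolve, reject) => {
    execFile(
      'git',
      ['log', '-1', '--format=%cI', '--', filePath],
      { cwd: repoDir },
      (err, stdout) => {
        if (err) return reject(err);
        // %cI is the committer date in strict ISO 8601 format
        resolve(stdout.trim() ? new Date(stdout.trim()) : null);
      }
    );
  });
}

// usage: lastModified('/tmp/helix-home', 'hackathons/5-bsl.md').then(console.log);
```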
if you are interested in testing it locally you can ask me for my tokens and passwords. please don't make it public yet since i started using environment variables halfway through and people can see the sensitive info in the commits, thx!
after that we think we are going to look into the idea @bdelacretaz mentioned, sounds really interesting!
newest progress update: the nodejs app can take in --owner repo/owner/name --repo repo/name --path path/name to get the titles of any repo's markdown files in the specified path. it does not recursively parse through the directory trees for now.
Great progress! A question though: The url column stores the url to the document, I assume. Why did you store the GitHub API url rather than the raw protocol url?
hello @stefan-guggisberg, it is because when you want the content of github files and call the corresponding API with the file sha, the returned response only contains the api url rather than the html url. but i think it might be possible to use the html url, it would just need some tweaking
@craeyu The GitHub API is rate limited, i.e. it allows 5000 authenticated requests / 60 unauthenticated requests per client per hour. I guess that sooner or later you'll run into those limits.
The GitHub raw protocol OTOH is not rate limited. It's not officially documented but widely used. The urls follow a simple structure. When you have the owner, repo name, path and reference (branch or tag name) you can easily build the url. See https://github.com/adobe/git-server#1-git-raw-protocol for examples.
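A tiny helper that builds such a url from its parts, following the structure described above:

```js
// Build a GitHub raw URL from its parts; ref can be a branch name, tag name or commit SHA.
function rawUrl(owner, repo, ref, path) {
  return `https://raw.githubusercontent.com/${owner}/${repo}/${ref}/${path}`;
}

// e.g. rawUrl('tripodsan', 'hlxtest', 'master', 'index.md')
// -> https://raw.githubusercontent.com/tripodsan/hlxtest/master/index.md
```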
@stefan-guggisberg ah i see - that makes things much easier. thank you! and yes already ran into the rate limit issue even as an authenticated user 😥
so I think the problem is to find a stable url for the file. the contents api returns:
$ curl -s https://api.github.com/repos/tripodsan/hlxtest/contents/index.md
{
  "name": "index.md",
  "path": "index.md",
  "sha": "e8ccf96a926b0e59e21626ccb8563c231e1f1e97",
  "size": 1276,
  "url": "https://api.github.com/repos/tripodsan/hlxtest/contents/index.md?ref=master",
  "html_url": "https://github.com/tripodsan/hlxtest/blob/master/index.md",
  "git_url": "https://api.github.com/repos/tripodsan/hlxtest/git/blobs/e8ccf96a926b0e59e21626ccb8563c231e1f1e97",
  "download_url": "https://raw.githubusercontent.com/tripodsan/hlxtest/master/index.md",
  "type": "file",
  "content": "...",
  "encoding": "base64",
  "_links": {
    "self": "https://api.github.com/repos/tripodsan/hlxtest/contents/index.md?ref=master",
    "git": "https://api.github.com/repos/tripodsan/hlxtest/git/blobs/e8ccf96a926b0e59e21626ccb8563c231e1f1e97",
    "html": "https://github.com/tripodsan/hlxtest/blob/master/index.md"
  }
}
so the blob-id is the SHA of the actual blob. there is IMO no way to retrieve the blob via the SHA from the raw api.
but you could use the SHA of the commit of the file, by using the git-resolve-ref action or similar. eg:
$ curl "https://adobeioruntime.net/api/v1/web/helix/helix-services/resolve-git-ref@v1?owner=tripodsan&repo=hlxtest&ref=master"
{
  "fqRef": "refs/heads/master",
  "sha": "0eec03399af02e2418316114dfc35710c463e102"
}
$ curl -s https://raw.githubusercontent.com/tripodsan/hlxtest/0eec03399af02e2418316114dfc35710c463e102/index.md
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
...
it would even be helpful to store the commit SHAs of the ref when the files were indexed.
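A small sketch of that flow, resolving a branch to its commit SHA via the resolve-git-ref service shown above and then fetching the file content at that SHA so the SHA can be stored alongside the indexed entry (node-fetch is assumed):

```js
// Resolve a branch to a commit SHA, then fetch the file content at that SHA.
const fetch = require('node-fetch');

async function fetchAtRef(owner, repo, ref, path) {
  const resolved = await fetch(
    `https://adobeioruntime.net/api/v1/web/helix/helix-services/resolve-git-ref@v1?owner=${owner}&repo=${repo}&ref=${ref}`
  );
  const { sha } = await resolved.json();
  const raw = await fetch(
    `https://raw.githubusercontent.com/${owner}/${repo}/${sha}/${path}`
  );
  return { sha, content: await raw.text() };
}

// e.g. fetchAtRef('tripodsan', 'hlxtest', 'master', 'index.md')
```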
i think that stability or other transaction-like behavior is neither desired nor required.
in my mind the index/table is an incomplete, best effort representation of a subset of the content in md.
it is acceptable that the index is wildly out of sync with the content (imagine a change in the index config which requires a reindex of a large repo) and i can't come up with any use case that would use anything but the HEAD of a branch for a query.
@craeyu I spoke with @davidnuescheler today, and we had an idea on how to handle the index definitions.
we could use the pipeline to render the table entries for each index. for example, a script like json_idx.pre.js would kick in when you request a resource like /readme.idx.json. since it runs the normal pipeline, you'd get the dom representation of the markdown for free.
the response would be something like:
{
  "url": "https://raw.githubusercontent.com/tripodsan/hlxtest/master/index.md",
  "tables": [{
    "name": "basic",
    "entries": {
      "title": "Hello Project",
      "description": "This is my hello projects."
    }
  }, {
    "name": "authors",
    "entries": {
      "title": "Hello Project",
      "authors": ["toby", "david"]
    }
  }]
}
so you could use the response to update your database.
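A very rough sketch of what such a json_idx.pre.js could look like; the hook signature and the field names (context.content.document as the rendered DOM, context.response.body as the output) are assumptions and would need to be checked against the actual helix-pipeline API:

```js
// json_idx.pre.js - very rough sketch, not the actual helix-pipeline API.
module.exports.pre = (context) => {
  const doc = context.content.document; // assumed: rendered DOM of the markdown
  const h1 = doc.querySelector('h1');
  const p = doc.querySelector('p');

  context.response = {
    body: {
      tables: [
        {
          name: 'basic',
          entries: {
            title: h1 ? h1.textContent : '',
            description: p ? p.textContent : '',
          },
        },
      ],
    },
  };
};
```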
@tripodsan @davidnuescheler ah makes sense. good idea!
i actually think that the url is probably not really needed, as it would be the base path (between the host and the first .) of the request url.
for more complex pipelines i think the connection to the original .md is probably less relevant...
so a request to:
https://helix-home-adobe.project-helix.page/hackathons/5-bsl.idx.json
would produce...
{
  "tables": [{
    "name": "basic",
    "entries": {
      "title": "Helix Hackathon Part V",
      "description": "May 20-24 2019 - Basel, Switzerland"
    }
  }, {
    "name": "authors",
    "entries": {
      "title": "Helix Hackathon Part V",
      "authors": "@rofe @trieloff @stefan-guggisberg @koraa @davidnuescheler @anfibiacreativa @meryllblanchet @royfielding @tripodsan @simonwex @lkrapf @kgera @dominique-pfister @bdelacretaz @asanso @kamendola @acollign @kptdobe"
    }
  }]
}
and would in turn create tables/rows that look like this
| path | title | description |
|---|---|---|
| /hackathons/5-bsl | Helix Hackathon Part V | May 20-24 2019 - Basel, Switzerland |

| path | title | authors |
|---|---|---|
| /hackathons/5-bsl | Helix Hackathon Part V | @rofe @trieloff @stefan-guggisberg @koraa @davidnuescheler @anfibiacreativa @meryllblanchet @royfielding @tripodsan @simonwex @lkrapf @kgera @dominique-pfister @bdelacretaz @asanso @kamendola @acollign @kptdobe |
@davidnuescheler cool. yes if that is the case, i agree. we can use path as primary key.
currently just finished making the request recursive and moved the app into my local forked helix-home, doing some adjustments right now
when using the app with the root as path:
i actually think that the url is probably not really needed, as it would be the base path (between the host and the first .) of the request url.
ok, but having the path present in the JSON is IMO a good idea anyways. so you don't need to guess the id from the request, and you don't need to remember it, in case you transfer the response to another action.
also, the location of the original document is IMO important as well. in case, the backend wants to do some further processing.
if you look at the response from the github API, they include a bunch of back-references to the document itself and further relations. I think this is a good practice.
moved the app into local forked helix-home, doing some adjustments right now
@davidnuescheler should we create a dedicated repo for this?
With the .idx.json ending and the SHA (to keep it cachable) we might be able to circumvent the database entirely, which can be useful in the case of schema updates (just add a field to idx.json and you are done, no re-indexing needed)
ok, but having the path present in the JSON is IMO a good idea anyways. so you don't need to guess the id from the request, and you don't need to remember it, in case you transfer the response to another action.
ok agreed.
also, the location of the original document is IMO important as well. in case, the backend wants to do some further processing.
i think that's semantically wrong, because we are not indexing the .md anymore but really the resulting DOM or HTML output, which may at that stage contain a lot of information from other sources.
@davidnuescheler should we create a dedicated repo for this?
at some point yes, but at this point helix-home just serves as a sample content repo, and the easiest way to add the .idx stuff is by putting this into the src to make development really easy...
latest update - able to issue requests using a helix-pipeline-like idx_html.pre.js to process markdowns recursively through the github repo. so far all of this has been integrated with the node js app.
next steps are to make the db queries smarter (e.g. when a table with the given name does not exist yet, create the table then populate the data) and add capabilities to handle more complicated table schema entries. excited to make idx_html.pre.js more powerful! :>
update:
scanner app can create new tables and adjust columns dynamically depending on the json input received. when a new table name shows up, a bare-bones default table will be created. it will always have path as its primary key and then dynamically add columns as new entries come in
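A rough sketch of that dynamic-table behaviour using node-postgres; table and column names come from the idx.json response and are treated as trusted identifiers here (they are not end-user input), hence the simple quoting, and the connection details are placeholders:

```js
// Sketch of dynamic table handling: create-on-first-sight, add columns, upsert by path.
const { Client } = require('pg');

async function storeEntries(client, table, path, entries) {
  // bare-bones table with path as primary key, created the first time the name shows up
  await client.query(`CREATE TABLE IF NOT EXISTS "${table}" (path TEXT PRIMARY KEY)`);

  const columns = Object.keys(entries);
  for (const col of columns) {
    // add missing columns as new entry fields appear
    await client.query(`ALTER TABLE "${table}" ADD COLUMN IF NOT EXISTS "${col}" TEXT`);
  }

  // upsert the row keyed by path
  const cols = columns.map((c) => `"${c}"`).join(', ');
  const params = columns.map((_, i) => `$${i + 2}`).join(', ');
  const updates = columns.map((c) => `"${c}" = EXCLUDED."${c}"`).join(', ') || 'path = EXCLUDED.path';
  await client.query(
    `INSERT INTO "${table}" (path${cols ? ', ' + cols : ''})
     VALUES ($1${params ? ', ' + params : ''})
     ON CONFLICT (path) DO UPDATE SET ${updates}`,
    [path, ...columns.map((c) => String(entries[c]))]
  );
}

async function demo() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  await storeEntries(client, 'basic', '/hackathons/5-bsl', {
    title: 'Helix Hackathon Part V',
    description: 'May 20-24 2019 - Basel, Switzerland',
  });
  await client.end();
}

demo();
```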
question: if the client uses querySelectorAll to select multiple elements from the document, how should we represent the result in the postgresQL database? use an Array for the multivalued field type? when do we determine what is appropriate for an array type? e.g. 1 title per file seems like a convention
stumbled upon the postgresql json (credit: @davidnuescheler) and JSONB types. it is pretty interesting. PostgresQL supports them and native queries support operators for JSON objects: -> returns an object field as JSON so that operators can be chained, and ->> returns it as text. the thing that is slightly inconvenient is the GraphQL layer, since once we commit to a JSON object it is a blob and we can no longer use a clean gql query to get subfields like this:
{
  documents {
    entries {
      title
      images {
        title
      }
    }
  }
}
i will look into the PostgresQL <=> GraphQL resolvers more to see if there is a workaround. for now we can either query all the content and parse it ourselves later on, or only select one field we are interested in by specifying the path and changing it with '.'.
selecting all content
querying specific field
the idx_html.pre.js description field is now working! it uses the selection criteria .title .header p. interesting that the header gave me back the hamburger 🍔
one workaround for querying JSON objects through subfields is to create a new VIEW that has the subfields you desire. for example:
CREATE VIEW title_listing AS
SELECT
  path,
  entries->>'title' AS title
FROM
  simple;
then you can write a GraphQL query to query the subfield title now:
another new feature: using the keywords where and _contains, users can create queries tailored to get all documents containing the desired information, e.g. all documents with the title Getting Started with Helix
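A rough sketch of how such a where/_contains filter could map onto a JSONB containment query on the simple table; the shape of the where argument is an assumption for illustration, and the actual filter syntax in the scanner may differ:

```js
// Translate a where/_contains filter into a JSONB containment query.
// `client` is a connected node-postgres client (see the storeEntries sketch above).
async function documentsWhere(client, where) {
  // e.g. where = { title: { _contains: 'Getting Started with Helix' } }
  const [field, { _contains: value }] = Object.entries(where)[0];
  const { rows } = await client.query(
    // entries @> $1 matches rows whose JSONB contains the given key/value pair
    `SELECT path, entries FROM simple WHERE entries @> $1::jsonb`,
    [JSON.stringify({ [field]: value })]
  );
  return rows;
}
```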
created a standalone bot called helix-scanner-bot that listens to git push events and consults the helix-rendered DOM for metadata. currently working on my personal repo craeyu/helix-home and the next step is db integration. still blocked on azure provisioning so no access to the database - not entirely sure if the db connection will work with the bot yet
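For orientation, a minimal sketch of such a push listener, assuming a Probot-style GitHub App; the idx.json fetch and storage steps are placeholders:

```js
// Sketch of a push listener for the scanner bot (Probot-style app assumed).
module.exports = (app) => {
  app.on('push', async (context) => {
    const { repository, commits, ref } = context.payload;
    // collect markdown files touched by this push
    const changed = new Set();
    for (const c of commits) {
      [...c.added, ...c.modified]
        .filter((f) => f.endsWith('.md'))
        .forEach((f) => changed.add(f));
    }
    for (const file of changed) {
      // placeholder: fetch the rendered idx.json for this document and store it
      app.log.info(`would index ${repository.full_name}/${file} at ${ref}`);
    }
  });
};
```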
database issue has been solved (yay! 🎉) and confirmed that bot can use credentials to access it.
currently translating the node.js app code into bot code. finished the functionality of parsing and storing data in a JSONB-type field with path as the primary key.
tomorrow i will have a PR up for adobe/helix-home that contains my idx_html.pre.js with the content extractions. i will also refactor the code into async/await style and, along the way, add a feature supporting storing data for the new content type img.
after discussion with @davidnuescheler @trieloff @tripodsan, we created a new pipeline named idx_json that returns a json object instead of stringifying the DOM data, and adds it to context.content.json.
currently the two data storage features have been fully ported to helix-bot on my branch constance/scanner-bot; the file is called scan.js (one feature creates just one simple json column and the other generates multiple columns based on the result received from the pipeline). the next item in action is to test this bot with the adobe/helix-home repo and refactor it completely using async/await instead of callbacks
now on adobe i/o runtime and working with adobe/helix-home - functionality-wise it is fulfilled! still going through the conversion to async/await (goodbye "callback hell", it was fun having you faithfully serve me... 😿)
here is a sequence diagram of my project so far (thanks for the help @tripodsan)
What's the status here? Moving out of "in progress" column for now.
I think we've found a viable alternative approach and can close this.
Problem
With the existing GitHub API (+ raw) we have an easy, RESTful way of accessing repository contents, which is largely sufficient for delivery in Helix. When it comes to content management (think AEM site admin), we need something more advanced to support queries like "all documents with Helix in the title". This requires not just a way to express structured queries, but also a way to inspect the contents of a markdown "file" and treat it as a helix "document", i.e. expose sections, metadata, embeds, links, external objects, etc.
When it comes to the query language, it seems like GraphQL is the way to go (and something we should explore). If we use GraphQL, we need a way to run a GraphQL gateway as a serverless application.
Note: GitHub does have a GraphQL API that allows us to run queries like "get all Markdown files authored by @davidnuescheler in the branch 2020-launch". It does not allow us to look into the structure of the Markdown files.
Approach 1
Until recently, the architecture I'd have used to approach this would look something like this:
- a webhook on the content repository that calls a function
The two problems with this approach:
- we need to install the webhook, so we cannot access arbitrary repositories
Approach 2
I learned about AWS AppSync a while ago but dismissed it because it seemed to be very mobile-focused; after a few conversations with people that have been using it, I think it can be a viable alternative. AWS AppSync is a GraphQL gateway as a service (with a full serverless pricing model, so no paying for idle) and full extensibility (you can define custom resolvers as AWS Lambda functions).
As a result, the final architecture would look something like this:
- a resolver function fetches the file from raw, parses it and returns the key metadata structure (this could use a variation of the JSON pipeline)
Key aspects/challenges of this approach:
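As a rough illustration of the resolver step described above, a minimal Lambda sketch; it assumes the field arguments arrive in event.arguments (AppSync's direct Lambda resolver convention) and reduces the "parsing" to a title lookup for brevity:

```js
// Minimal sketch of a Lambda-backed AppSync resolver for Approach 2.
const https = require('https');

function get(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

exports.handler = async (event) => {
  const { owner, repo, ref, path } = event.arguments;
  const md = await get(
    `https://raw.githubusercontent.com/${owner}/${repo}/${ref}/${path}`
  );
  // naive metadata extraction: first level-1 heading as the title
  const match = md.match(/^#\s+(.*)$/m);
  return {
    path,
    title: match ? match[1].trim() : null,
  };
};
```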
I suggest we try both approaches and compare which one works best for our use cases.