asyncapi / website

AsyncAPI specification website
https://www.asyncapi.com
Apache License 2.0

Measuring AsyncAPI spec adoption #780

Open derberg opened 2 years ago

derberg commented 2 years ago

Reason/Context

We do not know how many people use AsyncAPI. The most accurate number we could get is the number of AsyncAPI users who work with AsyncAPI documents. But how do we measure how many people out there have created or edited an AsyncAPI file?

The answer is a solution that includes:

(screenshot attachment: 20210528_114537)

Some more discussion -> https://asyncapi.slack.com/archives/C0230UAM6R3/p1622198311005900

Description

  1. Create a new endpoint in the server-api service that anyone can use to fetch AsyncAPI JSON Schema files of any version
  2. The JSON Schemas live in https://github.com/asyncapi/spec-json-schemas and can be used as a normal dependency
  3. Whenever a JSON Schema file is fetched by a user, that information should be stored somewhere. I propose Google Tag Manager, as we already have it for the website; we can send data there and then easily read it. I'm all ears if there is something better and still free
  4. Add an AsyncAPI config to SchemaStore, plus automation on the AsyncAPI side that always opens a PR against SchemaStore with the location of the JSON Schema for each new AsyncAPI spec version
  5. Update the docs and instructions for users on how to configure their IDE properly and how to name files. Update the official examples

If there is time left, we need to expose the numbers somewhere: either embed a Google Analytics chart on the AsyncAPI website, or at least have an API endpoint that exposes the latest numbers.

For GSoC participants

github-actions[bot] commented 2 years ago

Welcome to AsyncAPI. Thanks a lot for reporting your first issue. Please check out our contributors guide and the instructions about a basic recommended setup useful for opening a pull request.
Keep in mind there are also other channels you can use to interact with AsyncAPI community. For more details check out this issue.

ritik307 commented 2 years ago

@derberg Sounds interesting ... I would like to take this issue as my GSOC'22 proposal.😊

derberg commented 2 years ago

@ritik307 sounds awesome!

@smoya @BOLT04 @magicmatatjahu any objections to having this endpoint first in server-api? I personally think it's better to add it here, and then if we measure too much traffic, we can always split it into a separate microservice

magicmatatjahu commented 2 years ago

No problem for me, but we have to remember that we also provide that project as a Docker image, so people will also have that path. We have to think about how to avoid unnecessary paths for people who use that project.

BOLT04 commented 2 years ago

any objections to having this endpoint first in server-api? I personally think it's better to add it here, and then if we measure too much traffic, we can always split it into a separate microservice

@derberg no problem for me 🙂, this is pretty cool!

No problem for me, but we have to remember that we also provide that project as a Docker image, so people will also have that path. We have to think about how to avoid unnecessary paths for people who use that project.

@magicmatatjahu I get what you're saying and if this new endpoint does in fact need to use external services (e.g. Google APIs), we would need new config/environment variables for API keys, etc. I propose we use feature flags to solve this. On our deployed version of the API the feature is on, but for local development it's not. If someone wants to try it out locally, they just have to configure the necessary values and turn the toggle on to start measuring spec adoption in their own environment 🙂 What do you think?
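A feature-flag gate like the one described above could be as simple as an environment-variable check. A minimal sketch, assuming a hypothetical `METRICS_ENABLED` variable and event shape (neither is actual server-api config):

```javascript
// Hypothetical feature flag guarding the metrics code path: on in the
// deployed API, off by default for local/Docker users so they never hit
// external services unless they opt in.
function metricsEnabled(env) {
  return env.METRICS_ENABLED === 'true';
}

// Returns true when the event was forwarded, false when the flag is off.
function maybeTrackDownload(env, file, send) {
  if (!metricsEnabled(env)) return false;
  send({ event: 'schema_download', file });
  return true;
}
```

With this shape, the deployed environment sets the flag and the API keys, while a self-hosted Docker image simply leaves them unset.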

smoya commented 2 years ago

I love the idea of adding our schemas to Schema Store. I didn't know about it until 5 minutes ago and I like it a lot 👍.

I want to add some feedback regarding the creation of a service for serving the schemas:

Serving static files such as JSON Schema files in a fast and reliable way is exactly why CDNs exist. Considering the possible amount of traffic this service will have, and the fact that it will keep growing over time (more users, more tooling, etc.), I would not advocate creating and maintaining this service ourselves. In fact, the good news is that we are already using a free CDN for serving our website: Netlify (which runs on multiple cloud providers).

I understand we want this service because we need those metrics (maybe there is another strong reason I missed, so please correct me). Would it make sense to just investigate and ask/pay for their Analytics product?

There is also the following draft PR by @jonaslagoni: https://github.com/asyncapi/website/pull/502, which might be worth checking. It aims to serve AsyncAPI JSON Schema files from our website.

Alternatively, we could consider the same approach with any other CDN product that offers analytics, such as AWS S3, GCP Cloud Storage, etc. (asking for budget, etc.).

I would like to know your thoughts.

cc @derberg @ritik307 @magicmatatjahu @BOLT04

derberg commented 2 years ago

Cool. I think the most important thing is that you support the idea. It is not written in stone that it has to be an endpoint here.

@magicmatatjahu @smoya @BOLT04 please only keep in mind that we should leave as much as possible up to @ritik307 (if you still want to take this task for GSoC). You folks turn into mentors, just guide @ritik307 what needs to be checked and tried out to get the desired outcome.

ritik307 commented 2 years ago

@magicmatatjahu @smoya @BOLT04 please only keep in mind that we should leave as much as possible up to @ritik307 (if you still want to take this task for GSoC). You folks turn into mentors, just guide @ritik307 what needs to be checked and tried out to get the desired outcome.

Sure @derberg I would love to take this task for GSOC 😊 and it would be great if you guys mentor me.😊

magicmatatjahu commented 2 years ago

I think that file hosting and adding metrics to Server API itself will not be a problem. We have control over every part, so we won't have to use additional services. However, a CDN would be better in this case, and I am for this option!

smoya commented 2 years ago

I would like to retake this, especially after @derberg raised concerns in https://github.com/asyncapi/website/pull/502#issuecomment-1070960957.

There is something we should consider before moving forward with a custom solution using our own service. Right now we do not have services openly exposed for consumption at the frequency static JSON Schema files would be. Exposing a service with such an important responsibility (it would serve our JSON Schema files!) should include a large battery of APM + infrastructure monitoring, and maybe in the future (emphasis on future) even some on-call rotation. It might seem far off today, but if our user adoption keeps growing as it does, it will become a thing.

With a CDN provided by a SaaS company, you remove all of those concerns.

Again, I know we want some metrics, but IMHO it is totally worth asking Netlify and, if it fulfills our goals, paying if needed for their analytics service. I can tell you it is worth paying for a service rather than having to run your own highly available one.

derberg commented 2 years ago

regarding @smoya point about maybe using Netlify Analytics in combination with CDN. This is also one of the possible options. Some investigation for sure needs to be done first. This can definitely be an outcome for this task. I personally prefer CDN, just ignored the fact that Netlify might have some Analytics for it.

This is definitely what I prefer since you mentioned Netlify Analytics. Did you mean this https://docs.netlify.com/monitor-sites/analytics/ or something else?

smoya commented 2 years ago

regarding @smoya point about maybe using Netlify Analytics in combination with CDN. This is also one of the possible options. Some investigation for sure needs to be done first. This can definitely be an outcome for this task. I personally prefer CDN, just ignored the fact that Netlify might have some Analytics for it.

This is definitely what I prefer since you mentioned Netlify Analytics. Did you mean this https://docs.netlify.com/monitor-sites/analytics/ or something else?

Yup, this is the service I meant.

Netlify Analytics is available and ready, right in the dashboard, for any site you deploy to Netlify. It only costs $9/mo per site. Source: https://www.netlify.com/products/analytics/

derberg commented 2 years ago

some important info: https://github.com/asyncapi/website/pull/502#issuecomment-1088363548

smoya commented 2 years ago

some important info: asyncapi/website#502 (comment)

IMHO we should stay with the option of having all JSON files in server-api, which would work like a proxy to do analytics. It is up to the server-api maintainers to decide if it is OK to do it first in server-api and then split it later because of load. Nevertheless, IMHO the JSON files should not be exposed directly on the website here, as we are looking for an opportunity to track adoption.

TL;DR: I still think we should avoid creating a new file server app and instead look for an alternative based on a SaaS provider. I'm suggesting some alternative ideas to the previous one, and I'm happy to keep evolving this idea and also to put it into practice ASAP.

I understand the need to get such metrics and how simple it seems to build a file server with built-in metrics. However, I want to stay strong on this idea: we should avoid managing services on our own (at this time). Some of the reasons have been exposed already in (my) previous comments, but I'm going to list some of them here in a bit more detail.

AsyncAPI JSON Schema definitions are the most important pieces of software we provide to the community (IMHO). They are meant to be used by systems that parse and validate AsyncAPI documents, and by services that use them at runtime to validate messages, among other use cases. We do have a package for both NodeJS and Go projects that users can use to import those schemas into their projects; however, we don't have one for any other language, meaning tooling will need to fetch those files from the source at some point.

However, who are the users of those raw files, and how do they use them? I can imagine a few use cases:

With this in mind, the following points are worth noting:

Having said that, I'm proposing we stick with a SaaS-based solution from day one, one that lets us take care of only the very minimum: at most, collecting the metrics and processing them, but never serving the files.

We tried with Netlify Analytics. Unfortunately, the metrics we want (hits on JSON Schema files) are not collected. Even though it is probably only a matter of time until they support it, we don't have an ETA for it.

There are several other ways we can do this, and those are some of the ideas I have in mind:

Netlify Log Drains

Netlify Log Drains allow sending both traffic logs and function logs to an external service, such as New Relic, Datadog, S3... and also to our own service (which could be a Netlify Function as well). Netlify sends those logs in batches in near real-time. Logs are in JSON/NDJSON format; you can see their output here. This is not available in all plans, but I'm sure the Netlify support team will be happy to enable it, especially now that we tried Analytics and it didn't fulfill our use case.

sequenceDiagram
    participant User
    participant asyncapi.org (Netlify)
    participant AsyncAPI Metrics collector
    Note right of AsyncAPI Metrics collector: Netlify Function <br/>or<br/> any monitoring SaaS

    User->>asyncapi.org (Netlify): https://asyncapi.org/definitions/2.3.0.json
    asyncapi.org (Netlify)->>User: 2.3.0.json
    asyncapi.org (Netlify)-->>AsyncAPI Metrics collector: Netlify Log Drains metrics

With this approach, even in the most complex variant, we only care about the metrics collector service, which could eventually go down without affecting user requests. If we use a SaaS, it will be straightforward. As a side note, there are free tiers in services like New Relic that might fit our case.
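A collector consuming those Log Drain batches could reduce them to per-file hit counts with very little code. A sketch, assuming NDJSON entries carry `url` and `status` fields (check the actual log format before relying on this):

```javascript
// Count successful hits on /definitions/*.json from a batch of
// NDJSON-formatted Netlify traffic logs (field names assumed).
function countSchemaHits(ndjsonBatch) {
  const counts = {};
  for (const line of ndjsonBatch.split('\n')) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line);
    const match = /^\/definitions\/(.+\.json)$/.exec(entry.url || '');
    if (entry.status === 200 && match) {
      counts[match[1]] = (counts[match[1]] || 0) + 1;
    }
  }
  return counts;
}
```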

Netlify Edge Handlers

Netlify Edge Handlers work by letting you execute code directly on the edge, intercepting the request. We could run JavaScript code there to collect the metrics we want; in our case, the hits on the definition files. This is in beta right now (you have to ask for it to be enabled). However, I would ask them for an ETA on going public; I guess they have plans to release it as a public beta in the short-to-mid term.

EDIT: Netlify Edge Functions are now public beta, available for free. https://www.netlify.com/blog/announcing-serverless-compute-with-edge-functions

Use AWS S3

AWS S3 is a well-known solution for storing files. And with the metrics they expose (Cloudwatch), we could know the number of get operations per file. We would need to add a Netlify rewrite rule (not a redirect) that proxies the requests to the S3 bucket. This is easy to configure through the netlify.toml file.

sequenceDiagram
    participant User
    participant asyncapi.org (Netlify)
    participant AWS S3

    User->>asyncapi.org (Netlify): https://asyncapi.org/definitions/2.3.0.json
    asyncapi.org (Netlify)->>AWS S3: Netlify rewrite rule to asyncapi.s3.amazonaws.com/definitions/2.3.0.json
    AWS S3->>asyncapi.org (Netlify): 2.3.0.json
    asyncapi.org (Netlify)->>User: 2.3.0.json
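The rewrite rule mentioned above could look roughly like this in netlify.toml (a sketch; the S3 bucket hostname is hypothetical):

```toml
# A rewrite (status 200 + force), not a redirect: the browser keeps the
# asyncapi.com URL while Netlify proxies the request to the S3 bucket.
[[redirects]]
  from = "/definitions/*"
  to = "https://asyncapi.s3.amazonaws.com/definitions/:splat"
  status = 200
  force = true
```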

The price for this is not terribly high. I did a quick estimation for 30 million requests per month (yeah, a lot) here. We should also include the price of the CloudWatch metrics, but IIRC it is almost nothing.

If price is a concern, we could investigate Cloudflare R2, which is super cheap. However, the metrics they provide are unknown to me at this moment. Also, we would need to ask for access to R2 as it is in Beta at this moment.

smoya commented 2 years ago

As of today, Netlify Edge Functions (previously known as Edge Handlers) are in public beta, available for free. https://www.netlify.com/blog/announcing-serverless-compute-with-edge-functions

smoya commented 2 years ago

With the following, we could add the metrics push into the Netlify function https://github.com/asyncapi/website/pull/680

derberg commented 2 years ago

Taking this one off GSoC as it is an important topic to handle and can't be delayed

derberg commented 2 years ago

How to start 😄 Lemme start with the positives ❤️

I love idea from https://github.com/asyncapi/website/pull/680

On the "negative" side, I have a completely different view on the maintenance/high-availability/response-time topics:

So, let's go forward with the idea from https://github.com/asyncapi/website/pull/680


Alternative/compromise: don't mix topics and try to solve everything with one solution. Maybe https://github.com/asyncapi/website/pull/680 could have 2 alternative paths, one for the needs related to AsyncAPI JSON Schema and $id, and the other that we use only in SchemaStore. One solution with separate paths, so the measured data stay clean. They would still depend on the same rate limits anyway, of course.

smoya commented 2 years ago

I've been playing with Google Analytics 4 as a candidate for publishing our metrics. I have to say, I didn't get a good result. We could send events through the Measurement Protocol and it would kind of do the job, but the UX for reading those metrics is completely awful:

1. In the whole realtime metrics, only a small rectangle including the events is present:

(screenshots: 2022-05-03 at 10 39 11, 2022-05-03 at 10 37 09)

2. The details are very hard to check (I added a param for the URL of the fetched file):

(screenshot: 2022-05-03 at 10 38 07)

As we can see, everything is focused on web apps, so it's not a really good fit for us. I know @derberg has played a lot with GA, Google Tag Manager, etc. Do you think it is still a fit for this, or should we rather consider an alternative?

smoya commented 2 years ago

I've been checking New Relic One's new free tier, and it allows sending up to 100 GB of data, events included. I did a simple test with a POST request and created a simple dashboard to see how it would look.

(screenshot: 2022-05-03 at 16 16 11@2x)

Btw, New Relic has NRQL, a custom query language that lets you easily query anything you send to them in a SQL-like fashion.
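For illustration, querying a hypothetical custom event type in NRQL reads like plain SQL (the event name `SchemaDownload` and attribute `file` are made up for this example):

```sql
-- Downloads per schema file over the last 30 days, as a NRQL query.
SELECT count(*) FROM SchemaDownload FACET file SINCE 30 days ago
```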

If anyone has another suggestion, I'm happy to keep investigating (there are plenty of others out there)

smoya commented 2 years ago

In the meantime, I'm moving forward with the New Relic solution for now; the development is all here: https://github.com/asyncapi/website/pull/680

In case you want to use another provider for metrics, I'm happy to adapt the code.

More on https://github.com/asyncapi/website/pull/680#issuecomment-1117469838

derberg commented 2 years ago

GA also allows you to create new views and custom components, with scheduled reports, etc. But yeah, I'm not a GA evangelist.

Tbh I think the approach with New Relic is super nifty, as long as we can use it for free of course 😆 I guess you @smoya and @fmvilas can anyway get us more free storage if we need 😆

❤️ from me for New Relic

Does it mean we have an agreement on the implementation? 🙌🏼 Before we finalize, we need to give @magicmatatjahu @jonaslagoni @BOLT04 @fmvilas time to voice their opinions, as they own either the website or this repo, or just need the solution (like Jonas)

BOLT04 commented 2 years ago

yeah, let's go with the New Relic solution proposed by @smoya 👍 I think with that the Server API doesn't need any implementation, so we could close this issue when that PR is merged, right?

wdyt everyone?

derberg commented 2 years ago

I think we can even transfer it to https://github.com/asyncapi/website now 🤔

jonaslagoni commented 2 years ago

Awesome @smoya 👍

magicmatatjahu commented 2 years ago

I am also in favour of a solution using the New Relic @smoya 👏🏼

fmvilas commented 2 years ago

Yeah me too. Let's use New Relic. They have a powerful query language (NRQL) and it's easy to create new views of data 👍

smoya commented 2 years ago

JSON Schemas are now being served successfully under asyncapi.com/definitions and asyncapi.com/schema-store. A New Relic dashboard has also been created (it can't be public, unfortunately):

(screenshot: 2022-06-21 at 08 04 17@2x)

PR to Schema-Store is waiting for review: https://github.com/SchemaStore/schemastore/pull/2310

cc @derberg @fmvilas

smoya commented 2 years ago

The JSON Schema Store PR has been merged now, meaning all JSON Schema files fetched from it are downloaded from asyncapi.com/schema-store, and metrics show that users are already fetching them:

(screenshot: metrics showing download counts of AsyncAPI JSON Schema files)

cc @derberg

derberg commented 2 years ago

Omg this is so exciting 😍

fmvilas commented 2 years ago

❤️ Indeed! @smoya start thinking how do we send custom metrics from tooling 😝

smoya commented 2 years ago

❤️ Indeed! @smoya start thinking how do we send custom metrics from tooling 😝

We would need to expose a service that acts as a metrics ingest, forwarding the metrics to NR, so we don't expose the NR API key in tooling but just send metrics to our service.

I would think about it eventually!
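A first sketch of such an ingest could be a thin function that validates the incoming event and builds the forwarding request with the server-side key, so tooling never sees it. The event shape here is hypothetical; the endpoint and headers follow New Relic's Metric API, but double-check them before relying on this:

```javascript
// Build the request our ingest service would send to New Relic on behalf
// of tooling. Only the service holds the API key; clients send bare events.
function buildForwardRequest(event, apiKey) {
  if (!event || typeof event.name !== 'string') {
    throw new Error('invalid metric event');
  }
  return {
    url: 'https://metric-api.newrelic.com/metric/v1',
    headers: { 'Api-Key': apiKey, 'Content-Type': 'application/json' },
    body: JSON.stringify([{
      metrics: [{
        name: event.name,
        type: 'count',
        value: 1,
        timestamp: Date.now(),
        attributes: event.attributes || {},
      }],
    }]),
  };
}
```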

smoya commented 2 years ago

After fixing https://github.com/asyncapi/spec-json-schemas/issues/236, JSON Schemas for different AsyncAPI versions are being downloaded from JSON Schema Store:

I see there are downloads for all versions, and I really doubt those downloads are organic or on purpose. What I think is happening is that, since the schema served by Schema Store is now https://github.com/asyncapi/spec-json-schemas/blob/master/schemas/all.schema-store.json, JSON Schema parsers might be downloading ALL referenced ($ref) schemas at once instead of on demand.
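That hypothesis is consistent with how eager $ref resolvers behave: they walk the combined schema and collect every external reference up front, fetching each one. A toy sketch of the collection step (the schema shape below is illustrative, not the real all.schema-store.json):

```javascript
// Gather every external $ref in a schema document, the way an eager
// resolver would before fetching each referenced file.
function collectExternalRefs(node, refs = new Set()) {
  if (Array.isArray(node)) {
    node.forEach(child => collectExternalRefs(child, refs));
  } else if (node && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      if (key === '$ref' && typeof value === 'string' && !value.startsWith('#')) {
        refs.add(value); // external reference: one fetch per entry
      } else {
        collectExternalRefs(value, refs);
      }
    }
  }
  return refs;
}
```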

However, VS Code is still showing the error:

(screenshot: 2022-07-13 at 17 49 23@2x)

which @derberg mentioned already in https://github.com/redhat-developer/vscode-yaml/discussions/772#discussioncomment-3033074. So 🤷 ...

cc @derberg @fmvilas

derberg commented 2 years ago

which @derberg mentioned already in https://github.com/redhat-developer/vscode-yaml/discussions/772#discussioncomment-3033074. So 🤷 ...

I think we entered the world where we have to decide if we want to do things in our JSON Schema the way spec and spec maintainers recommend, or just adjust schema to work with tooling provided by the community 🤷🏼

I see there are downloads from all versions, and I really doubt those downloads are organic or in purpose.

yeah, the numbers for 2.0.0-rc1 and 2.0.0-rc2 are suspiciously high and identical 😄 I think you are completely right that the cause is $ref resolution. Can we measure the number of times https://www.asyncapi.com/schema-store/all.schema-store.json is fetched and automatically subtract that number from the other downloads directly in the chart, without manual calculation? (kind of a hack, but I don't believe there is another solution)

smoya commented 2 years ago

Can we measure the number of times https://www.asyncapi.com/schema-store/all.schema-store.json is fetched and automatically subtract that number from other downloads directly in the chart, without manual calculation? (kinda hack but I don't believe there is some other solution)

But if we do that, we will be invalidating all the counts for legitimate downloads. Correct me if I'm wrong, but:

Considering that 1 fetch of all.schema-store.json ends up doing 10 fetches (one for the schema of each AsyncAPI version), let's say we start from scratch and do just one fetch:

Downloads File
1 all.schema-store.json
1 1.0.0.json
1 1.1.0.json
1 1.2.0.json
1 2.0.0-rc1.json
1 2.0.0-rc2.json
1 2.0.0.json
1 2.1.0.json
1 2.2.0.json
1 2.3.0.json
1 2.4.0.json

We can't just subtract 1 from each download, because this will end up happening:

Downloads File
1 all.schema-store.json
0 1.0.0.json
0 1.1.0.json
0 1.2.0.json
0 2.0.0-rc1.json
0 2.0.0-rc2.json
0 2.0.0.json
0 2.1.0.json
0 2.2.0.json
0 2.3.0.json
0 2.4.0.json
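The tables above can be replayed in a few lines: after subtracting the bundle count, every per-version counter reads zero, and because a ref-resolution download is indistinguishable from a direct one, legitimate direct downloads would be swallowed the same way (a sketch with toy numbers, not real metrics):

```javascript
// Reproduce the table above: one fetch of all.schema-store.json makes the
// resolver download every versioned schema exactly once.
const versions = ['1.0.0', '1.1.0', '1.2.0', '2.0.0-rc1', '2.0.0-rc2',
                  '2.0.0', '2.1.0', '2.2.0', '2.3.0', '2.4.0'];
const counts = { 'all.schema-store.json': 1 };
for (const v of versions) counts[`${v}.json`] = 1;

// The naive fix: subtract the bundle count from every versioned file.
function subtractBundle(allCounts) {
  const bundle = allCounts['all.schema-store.json'] || 0;
  const adjusted = { ...allCounts };
  for (const file of Object.keys(adjusted)) {
    if (file !== 'all.schema-store.json') {
      adjusted[file] = Math.max(0, adjusted[file] - bundle);
    }
  }
  return adjusted;
}
```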

derberg commented 2 years ago

@smoya yeah, you are right 🤦🏼 it sucks

derberg commented 2 years ago

@smoya so looks like we can only measure adoption of the spec in general, not its specific versions?

smoya commented 2 years ago

@smoya so looks like we can only measure adoption of the spec in general, not its specific versions?

Yes; as the IDE plugins download just one schema (containing all of the versions), we can't know which version they are using. And since Schema Store matching is based on file patterns rather than file content, there is no way we could send data in the request made to our servers (for example, a header including the version).

So unfortunately, I'm running out of ideas here. I could open an issue in Schema Store repo asking for ideas.

derberg commented 2 years ago

It is not that bad. For me, the most important thing is to measure how many users we have, so adoption of the spec in general, not of each version. I'm personally skeptical of such measurements, as people then complain that new versions are not adopted, forgetting that they themselves do not use new versions if they do not need them (anyway, not a topic for this issue).

If you can open a discussion with Schema Store on how to fix things in the future, that would be amazing. Even if I'm not interested in specific version adoption, I bet others are 😄

Can you adjust the dashboard in New Relic? 🙏🏼

So what is left is:

- dashboard adjustment
- persisting data for lifetime
- investigating how data collection actually works, caching of schema by plugins, and etag refresh on Netlify side. So we know if we actually get data of only "daily active users" or "increasing number of new users"

Am I missing something?

fmvilas commented 2 years ago

I will definitely be interested to know whether people really adopt version 3.0 once it's out. Would be cool to get some insights. Maybe it's time to measure it in our tools.

smoya commented 2 years ago

As even if I'm not interested with specific version adoption, I bet others are

I think it is a crucial metric, even though not the only way to collect data. I would love to have a metric where, after a release, we could see downloads for older versions go down in favor of the new one.

If you can open a discussion with Schema Store, on how to fix things in future, that would be amazing. As even if I'm not interested with specific version adoption, I bet others are 😄

Done. No hope at all anyway. https://github.com/SchemaStore/schemastore/issues/2440

dashboard adjustment

Do you mean removing the versions stuff from it?

persisting data for lifetime

Do we really need that? With New Relic, we have 1 year right now. If more is needed, we could write some scripts to do aggregations every few months.

investigating how data collection actually works, caching of schema by plugins, and etag refresh on Netlify side. So we know if we actually get data of only "daily active users" or "increasing number of new users"

Related: https://github.com/SchemaStore/schemastore/issues/2438

derberg commented 2 years ago

Do you mean removing the versions stuff from it?

yeah, until we get it solved, this metric is not helpful; we just need the total number

Do we really need that? With New Relic, we have 1 year right now. If more is needed, we could write some scripts to do aggregations every few months.

yes, we need lifetime data to see how the numbers change over the years. But I do not mean we need that support in New Relic. An automated script, maybe running on GitHub Actions on a schedule, is also fine 👍🏼
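A scheduled aggregation along those lines might look like the following GitHub Actions workflow (a sketch; the script path and secret name are assumptions, not existing repo files):

```yaml
# Hypothetical monthly job that snapshots the New Relic counts somewhere
# durable so the data outlives New Relic's 1-year retention.
name: Aggregate schema download metrics
on:
  schedule:
    - cron: '0 3 1 * *'  # 03:00 UTC on the 1st of every month
jobs:
  aggregate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: node scripts/aggregate-metrics.js  # hypothetical script
        env:
          NEW_RELIC_API_KEY: ${{ secrets.NEW_RELIC_API_KEY }}
```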

Related: https://github.com/SchemaStore/schemastore/issues/2438

yeah, not much help, other than knowing you can clear the cache on demand. The source code indicates it is based on the etag. What we need to check is what Netlify does when the website gets redeployed: whether the etag for all resources, even redirects, is refreshed or not. We are doing some magic there 😄

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity :sleeping:

It will be closed in 120 days if no further activity occurs. To unstale this issue, add a comment with a detailed explanation.

There can be many reasons why some specific issue has no activity. The most probable cause is lack of time, not lack of interest. AsyncAPI Initiative is a Linux Foundation project not owned by a single for-profit company. It is a community-driven initiative ruled under open governance model.

Let us figure out together how to push this issue forward. Connect with us through one of many communication channels we established here.

Thank you for your patience :heart:

smoya commented 9 months ago

@smoya so looks like we can only measure adoption of the spec in general, not its specific versions?

FYI, I gave it one last try, but didn't succeed 😞. All the info can be found at https://github.com/SchemaStore/schemastore/issues/2440#issuecomment-1857683852.

cc @derberg @fmvilas

derberg commented 9 months ago

overall adoption is still a great number to have 👍

smoya commented 7 months ago

FYI, I created https://github.com/SchemaStore/schemastore/issues/3460 as a feature request in Schema Store that, if adopted, will help us achieve our mission.

sambhavgupta0705 commented 4 months ago

@smoya may I get an update on this issue, please 😅

smoya commented 4 months ago

@smoya may I get an update on this issue, please 😅

What do you need to know in particular?

sambhavgupta0705 commented 4 months ago

What do you need to know in particular?

Like, will we be going forward with this issue or not?