endoflife-date / endoflife.date

Informative site with EoL dates of everything
https://endoflife.date
MIT License
2.31k stars 690 forks

Offline copy of data #2530

Open anthonyharrison opened 1 year ago

anthonyharrison commented 1 year ago

I really like the idea, but to avoid repeated API calls for every product I want data on, I would like to be able to maintain a local copy of the data and then only download updates each time I start my application (or after a particular time period, e.g. only request updates once every 24 hours).

Ideally, I would be able to get the data in JSON format which I can then manage locally.

An alternative would be to call the API for every product to get its data. But this would also require that I know all of the products in the first place, which, given the dynamic nature of the data, isn't very attractive.

welcome[bot] commented 1 year ago

Thank you for opening your first issue here :+1:. Be sure to follow the issue template if you chose one.

adriens commented 1 year ago

I'm actually working on something like that :smile_cat:

marcwrobel commented 1 year ago

Hi @anthonyharrison, thank you for the idea.

endoflife.date is using the static site generator Jekyll. Given the static nature of endoflife.date, that may be difficult to implement: JSON and HTML files are only generated when there is an update on the master branch.

captn3m0 commented 1 year ago

Would a dataset published via a NPM package be good enough? Or a separate git repository that could fulfill the "update whenever needed" requirement easily?

I've been wanting to do this for a while, by means of uploading the generated JSON files (preferably in the v1 API format) to a release on GitHub.

But this would also require that I know all of the products in the first place

As an aside, we have an endpoint that solves this: https://endoflife.date/api/all.json.

@adriens Could you detail your plan to solve for this?

anthonyharrison commented 1 year ago

The API endpoint is a good start. Just getting a download of all of the data in JSON would be very useful. Finding out what has changed since the last download could be done in a number of ways. The simplest is to ask whether there have been any changes since a particular date, and if so download all the data again. The more elegant but slightly more complex way would be to download only the changes since a particular date rather than all the data. But given that the current amount of data isn't huge, I imagine the first solution would be a good start.

I would rather not force the introduction of a new ecosystem (npm).
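The simpler of the two strategies above (re-download everything, but at most once every 24 hours) could be sketched as follows. This is only an illustration: the cache file name and helper names are hypothetical, and the only endpoint used is the `all.json` URL mentioned earlier in this thread.

```python
import json
import time
from pathlib import Path
from urllib.request import urlopen

ALL_PRODUCTS_URL = "https://endoflife.date/api/all.json"  # endpoint mentioned in this thread
CACHE_FILE = Path("eol_products.json")  # hypothetical local cache path
MAX_AGE_SECONDS = 24 * 60 * 60          # refresh at most once every 24 hours


def is_stale(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the cache file is missing or older than max_age seconds."""
    if not path.exists():
        return True
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) > max_age


def fetch_json(url):
    """Download and decode a JSON document."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def load_products(path=CACHE_FILE, fetch=fetch_json):
    """Return the cached data, refreshing the cache first if it is stale."""
    if is_stale(path):
        path.write_text(json.dumps(fetch(ALL_PRODUCTS_URL)))
    return json.loads(path.read_text())
```

Calling `load_products()` at application start-up would then touch the network only when the local copy is missing or more than a day old.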

adriens commented 1 year ago

Maybe would you appreciate this repo : https://github.com/adriens/endoflife.date-nested

anthonyharrison commented 1 year ago

@adriens I can certainly use this as a starting point. However, https://endoflife.date/api/all.json already provides the data in JSON. If it were enhanced to include some more metadata, e.g. the date of the data dump, this would be the start of something very useful.

anthonyharrison commented 1 year ago

Hi @anthonyharrison, thank you for the idea.

endoflife.date is using the static site generator Jekyll. Given the static nature of endoflife.date, that may be difficult to implement: JSON and HTML files are only generated when there is an update on the master branch.

@captn3m0 Would it not be possible to maintain a history of changes to the information contained within the _data directory and then return details of the products which have changed via an API? The current API endpoints allow me to get all of the data, but they require that I fetch everything and not just the updated products.

marcwrobel commented 1 year ago

if this was enhanced to include some more metadata, e.g. the date of the data dump, this would be the start of something very useful.

A timestamp containing the date of the JSON file would be easy to add, but it requires the v1 API format (under development, see #2080 and https://deploy-preview-2080--endoflife-date.netlify.app/docs/api/v1/ for a preview). Unfortunately the current format (v0) cannot be updated without introducing a breaking change, and we did not plan to add new endpoints.

I do not mind adding a new /v1/products/all endpoint containing all the products with their corresponding release cycles. But that file will be big (I don't know exactly how big, but at least a few MB). So I think we should consider Netlify bandwidth limits before doing that. @captn3m0, do you think it may be problematic?

captn3m0 commented 1 year ago

Our current bandwidth usage is around ~50GB out of our 1TB limit, so I don't see any issue there. If this ever gets problematic due to this endpoint, we can set a redirect to another host/implement caching etc easily.

However, I don't think we should be abusing our API to essentially serve a dataset. I can suggest a few alternative approaches:

  1. A separate repository called eol-dataset with a dump of all JSON files. This can be imported as a submodule for any usage easily.
  2. Setting up GitHub releases on this, or a separate repository, where we publish the data regularly. If this is automated correctly, a link like https://github.com/endoflife-date/endoflife.date/releases/latest/dataset.tar.gz will always point to the latest version of the dataset, and that can be used for any programmatic usage.
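A consumer of such a release archive might read it along these lines. This is a sketch under assumptions: the archive layout (one JSON file per product inside a gzip-compressed tarball) is hypothetical, and so is the function name. Reading members in memory instead of extracting them to disk sidesteps any concern about unsafe paths inside the archive.

```python
import io
import json
import tarfile


def read_dataset(archive_bytes):
    """Map each JSON file name in a .tar.gz archive to its parsed content."""
    products = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            # Assumed layout: regular files named <product>.json, possibly in a folder.
            if member.isfile() and member.name.endswith(".json"):
                name = member.name.rsplit("/", 1)[-1][: -len(".json")]
                products[name] = json.load(archive.extractfile(member))
    return products
```

The bytes passed in would come from downloading whatever URL the release ends up using, for example with `urllib.request.urlopen(url).read()`.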

@anthonyharrison I'd be curious about the use case here, to see if we can improve the API/documentation/roadmap further to account for this.

usta commented 1 year ago

We could also add a new JSON endpoint called XYZ_meta.json that holds only the metadata for XYZ; users could then decide whether to fetch the full data, XYZ_data.json, from a hosted cache or from our normal location. XYZ_meta.json would contain only metadata such as revision_id, revision_date and revision_dataurl, so that projects like adriens's, or anyone else's, can check it before fetching the actual data. This lets them avoid downloading the same big file (for example all_data.json) again when its revision_date matches that of their own copy.

NOTE: adding just revision_date to our current endpoints won't fix the main problem, which is that users still need to redownload the same big file, unless we implement this idea. @captn3m0 @marcwrobel @anthonyharrison @adriens
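The check-before-download idea could look roughly like this on the client side. The metadata endpoint does not exist; it is only proposed in this comment, so the field names are taken from the proposal and the function names and fetch callables are hypothetical.

```python
def needs_download(local_meta, remote_meta):
    """Return True when there is no local copy or its revision differs."""
    if local_meta is None:
        return True
    return local_meta.get("revision_id") != remote_meta.get("revision_id")


def sync(local_meta, fetch_meta, fetch_data):
    """Fetch the small metadata first; fetch the big data only for a new revision.

    fetch_meta and fetch_data are caller-supplied callables (hypothetical).
    Returns (metadata, data); data is None when the local copy is current.
    """
    remote_meta = fetch_meta()
    if not needs_download(local_meta, remote_meta):
        return local_meta, None
    return remote_meta, fetch_data(remote_meta["revision_dataurl"])
```

The cost of this design is one extra small request per check, in exchange for skipping the large download whenever nothing has changed.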

marcwrobel commented 1 year ago

NOTE: adding just revision_date to our current endpoints won't fix the main problem, which is that users still need to redownload the same big file, unless we implement this idea.

@usta, is XYZ the product name ? If yes the product files are not that big (2 to 20 KB each I would say), so I think sending two requests separately could take longer than retrieving all the data in one shot.

Note that v1 product API endpoint already includes a lastModified field, corresponding to the last time the product file was updated. Example : https://deploy-preview-2080--endoflife-date.netlify.app/api/v1/products/ansible/.
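A client could use such a field to work out which products to refresh. The sketch below assumes the client has already collected a mapping from product name to its lastModified value, once from its local copy and once from the API; how that mapping is built, and the exact format of the value, are assumptions, so the values are only compared for equality.

```python
def changed_products(local, remote):
    """Return names of products that are new or whose lastModified value differs.

    local and remote map product name -> lastModified value (hypothetical shape).
    """
    return sorted(
        name for name, modified in remote.items()
        if local.get(name) != modified
    )
```

Only the products returned by `changed_products` would then need to be downloaded again.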

anthonyharrison commented 1 year ago

@anthonyharrison I'd be curious about the use case here, to see if we can improve the API/documentation/roadmap further to account for this.

@captn3m0 I am trying to develop an automated audit function that identifies whether a product is under support, under extended support or EOL, and triggers some workflows. For products which are nearing end of support, I want to trigger a workflow to look at the upgrade path; for those which are EOL (or nearing EOL), I want to trigger a different workflow.
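The classification step of such an audit might be sketched as below. Everything here is illustrative: the 90-day "nearing" window, the status labels and the function name are hypothetical, and both dates are assumed to be known for the release cycle being checked.

```python
from datetime import date, timedelta

WARNING_WINDOW = timedelta(days=90)  # hypothetical threshold for "nearing"


def support_status(today, support_end, eol, warning=WARNING_WINDOW):
    """Classify a release cycle from its end-of-support and end-of-life dates.

    support_end: last day of regular support.
    eol: last day of any support, including extended support.
    """
    if today > eol:
        return "eol"
    if today > eol - warning:
        return "nearing-eol"
    if today > support_end:
        return "extended-support"
    if today > support_end - warning:
        return "nearing-end-of-support"
    return "supported"
```

Each returned label could then be mapped to the workflow to trigger, for example the upgrade-path review for "nearing-end-of-support" and a separate workflow for "eol" and "nearing-eol".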

usta commented 1 year ago

@usta, is XYZ the product name ? If yes the product files are not that big (2 to 20 KB each I would say), so I think sending two requests separately could take longer than retrieving all the data in one shot.

@marcwrobel Nope, I mean the all, upcomingEOL, ... endpoints

adriens commented 1 year ago

@adriens Could you detail your plan to solve for this?

@captn3m0 , I'll release a first draft in a few minutes :crossed_fingers:

adriens commented 1 year ago

@captn3m0 , here is a first proof of concept :

https://www.kaggle.com/datasets/adriensales/endoflifedate/

Please notice that :

image

adriens commented 1 year ago

select category,
       count(*)
from product_categories
group by category
having count(*) > 10;

image

adriens commented 1 year ago

There are some cool surprises I'm working on too, on the same topic.

adriens commented 1 year ago

image image image image

adriens commented 1 year ago

:point_up: Other files will be added : does anyone want to give a try to a ; image

adriens commented 1 year ago

:star_struck: image

adriens commented 1 year ago

:thought_balloon: :

adriens commented 1 year ago

:point_right:

adriens commented 1 year ago

:memo: Some dedicated blog post

MartinPetkov commented 1 year ago

I opened a PR to implement the idea in https://github.com/endoflife-date/endoflife.date/issues/2530#issuecomment-1439830897, since I liked that idea and would make use of it myself.

@adriens I see this as orthogonal to your efforts. Your work seems much more full-featured as compared to the simple GitHub Action I wrote, but I still think having a GitHub Release with a simple file is useful.

adriens commented 1 year ago

This is not a requirement for the time being, I guess because this information is not always available. Do you know where this information can be found ?

Yes, both approaches are useful :+1:

adriens commented 1 year ago

Hi guys, I finally managed to get something quite consistent, check https://www.kaggle.com/code/adriensales/endoflife-date-offline-copy/notebook

exports.tar.gz

adriens commented 1 year ago

Hi @anthonyharrison here is something for you

:point_down:

marcwrobel commented 1 year ago

Just did a first test on #2080 to export all the products with their versions on a single endpoint : https://deploy-preview-2080--endoflife-date.netlify.app/api/v1/products/full.

The generated JSON file is much smaller than what I expected : 696K uncompressed / 82K compressed (gzip).

adriens commented 1 year ago

Great, I'll try to prepare integrations