DiseaseOntology / HumanDiseaseOntology

Repository for the Human Disease Ontology.
Creative Commons Zero v1.0 Universal

Remove 'releases' directory #858

Closed beckyjackson closed 3 years ago

beckyjackson commented 4 years ago

The src/ontology/releases/ directory is now over 12GB, which is quite a bit of space. This slows down git push and git pull, and takes up space on developers' machines.

Since moving DO over to GitHub, we have done regular releases using the GitHub release system and tagged these releases with dates. This means that all contents of the releases directory are already accessible from the releases tab of the GitHub page. There is no need to keep the full-sized contents of releases in the working tree, because this data is already stored, compressed, in the hidden .git directory.

The releases directory was recommended best practice back when ontologies used Subversion for version control, but it doesn't make sense with Git. The existing contents can be safely removed since they are stored in Git's version history, and there's no reason to keep adding release files to the releases directory.

All previous releases will still be available at their PURLs, since the PURL redirects to the GitHub release tag.

lschriml commented 4 years ago

Let's create a directory in src/ontology/releases for release files prior to 2020, pre-2020_DO_releases, and zip the pre-2020 files from the src/ontology/releases folder.

Cheers, Lynn

beckyjackson commented 4 years ago

The release files are already stored in the hidden .git directory, and can be accessed with git on the command line or found at https://github.com/DiseaseOntology/HumanDiseaseOntology/releases

Storing them compressed in another folder would only duplicate the data. It will definitely be much smaller, but it would still be about 1GB of data.

There's no reason to keep storing new releases in the releases directory, either, since they can be found in the same way.

lschriml commented 4 years ago

Thank you Becky, I will think about the options. Cheers, Lynn

beckyjackson commented 4 years ago

Great! Let me know if you have any questions or concerns. I think this would be a big improvement to the maintainability of the GitHub repository.

beckyjackson commented 4 years ago

Did you have a chance to consider the options for this? Thanks!

lschriml commented 4 years ago

Yes, I've reviewed and researched this. I want to keep the files in the repository. The previous years (releases from Dec 2019) could be zip'd and tar'd, but I don't want them removed from this directory.

Cheers, Lynn

beckyjackson commented 4 years ago

OK - we can absolutely do that, but could I ask why you decided to keep them as zip files?

Git was created for text files, and while it can handle binary files like zips, it's not very good at it and doesn't compress them the way it does text files, so size can still end up being an issue. This is the same reason we don't commit the ROBOT JAR to the repository. Here's an article explaining why to avoid committing binaries. Here's another short answer about the same concept. For my own purposes, I'm just curious about what research you found that supports keeping zips in version control?
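To illustrate the compression point, here is a minimal Python sketch. Git compresses stored objects with zlib, and zip archives are typically already deflate-compressed, so a second pass has little left to remove. The input below is synthetic text that only loosely resembles ontology lines; it is an assumption for illustration, not real DO content, and the exact sizes will differ for real files.

```python
import random
import zlib

# Synthetic stand-in for a text ontology file (illustrative data, not real DO content).
rng = random.Random(0)
entries = [
    f"id: DOID:{rng.randrange(10_000):07d}\nis_a: DOID:{rng.randrange(10_000):07d}\n"
    for _ in range(20_000)
]
text = "".join(entries).encode("ascii")

# First pass: plain text shrinks substantially.
once = zlib.compress(text, 9)

# Second pass: the input is already compressed, so there is little left to gain.
twice = zlib.compress(once, 9)

print(f"plain text:        {len(text):>9,} bytes")
print(f"compressed once:   {len(once):>9,} bytes")
print(f"compressed twice:  {len(twice):>9,} bytes")
```

The first pass should shrink the text considerably, while the second pass should leave the size roughly unchanged, which is the situation Git is in when it is handed a zip file.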

Another note - if you're worried about the releases not being accessible if we remove that directory, Git stores all past files. They will always be accessible via git, e.g. git checkout v2018-09-07 will take you to the Sept 7 2018 release (or you can view this release by going to https://github.com/DiseaseOntology/HumanDiseaseOntology/tree/v2018-09-07)
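As a sketch of reading a past release straight from Git history without switching the working tree, the following Python wraps `git show <tag>:<path>`. The helper names `git_show_args` and `read_file_at_tag` are hypothetical, and the example assumes a local clone that has the release tags fetched.

```python
import subprocess
from typing import List


def git_show_args(tag: str, path: str) -> List[str]:
    """Build the command that prints one file as it existed at a tag."""
    return ["git", "show", f"{tag}:{path}"]


def read_file_at_tag(repo_dir: str, tag: str, path: str) -> bytes:
    """Return the file contents at the given tag (requires a local clone with tags)."""
    result = subprocess.run(
        git_show_args(tag, path),
        cwd=repo_dir,
        check=True,           # raise if the tag or path does not exist
        capture_output=True,  # collect the file contents from stdout
    )
    return result.stdout


# Example (needs a clone of the repository at the hypothetical path ./HumanDiseaseOntology):
# owl = read_file_at_tag("HumanDiseaseOntology", "v2018-09-07", "src/ontology/doid.owl")
```

Unlike git checkout, this leaves the current branch and working tree untouched.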

beckyjackson commented 4 years ago

I was wondering if you had a chance to review my last comment? I'm still curious what you found to support keeping zips in version control before proceeding. Thank you!

beckyjackson commented 3 years ago

@lschriml I'm removing the "low priority" tag from this because I think this is something important that we should resolve. The repository is getting much larger with each release. I was wondering if you had a chance to look at my comment from Jun 23? Thanks!

beckyjackson commented 3 years ago

I also want to mention that having all the releases stored slows down Travis quite a bit. Each time Travis runs, it has to download the whole repository.

lschriml commented 3 years ago

Hello Becky, I would be OK with moving 2019 and previous releases to a different DO GitHub directory within src/ontology/, e.g. Prior Releases/2019_Releases, 2018_Releases, etc., while keeping the previous year (2020) in the current Releases directory. At the end of each year, we would move the previous year's releases over (e.g. at the end of 2021, move 2020_Releases to the 'Prior Releases' directory).

I have concerns that if we do not have them in our GitHub, it will become an issue in the future for showing our longevity. I understand that the releases can be found in GitHub; however, for grant reviewers, that may be too high a threshold for them to see the Releases.

We could also add a link to the 'Prior Releases' on the DO Releases page: https://github.com/DiseaseOntology/HumanDiseaseOntology/releases

This should solve the issue and address my concerns. What do you think?

Cheers, Lynn

beckyjackson commented 3 years ago

My biggest concern is that we are duplicating data. Currently, it takes over 6 minutes for our test framework just to check out the repo. This should only take about 10-30 seconds. Even if we move this to a separate repository, we're still duplicating the data and it would be more difficult to manage (we'd have to manually move things over).

I don't think there are any other OBO projects that preserve their releases this way since moving to GitHub.

If we remove the releases directory, all releases will still show up here: https://github.com/DiseaseOntology/HumanDiseaseOntology/releases

We can always provide links to older releases to show longevity, too, for example, a release from March 2016: https://github.com/DiseaseOntology/HumanDiseaseOntology/releases/tag/v2016-03-11

If you click "source code", you can download the repo at that time and view DO as it was on March 11, 2016. Alternatively, you can get the file directly here: https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2016-03-11/src/ontology/doid.owl (and we can configure the PURL redirects to make this a prettier URL)
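The URLs above all follow the same pattern, so they can be derived from the release date alone. A minimal Python sketch, assuming the `v<YYYY-MM-DD>` tag naming used in this thread; the function names are hypothetical:

```python
from datetime import date

REPO = "DiseaseOntology/HumanDiseaseOntology"


def release_tag(day: date) -> str:
    """Tag name for a dated release, e.g. v2016-03-11."""
    return f"v{day.isoformat()}"


def release_page_url(day: date) -> str:
    """GitHub release page for the tag."""
    return f"https://github.com/{REPO}/releases/tag/{release_tag(day)}"


def tree_url(day: date) -> str:
    """Browsable snapshot of the repository at the tag."""
    return f"https://github.com/{REPO}/tree/{release_tag(day)}"


def raw_file_url(day: date, filename: str = "doid.owl") -> str:
    """Direct download link for one ontology file at the tag."""
    return (
        f"https://raw.githubusercontent.com/{REPO}/"
        f"{release_tag(day)}/src/ontology/{filename}"
    )


print(release_page_url(date(2016, 3, 11)))
print(raw_file_url(date(2016, 3, 11)))
```

A link like this only resolves if a release was actually tagged on that date.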

Do you think this would be good for grant reviewers to show longevity?

beckyjackson commented 3 years ago

Hi @lschriml - I think I've come up with a good compromise. It's small and automatically maintained, so you don't have to change anything about what you do now.

We replace the releases/ directory with a RELEASES.md file that you can point to from any documentation, or use as an easy compilation to show grant reviewers. I think this actually makes it easier to see the progress we've made than looking at the releases directory. This file contains links to all dated releases, and I will write a script that automatically updates it each time you make a new release on GitHub.

The latest release will appear first, with older releases at the bottom. I will automatically populate the file with all the existing releases. Here's what I'm thinking it should look like:


DO Releases

2021 Releases

2021-02-24

This release includes 10,671 disease terms, the addition of 23 new diseases, 41 definitions and 211 SubClassOf statements in this release. New terms include glioma molecular subtypes and Bainbridge-Ropers syndrome, axioms defining transmission methods for bacterial infectious diseases and DO's Spring 2021 UMLS update.

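|  | OWL | OBO | JSON |
|---|---|---|---|
| Disease Ontology | [doid.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid.owl) | [doid.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid.obo) | [doid.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid.json) |
| Human DO | [HumanDO.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/HumanDO.owl) | [HumanDO.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/HumanDO.obo) |  |
| DO Non-Classified | [doid-non-classified.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid-non-classified.owl) | [doid-non-classified.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid-non-classified.obo) | [doid-non-classified.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid-non-classified.json) |
| DO Merged | [doid-merged.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid-merged.owl) | [doid-merged.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid-merged.obo) | [doid-merged.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-02-24/src/ontology/doid-merged.json) |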

2021-01-28

This release of 10,648 human diseases, includes 50 new diseases, including Parkinsonism and vascular Parkinsonism, a revised glioma classification, new subtypes for pemphigus, developmental and epileptic encephalopathy, and hypocalcemia.

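|  | OWL | OBO | JSON |
|---|---|---|---|
| Disease Ontology | [doid.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid.owl) | [doid.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid.obo) | [doid.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid.json) |
| Human DO | [HumanDO.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/HumanDO.owl) | [HumanDO.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/HumanDO.obo) |  |
| DO Non-Classified | [doid-non-classified.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid-non-classified.owl) | [doid-non-classified.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid-non-classified.obo) | [doid-non-classified.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid-non-classified.json) |
| DO Merged | [doid-merged.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid-merged.owl) | [doid-merged.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid-merged.obo) | [doid-merged.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-01-28/src/ontology/doid-merged.json) |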

2020 Releases

2020-12-22

This release includes a single syntax update from the previous release.

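|  | OWL | OBO | JSON |
|---|---|---|---|
| Disease Ontology | [doid.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid.owl) | [doid.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid.obo) | [doid.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid.json) |
| Human DO | [HumanDO.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/HumanDO.owl) | [HumanDO.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/HumanDO.obo) |  |
| DO Non-Classified | [doid-non-classified.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid-non-classified.owl) | [doid-non-classified.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid-non-classified.obo) | [doid-non-classified.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid-non-classified.json) |
| DO Merged | [doid-merged.owl](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid-merged.owl) | [doid-merged.obo](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid-merged.obo) | [doid-merged.json](https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2020-12-22/src/ontology/doid-merged.json) |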
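
A rough Python sketch of how a file with this layout could be generated. This is not the actual script mentioned in this thread; the function names, the heading levels, and the shape of the input (a list of date and description pairs) are assumptions for illustration. The file names and URL pattern are the ones used in the sample above.

```python
from datetime import date
from itertools import groupby

RAW = "https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology"
FORMATS = ("owl", "obo", "json")

# Table rows from the proposed layout: (label, file stem, formats published).
ROWS = [
    ("Disease Ontology", "doid", ("owl", "obo", "json")),
    ("Human DO", "HumanDO", ("owl", "obo")),
    ("DO Non-Classified", "doid-non-classified", ("owl", "obo", "json")),
    ("DO Merged", "doid-merged", ("owl", "obo", "json")),
]


def release_section(day, notes):
    """Render one dated release: heading, description, and link table."""
    tag = f"v{day.isoformat()}"
    lines = [f"### {day.isoformat()}", "", notes, "",
             "|  | OWL | OBO | JSON |", "|---|---|---|---|"]
    for label, stem, available in ROWS:
        cells = []
        for fmt in FORMATS:
            name = f"{stem}.{fmt}"
            # Leave the cell empty when a format is not published for this product.
            cells.append(f"[{name}]({RAW}/{tag}/src/ontology/{name})" if fmt in available else "")
        lines.append(f"| {label} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


def render_releases(releases):
    """Render all releases, newest first, grouped under one heading per year."""
    ordered = sorted(releases, key=lambda r: r[0], reverse=True)
    parts = ["# DO Releases"]
    for year, group in groupby(ordered, key=lambda r: r[0].year):
        parts.append(f"## {year} Releases")
        parts.extend(release_section(day, notes) for day, notes in group)
    return "\n\n".join(parts) + "\n"


sample = [
    (date(2020, 12, 22), "This release includes a single syntax update from the previous release."),
    (date(2021, 2, 24), "Description of the 2021-02-24 release."),
    (date(2021, 1, 28), "Description of the 2021-01-28 release."),
]
print(render_releases(sample))
```

Because the whole file is rebuilt from the list of releases, deleting a release only requires removing it from the input and regenerating.
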
lschriml commented 3 years ago

Hello Becky, excellent!! Let's do this. One question: when I am making a release in GitHub, will it go to this new directory or to the 'releases' folder? If I make a release, then decide to delete it, would this be done in the .md file or in the usual releases folder?

Please post on the group's Slack channel when this change is being made, as it will impact where files are pointed to in the website and potentially where Mike and Dustin retrieve files.

If possible, I would like to keep only the most recent release in the old 'releases' folder. I ask because some external users use this URL to retrieve the latest release.

Thank you for figuring this out!! Much appreciated, Lynn

beckyjackson commented 3 years ago

> when I am making a release in GitHub, will it go to this new directory or to the 'releases' folder?

When you make a release, a "tag" is created that creates a URL for that release. So it will not go to a specific destination, but will be accessible via the links which I will put in the RELEASES.md file.

> If I make a release, then decide to delete it, would this be done in the .md file or in the usual releases folder

The RELEASES.md file will only update when you make the official GitHub release, not when you run make release from the command line. If you do make a GitHub release and then decide to delete it, just let me know and I'll update the RELEASES.md file.

> If possible, I would like to keep only the most recent release in the old 'releases' folder?

Yes, this is possible and seems reasonable to me.

lschriml commented 3 years ago

Great!! Please let the team know about these upcoming changes via Slack, and go ahead and make all of the changes.

How are you doing? Back in Oregon? Spring is in full swing in Maryland.

Cheers, Lynn

lschriml commented 3 years ago

The Releases folder has been updated. Closing the ticket.