apotheon / CopyfreeWorks

Copyfree Works: what it says on the tin --
http://copyfree.org/resources/works

Add datasource value to records. #28

Open apotheon opened 7 years ago

apotheon commented 7 years ago

@lbmn proposed new metadata:

The proposal involved a works_databases_references key to provide references to sources of information about works included in the list, as exemplified in a web paste for a YAML work submission @lbmn provided. This kind of thing could prove useful in the transition to a proper works database, with some automated population of new works entries and perhaps automated alerts of changes to projects behind existing entries based on those data sources.

At this time, datasources seems like a better metadata key name, though the specific format of the values in the datasources array must still be considered. Please share any ideas and suggestions in comments here, or in discussion in the #copyfree and/or ##copyfree channels on freenode.

lbmn commented 7 years ago

(0) Brainstorming The Reference Format Name

The temporary column name of works_databases_references in the web paste was modeled on the existing CI conventions. CI is using the word "works" to refer to the various things in this data set (programs / scripting libraries / plug-ins / fonts, videos, books, etc). Spelling out the word "references" and using underscores mimicked the license_reference field. Together the name (although a bit verbose) is descriptive of the concept I am introducing.

Using datasources is ok, but it would be better if we were to think this through and come up with a unique name / initialism to define this concept, initiating a reusable and potentially standard way to reference downloadable works across various works databases. Since it would reference data sources that also contain non-copyfree works, we shouldn't use "copyfree" in the name.

I think the terms "content" and "package" combine to a good general term for what we are talking about. It is more descriptive than the term "work", which can be mistaken for other definitions of this word. Also a work can be unfinished, unpublished, not information-based, etc.

The term "content package" differs from using the word "package" alone, indicating that we can be talking about things other than software: ebooks, structured data packages, Web-site snapshots (ex. ZIM), videos, etc. It differs from using the word "content" alone to indicate something that isn't a free-flowing scrap of content (like this post) but an organized versioned aggregate of all the pieces needed for some end (like a source repository complete with issues discussions).

And what we are defining here is a way to link to / reference content package metadata in another database / authority / index. We are connecting different ecosystems: FreeBSD packages link to other FreeBSD packages (ex. as dependencies), and NPM packages link to other NPM packages, but we want to link to both. I think the term interlink is fitting.

And so, until I get better ideas, my suggested name for this standard is: Content Package Interlink Format, or C.P.I.F.! :smiley:

It's also a play on "copy if" - you may or may not want to copy this content package depending on the meta-data you find in CPIF links.

But I hope someone suggests something better, so the name is of course subject to change.

(1) Brainstorming The Reference Format Structure

The CPIF link format has to be multi-part, with the first part identifying the package database, and at least one more to uniquely identify the specific package.

Since BSD ports datasources (likely our most significant data source for software) and Gentoo Portage use a slash-delimited path to identify records, I also used a slash following the prefix. Other content package databases may use a deeper-layer hierarchy. This also maps easily to URLs and the Unix filesystem, including some new ideas for the latter (ex. mv /usr/ports /cpif/freebsd).

It is an open question if maybe CPIF should organize the data-sources by category (ex. /cpif/video/youtube, /cpif/software/cabal, etc). I currently think this is a bad idea, because some databases could fall into multiple categories, and it would be best to deal with that further down the path (ex. /cpif/facebook/video/$ID, /cpif/facebook/photo/$ID). Also, some data sources defy easy categorization.
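
A minimal sketch of the structure described in this section, assuming only what is stated above: the first slash-delimited segment names the package database, and at least one further segment identifies the package. The names `Reference` and `parse_reference` are hypothetical and are not part of any agreed format.

```python
from typing import NamedTuple, Tuple


class Reference(NamedTuple):
    database: str           # first segment, e.g. "freebsd"
    path: Tuple[str, ...]   # remaining segments, e.g. ("www", "h2o")


def parse_reference(text: str) -> Reference:
    """Split a slash-delimited reference into its database and path parts."""
    parts = text.split("/")
    # The proposal calls for a database prefix plus at least one more part,
    # so a bare prefix or an empty segment is rejected here.
    if len(parts) < 2 or any(part == "" for part in parts):
        raise ValueError(f"not a valid reference: {text!r}")
    return Reference(database=parts[0], path=tuple(parts[1:]))


print(parse_reference("freebsd/www/h2o"))
print(parse_reference("homebrew/h2o"))
```

Keeping the path as a tuple of arbitrary length leaves room for databases with a deeper hierarchy, without deciding the category question raised above.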

(2) Database Identifiers

I think my web paste example covers most foreseeable scenarios. (Note that it contained an error: I forgot to edit out ".se" from the pkgsrc prefix when pasting.) In light of the above brainstorming, it should now read:

h2o:
  uri:
  - https://h2o.examp1e.net
  tags:
  - server
  - software
  - web
  license:
  - MIT/X11 License
  license_reference:
  - https://h2o.examp1e.net/faq.html
  cpif:
  - github/h2o/h2o
  - freebsd/www/h2o
  - pkgsrc/www/h2o
  - opensuse/h2o
  - homebrew/h2o

It is an open question whether we should use domain names for the projects (ex. brew.sh) rather than a simplified ID string (ex. homebrew). I think that the latter is the way to go. This way we can maintain consistency even if domain names change (ex. a gTLD to Namecoin exodus). Also sometimes there are multiple sites for a package database: some more formal for the project (ex. freebsd.org, pkgsrc.org) while other third party sites contain the actual metadata (ex. freshports.org, pkgsrc.SE).
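
To illustrate the simplified-ID option, here is a rough sketch in which one stable ID stands for however many sites sit behind a database. The IDs and domains are the ones named in this comment; the table layout and the names `DATABASE_SITES` and `sites_for` are assumptions made up for the example.

```python
from typing import Dict, List

# Hypothetical table: one stable ID per database, any number of sites behind it.
# A domain change would only touch this table, not the records that use the ID.
DATABASE_SITES: Dict[str, List[str]] = {
    "freebsd": ["freebsd.org", "freshports.org"],
    "pkgsrc": ["pkgsrc.org", "pkgsrc.se"],
    "homebrew": ["brew.sh"],
}


def sites_for(reference: str) -> List[str]:
    """Return the known sites for the database named by a reference's prefix."""
    database = reference.split("/", 1)[0]
    # Unknown prefixes yield an empty list rather than an error.
    return DATABASE_SITES.get(database, [])


print(sites_for("pkgsrc/www/h2o"))
print(sites_for("opensuse/h2o"))
```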


(To be continued...)

apotheon commented 7 years ago

Format Name

I'm not a big fan of the term "content" for this purpose. You say this to justify it:

The term "content package" differs from using the word "package" alone, indicating that we can be talking about things other than software

This usage seems to imply the common use of the term "content" on the internet, which actually implies it's not software. Broadening it enough to incorporate software, though, turns the word back into its generalized default meaning: stuff inside something else. That is so broad as to be meaningless, and coupled with "package" it becomes even less meaningful, because a "content package" then just becomes a "package containing contents". Duh, of course -- that's what packages do (contain contents). As such, I think "content package" is a largely pointless term that does not actually describe what we mean in any useful fashion. It is, in fact, likely misleading. I would be more inclined to use "package" by itself than "content package", which does not have to refer specifically to software. In fact, many package management systems (originally designed for software) deliver non-software packages as well as software packages. Consider the existence of documentation packages in, e.g., Debian and FreeBSD package archives.

I find your objection to the word "work" unconvincing. The term has some relevant meaning in law, as well as ample precedent for exactly the sort of meaning we want. It is aptly descriptive of what the copyfree works database would address: copyrightable works (and, thus, copyfree-able works).

Of course, I also think that the "interlink format" you propose is naturally more general than just some medium or protocol for sharing data about content, software, works, or whatever else you might describe in terms no less specific. I don't really have any objections to the terms "interlink" and "format", but think it needs a different name. Perhaps "metadata interlink format" works better, and lends itself easily to a typical filename extension profile (.mif) if such is needed.

I think all I really like about the name as a whole that you invented is the implicit reference to the standard Unix file copy utility, cp. All this is roughly irrelevant to the matter of figuring out how to actually format the metadata in YAML, though.

Format Structure / Database Identifiers

I think I'm happier (but not fully happy -- more on that in a moment) with something like:

freebsd/ports/h2o
github/repository/h2o
opensuse/yast/h2o

. . . or something like that. On the other hand, URIs might be appropriate instead, because both my above alteration of your suggestion and your suggestion itself run afoul of the problem of needing to maintain some kind of separate concordance metadata to help resolve that information to a machine readable set of directions to the original. Maintaining consistency in the first tier metadata despite changes in the location of the dataset itself is like putting lipstick on a pig, because one still has to maintain the concordance as a second tier of metadata -- that is, one must still have the pig behind the lipstick -- resulting in extra data having to be maintained for the sole purpose of trying to make things look pretty, trading away performance and (relatively) easy reliability of data maintenance to get it.
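
The tradeoff can be sketched roughly as follows. A short reference only resolves if a separate concordance has an entry for its prefix, while a URI needs no table at all. The URL templates below are assumptions for illustration, not a statement of how those sites guarantee their URL layout, and `URL_TEMPLATES` and `resolve` are hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical concordance: the second tier of metadata that short
# references depend on. Each entry has to be kept current by hand.
URL_TEMPLATES = {
    "github": "https://github.com/{path}",
    "freebsd": "https://www.freshports.org/{path}",
}


def resolve(entry: str) -> str:
    """Turn a datasources entry into a URL a machine can fetch."""
    # A full URI is already a literal direction to the source.
    if urlparse(entry).scheme in ("http", "https"):
        return entry
    # A short reference must go through the concordance.
    database, _, path = entry.partition("/")
    template = URL_TEMPLATES.get(database)
    if template is None:
        raise LookupError(f"no concordance entry for {database!r}")
    return template.format(path=path)


print(resolve("github/h2o/h2o"))
print(resolve("https://github.com/h2o/h2o"))
```

With this table, `resolve("opensuse/h2o")` raises `LookupError`, which is the maintenance burden in miniature: every new prefix needs a concordance entry before it is usable.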

Those are my sixteen cents. Two cents ain't what they used to be.

lbmn commented 7 years ago

I'm fine with mif.

I'll just make a few nitpicks, but leave the final decision to CI.

It would be interesting to have a lengthy debate on codifying a reexamined computing terminology — like how the word "content" means "digital stuff you can download and consume" (now including apps) rather than package "contents", etc, etc, etc — but this isn't the place.

I haven't thought of this as being a new file format (*.mif), but as a string syntax format to be used in other file formats - like how URI / Href syntax is used in HTML / etc. (Major differences from Href would include: *1* It can be a list of reference strings instead of just one. *2* Having a centrally-defined prefix lookup table instead of an arbitrary server. *3* No protocol, port, etc; only path.)

Some of your mif reference path examples seem unnecessarily verbose. What could possibly go under /mif/github/ if not repositories? With FreeBSD there are indeed things outside of ports (base source tree, docs / handbook, Web-site source tree, mailing list, etc), but those would be very rarely used. Maybe it would be better to reference them as /mif/freebsd-base/blah instead?

But, again, I leave the details up to CI. "Don't let perfect be the enemy of good."

I just wanted to emphasize the importance of coming up with a reusable standard for referencing works metadata sources, ideally with a memorable name. This syntax can then be used for a number of my future projects, like a package manager for installing copyfree software, fonts, Nim libraries, ebooks, offline website snapshots, etc.

apotheon commented 7 years ago

What could possibly go under /mif/github/ if not repositories?

gists, GitHub Pages, and wikis

With FreeBSD there are indeed things outside of ports (base source tree, docs / handbook, Web-site source tree, mailing list, etc), but those would be very rarely used. Maybe it would be better to reference them as /mif/freebsd-base/blah instead?

I'm not entirely sure what you're suggesting.

I just wanted to emphasize the importance of coming up with a reusable standard for referencing works metadata sources, ideally with a memorable name.

That's a good idea, of course, and I don't object in principle. Getting into the practical details, though, I still think that just providing literal directions to the source information is probably more useful.