Design a data-structure to store all information about resources

erlang-punch / awesome-erlang

An (accurate) list of awesome Erlang resources

Design a data-structure to store all information about resources #6

Open niamtokik opened 11 months ago

niamtokik commented 11 months ago

At the time of writing, I don't have a clear idea of the data structure we could use to deal with these resources. I would probably use a map at first, with some fixed fields; it would be easy to export/convert to other formats like JSON. A table containing tags/categories should also be present.

Maartz commented 11 months ago

Okay, so in order to be able to tackle #7, a proper data structure to hold the data should be defined.

I have to admit that I'm really used to relational DBs.

Though it seems this is better suited to a document-oriented DB, or maybe a graph one, since we also tend to value the links between resources.

I know that Erlang supports graphs very well, and I've even used them a bit.

niamtokik commented 11 months ago

We are using Mnesia. It's not negotiable. A few structures have been created but are probably not correctly designed for the moment. As I previously said in #7, we can also use it as a simple object storage at first and design something better later. In this case, we need at least to define what we should store in it and the mandatory elements we can find by default in it.

niamtokik commented 11 months ago

One of my first ideas was to create one table for each kind of resource. I think it's too complex and not flexible enough at first. The second idea was to define a resource by its URL (everything is a URL on the internet), its kind, a list of categories and, finally, its content. Something like this:

-type url() :: binary().
-type kind() :: undefined | github | gitlab | website | publication | blog | book | course | screencast | author.

%% github resource definition (counters may legitimately be zero)
-type resource_github() :: #{ license => binary()
                            , contributors => non_neg_integer()
                            , last_commit => binary()
                            , open_issues => non_neg_integer()
                            , stars => non_neg_integer()
                            , forks => non_neg_integer()
                            , repo_size => non_neg_integer()
                            , watchers => non_neg_integer()
                            }.

-type resource_gitlab() :: #{}.
-type resource_hex() :: #{}.
-type resource_publication() :: #{}.
-type resource_video() :: #{}.
-type resource_book() :: #{}.

-type resource() :: resource_github()
                  | resource_gitlab().

-record(resources, { url = <<>> :: url()
                   , kind = undefined :: kind()
                   , resource = #{} :: resource()
                   , created_at = <<>> :: binary()
                   , updated_at = <<>> :: binary()
                   }).

Maartz commented 11 months ago

And you would extract the name of the resource via its URL? Same goes for related resources. It should be in the form of #{name => url}, so a resource should hold this data.

Also, I do not think it is worth having a clear distinction between GitLab and GitHub, since we look for almost the same thing on the two websites. Moreover, there are other sites like Codeberg or Gitea that could be supported (I know the majority is on GitHub).

What do you think?

niamtokik commented 11 months ago

Also, I do not think it is worth having a clear distinction between GitLab and GitHub, since we look for almost the same thing on the two websites.

I wanted to do the same at first, but I think designing different providers can help to find duplicates/clones/forks of a repository. When I started to collect repositories, I found many of them were simply forks with no modifications. It can be highly confusing. So, to me, a resource must be unique (based on its URL or any other kind of identifier).

A resource has one or many tags/categories to help people find it. Categories are fixed, based on the ones available on other awesome lists (or the ones we think are important). Tags are "dynamic": we can simply add them to quickly describe a project.

Another relation should exist: a relation between resources. For example, say we want to put the nostr project in the database. It should give us something like this on the client side (as a JSON object):

#{ kind => github
, url => "https://github.com/erlang-punch/nostr"
, resource => #{}
, relations => [#{ kind => author, resource => #{ name => "..." } }]
, tags => [release, application, http, client, relay]
, categories => [nostr]
}

The idea behind that is to find something flexible enough, like a document divided in two parts: metadata and data. The most important data are stored in the resource field, and the rest can be stored around it.

Moreover, there are other sites like Codeberg or Gitea that could be supported (I know the majority is on GitHub).

That's correct but, as you already know, the vast majority of the applications are available on GitHub and/or GitLab. Quick and dirty clients have already been created using httpc for:

I would like to avoid using external dependencies for this project (again) and use only what we have in Erlang/OTP. The design of the structure previously shown is not correct to me, but could be okay to start the project.

Maartz commented 11 months ago

Having a constraint on reading rather than writing makes it easier to work with a potentially changing structure, so I agree. We can fine-tune requirements throughout the process. I really like the resource => #{} field because it adds a lot of value to the data itself. Sorry, I really like the idea of a graph, at least for the resources. What would prevent us from using it?

niamtokik commented 11 months ago

What would prevent us from using it?

Nothing, I guess. This is a free and open source project; we can try anything, and if it's easy and flexible enough, we can integrate it. The only constraint is Mnesia as the back-end, but if you want to create something based on a graph to deal with the relations, why not.
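As a rough illustration of the graph idea using only Erlang/OTP, here is a minimal sketch built on the standard digraph module. The module name awesome_graph and its function names are hypothetical, and a digraph lives in memory (ETS), so persisting the relations in Mnesia would still need to be handled separately.

```erlang
-module(awesome_graph).
-export([new/0, add_resource/3, relate/3, related/2]).

%% Create an empty (cyclic, in-memory) graph of resources.
new() ->
    digraph:new().

%% A vertex is the resource URL; its label is the kind of resource.
add_resource(Graph, Url, Kind) ->
    digraph:add_vertex(Graph, Url, Kind).

%% Add a directed relation between two known resources.
%% A resource cannot be related to itself.
relate(_Graph, Url, Url) ->
    {error, self_relation};
relate(Graph, Source, Target) ->
    case digraph:add_edge(Graph, Source, Target) of
        {error, Reason} -> {error, Reason};
        _Edge -> ok
    end.

%% List the resources directly related to Url, without duplicates.
related(Graph, Url) ->
    lists:usort(digraph:out_neighbours(Graph, Url)).
```

Relating an author URL to a book URL and then calling related/2 on the author returns the book URL; relating a resource to an unknown URL returns the {error, {bad_vertex, Url}} tuple produced by digraph.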

niamtokik commented 11 months ago

Before designing something, we should probably start by defining the terms we will use. Here is a small glossary with the main ones we should start with.

Url as binary(): A URL is a Resource identifier used as primary key. A Url MUST be unique across all the resources. A Url is composed of the protocol used (e.g. https), a hostname or an IP address (e.g. github.com), a path (e.g. /erlang-punch/nostr) and an optional query part (e.g. x=1&x=2). All of these components must be valid and can be checked using the uri_string module.
Name as binary(): A name defining the resource. A Name is not unique.
Resource as map(): A Resource is a sanitized JSON object converted into a map(), coming from one external service. Only the fields actually used MUST be present in this resource, and they MUST use JSON-compatible terms.
Category as binary(): A category is a unique term across the database defining which "category" a resource belongs to. A resource can have ONLY ONE category. For example, if someone wants to add the https://github.com/erlang-punch/nostr resource, it will be automatically added to the github category. The github category is a reference to a module containing all the rules to fetch, extract and sanitize a resource.
Tag or Tags as binary(): A tag is an extra term added to a resource defining its usage. For example, https://github.com/erlang-punch/nostr will have the nostr tag as well as websocket and cowboy.
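To make the Url rule more concrete, here is a small sketch of a validation function based on uri_string:parse/1 from Erlang/OTP. The module name awesome_url and the missing_component error atom are assumptions made for this example; the check only requires a non-empty scheme and host, and the query part stays optional.

```erlang
-module(awesome_url).
-export([validate/1]).

%% Accept a URL only if it parses and carries a scheme, a host and a path.
-spec validate(binary()) -> {ok, binary()} | {error, term()}.
validate(Url) when is_binary(Url) ->
    case uri_string:parse(Url) of
        #{scheme := Scheme, host := Host, path := _Path}
          when Scheme =/= <<>>, Host =/= <<>> ->
            {ok, Url};
        {error, Reason, _Term} ->
            %% the input is not a syntactically valid URI
            {error, Reason};
        _Incomplete ->
            %% parsed, but the scheme or the host is missing
            {error, missing_component}
    end.
```

With this sketch, https://github.com/erlang-punch/nostr is accepted, while a bare word like nostr is rejected because it parses as a path without scheme or host.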

Let's start by creating a new test module called awesome_resource.

-module(awesome_resource).

Resource

Here is a draft of the resource record.

-type url() :: binary().
-type category() :: binary().
-type resource() :: map().
-type tag() :: binary().
%% a possibly empty list, so that the default value [] below is well-typed
-type tags() :: [tag()].

-record(awesome_resource, {
    url = undefined                   :: undefined | url(),
    category = undefined              :: undefined | category(),
    resource = #{}                    :: resource(),
    tags = []                         :: tags(),
    created_at = undefined            :: undefined | pos_integer(),
    updated_at = erlang:system_time() :: pos_integer()
}).

Tag

Tags are stored as unique terms in a Mnesia table, which we can easily create using this definition. Only tags present in this table can be added to a resource's tag list.

-record(awesome_tag, {
    key = undefined :: undefined | tag(),
    active = true   :: boolean()
}).
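As a sketch of how these two records could be used with Mnesia, the module below creates both tables in RAM and refuses to store a resource whose tags are not present (and active) in the tag table. The module name awesome_store and its functions are hypothetical, the records are simplified copies without type annotations, and init/0 is meant to be called only once on a fresh node.

```erlang
-module(awesome_store).
-export([init/0, add_tag/1, add_resource/3]).

-record(awesome_resource, {url, category, resource = #{}, tags = [],
                           created_at, updated_at}).
-record(awesome_tag, {key, active = true}).

%% Start Mnesia and create both tables as RAM-only tables.
init() ->
    ok = mnesia:start(),
    Tables = [{awesome_resource, record_info(fields, awesome_resource)},
              {awesome_tag, record_info(fields, awesome_tag)}],
    [{atomic, ok} = mnesia:create_table(Name, [{attributes, Fields}])
     || {Name, Fields} <- Tables],
    ok.

%% Register a tag so that resources are allowed to use it.
add_tag(Tag) when is_binary(Tag) ->
    mnesia:transaction(fun() -> mnesia:write(#awesome_tag{key = Tag}) end).

%% Store a resource; abort the transaction if a tag is unknown or inactive.
add_resource(Url, Category, Tags) ->
    Now = erlang:system_time(),
    mnesia:transaction(fun() ->
        case [T || T <- Tags, not is_active(T)] of
            [] ->
                mnesia:write(#awesome_resource{url = Url,
                                               category = Category,
                                               tags = Tags,
                                               created_at = Now,
                                               updated_at = Now});
            Unknown ->
                mnesia:abort({unknown_tags, Unknown})
        end
    end).

%% Must be called inside a transaction.
is_active(Tag) ->
    case mnesia:read(awesome_tag, Tag) of
        [#awesome_tag{active = true}] -> true;
        _ -> false
    end.
```

A successful call returns {atomic, ok}; using a tag that was never registered returns {aborted, {unknown_tags, Tags}} and leaves the resource table untouched.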

einariii commented 11 months ago

Wonderful discussion. For now I agree with the overall structure as it currently appears. I am keen on using Mnesia. It will be my first time. It's impressive to use modules/lambda functions to retrieve resource content. Is the retrieval done as the object enters the database, or after?

I understand Categories here are firstly means of dealing with data providers, not "categories" in the sense human readers conceptualize/think of things (programs might be for cryptocurrency, chatting, gaming, etc.). So which aspect of the database correlates to the resource organization on the static site? At first glance it would be -type kind(), but the Github or Gitlab, for example, are too generic for this purpose. Will Tags determine sub-categories of kinds?

niamtokik commented 11 months ago

It's impressive to use modules/lambda functions to retrieve resource content. Is the retrieval done as the object enters the database, or after?

Lambda functions or module/function pairs are an abstraction here. The developer is in charge of creating them and defining the rules to fetch/sanitize data. At first, it could simply be a dirty function somewhere; in the end, we could create a behaviour and create better categories, like @Maartz was asking at the beginning of the conversation. For example, gitlab/github/gitea are all repositories but their APIs are not identical (github requires a token, gitlab does not have the same fields, and gitea... I don't know for this one). Instead of categories, we could create something different, perhaps "kind" or "endpoint" or simply "service". Anyway, the goal is to offer a strong abstraction: if someone wants to add a new "category", one can simply create a new module and add its reference to a map() or to a dedicated Mnesia table. This pattern is used in ejabberd and rabbitmq to extend them.
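A minimal sketch of this abstraction could look like the module below: a registry map associates each category with either a lambda function or a {Module, Function} pair, and a sanitize step keeps only the fields actually used. The module name awesome_fetch, the registry shape and the error tuple are assumptions made for illustration, not an existing API.

```erlang
-module(awesome_fetch).
-export([fetch/3, sanitize/2]).

%% Registry :: #{Category => fun((Url) -> Result) | {Module, Function}}
%% Look up the fetcher registered for Category and apply it to Url.
fetch(Registry, Category, Url) ->
    case maps:find(Category, Registry) of
        {ok, {Module, Function}} when is_atom(Module), is_atom(Function) ->
            Module:Function(Url);
        {ok, Fun} when is_function(Fun, 1) ->
            Fun(Url);
        error ->
            {error, {unknown_category, Category}}
    end.

%% Keep only the fields we actually use from the raw decoded object.
sanitize(Fields, Raw) when is_list(Fields), is_map(Raw) ->
    maps:with(Fields, Raw).
```

Adding a new "category" then amounts to adding one entry to the registry; the fetcher itself can start as a dirty lambda and later move to a dedicated module.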

I understand Categories here are firstly means of dealing with data providers, not "categories" in the sense human readers conceptualize/think of things (programs might be for cryptocurrency, chatting, gaming, etc.).

Correct, that's also why using another term could be useful. A term like kind could be a great alternative. In fact, in the end, we should have something close to what you are saying, but we need to start with some easy bricks: creating an individual category for each provider is okay for the moment. When we have enough patterns, we could create another category called repository or sources, and this category will contain gitlab, github and so on. Personally, I think it's an optimization and it should be done later.

So which aspect of the database correlates to the resource organization on the static site? At first glance it would be -type kind(), but the Github or Gitlab, for example, are too generic for this purpose. Will tags determine sub-categories of kinds?

When I started to clean up the list and shared it with others, some creators told me their application was not in the correct category. In fact, one project can be in many categories. In this case, this is not really a category anymore; a resource should probably have many Tags to categorize it.

Tags are added to each project either dynamically (e.g. github tags present on each repository) or manually (e.g. when tags are missing). One resource can have one or many tags, and the search can be done based on these tags. The category, by itself, is not really important at the moment; it is only here to help this application know where data can be retrieved. In other words, I think the kind() type you are looking for is Tags.

Perhaps another term should be used instead of category.

repository (a place where source code is stored)
- github
- gitlab (can be hosted on private instance, we must inform the application)
- gitea (can be hosted on private instance, we must inform the application)
video
- youtube
- dailymotion
- odyssee
publications (a place where publications are stored)
- official publication with DOI (easy to verify and fetch)
- publication without DOI (hard to verify and fetch)
- publication published in a website or a blog without DOI (we must inform the application)
author (someone in charge of a project or having published something)
- this one is not easy to create. Should we use a social network profile or a personal website? or should we create a specific URL or data-structure for it?
books
- books with an ISBN (easy to verify and collect information)
- books without ISBN (self published or digital format, hard to verify)
training
- from a company
- from a provider like udemy

So, that's a complex subject, and because we have not one provider but many, we should probably keep our data structure simple and use only the URL/URI as identifier. The category (or whatever it is called) is here to specify what kind of data will be available in the resource field and how to collect it.

on the static site?

The site is only static on the back-end side. Nothing prevents us from sending the data as JSON or ETF and letting the client (e.g. JavaScript) deal with the relations. Don't forget we don't have 1M entries; at this time, fewer than 10k entries will be added. Any decent browser available on the market can deal with that.
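For the ETF side, a very small sketch using only built-in functions could be the following. The module name awesome_export is hypothetical, and the client is assumed to have its own ETF decoder, which is outside the scope of this example.

```erlang
-module(awesome_export).
-export([to_etf/1, from_etf/1]).

%% Serialize a list of resources (maps) to the External Term Format.
to_etf(Resources) when is_list(Resources) ->
    term_to_binary(Resources, [compressed]).

%% Decode an ETF payload; the safe option refuses to create new atoms.
from_etf(Binary) when is_binary(Binary) ->
    binary_to_term(Binary, [safe]).
```

Using binaries for keys and values, as in the glossary, keeps the exported terms close to what a JSON encoder would accept as well.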

niamtokik commented 10 months ago

Let's design something with SQL first, to explain my main idea about this data structure and the relations between resources.

----------------------------------------------------------------------
-- resources/relations design for Erlang Punch Awesome List.
----------------------------------------------------------------------
DROP TABLE IF EXISTS relations;
DROP TABLE IF EXISTS resources;
DROP TYPE IF EXISTS category;

----------------------------------------------------------------------
-- we have a limited number of categories, an enum seems more efficient.
----------------------------------------------------------------------
CREATE TYPE category AS ENUM (
  'archive',
  'author',
  'book',
  'company',
  'project',
  'publication',
  'repository',
  'social',
  'tag',
  'website'
);

----------------------------------------------------------------------
-- A resource is a URL, an object you can fetch, it could even be a
-- torrent magnet or an SSH server, but it MUST be unique. Here a
-- resource is made of:
--
--   - a category (or a class/type, if you prefer) defining what kind
--     of resource it is. A book, an author and/or a publication are
--     not the same.
--
--   - an optional name (mainly used for authors)
--
--   - a mandatory URI/URL pointing to an object on the web.
--
--   - an optional data field as JSON object containing a document
--     with more information regarding the resource. This element can
--     be updated dynamically by fetching information from API.
--
----------------------------------------------------------------------
CREATE TABLE resources (
  id BIGSERIAL PRIMARY KEY,
  category category NOT NULL,
  name VARCHAR,
  url VARCHAR UNIQUE NOT NULL,
  data JSONB DEFAULT '{}',
  created_at TIMESTAMP DEFAULT now(),
  updated_at TIMESTAMP DEFAULT now()
);

----------------------------------------------------------------------
-- A resource can have a relation with another resource. The relation
-- table is defining that.
----------------------------------------------------------------------
CREATE TABLE relations (
  id BIGSERIAL PRIMARY KEY,
  source_resource_id BIGINT NOT NULL REFERENCES resources(id),
  target_resource_id BIGINT NOT NULL REFERENCES resources(id),
  UNIQUE (source_resource_id, target_resource_id),
  CHECK (source_resource_id != target_resource_id)
);

----------------------------------------------------------------------
-- let's add some tags
----------------------------------------------------------------------
INSERT INTO resources (category, name, url)
     VALUES ('tag', 'testing', 'https://en.wikipedia.org/wiki/Software_testing');
INSERT INTO resources (category, name, url)
     VALUES ('tag', 'static analysis', 'https://en.wikipedia.org/wiki/Static_analysis');
INSERT INTO resources (category, name, url)
     VALUES ('tag', 'dynamic program analysis', 'https://en.wikipedia.org/wiki/Dynamic_program_analysis');
INSERT INTO resources (category, name, url)
     VALUES ('tag', 'concolic testing', 'https://en.wikipedia.org/wiki/Concolic_testing');
INSERT INTO resources (category, name, url)
     VALUES ('tag', 'fuzzing', 'https://en.wikipedia.org/wiki/Fuzzing');
INSERT INTO resources (category, name, url)
     VALUES ('tag', 'profiling', 'https://en.wikipedia.org/wiki/Profiling_(computer_programming)');
INSERT INTO resources (category, name, url)
     VALUES ('tag', 'random testing', 'https://en.wikipedia.org/wiki/Random_testing');
INSERT INTO resources (category, name, url)
     VALUES ('tag', 'regression testing', 'https://en.wikipedia.org/wiki/Regression_testing');

----------------------------------------------------------------------
-- let's add some information about Joe Armstrong
----------------------------------------------------------------------
INSERT INTO resources (category, name, url)
     VALUES ('author', 'Joe Armstrong', 'https://en.wikipedia.org/wiki/Joe_Armstrong_(programmer)');
INSERT INTO resources (category, url)
     VALUES ('website', 'https://joearms.github.io/');
INSERT INTO resources (category, url)
     VALUES ('website', 'https://armstrongonsoftware.blogspot.com/');
INSERT INTO resources (category, url)
     VALUES ('archive', 'https://www.kth.se/profile/jlarm/');
INSERT INTO resources (category, url)
     VALUES ('archive', 'https://www.sics.se/~joe/');
INSERT INTO resources (category, url)
     VALUES ('book', 'https://isbnsearch.org/isbn/9781934356005');
INSERT INTO resources (category, url)
     VALUES ('publication', 'http://ctp.di.fct.unl.pt/~aravara/pubs/editor/Foclasa_2011_Proceedings.pdf');
INSERT INTO resources (category, url)
     VALUES ('publication', 'https://dl.acm.org/doi/fullHtml/10.1145/1810891.1810910');
INSERT INTO resources (category, url)
     VALUES ('publication', 'https://dl.acm.org/doi/10.1145/1238844.1238850');
INSERT INTO resources (category, url)
     VALUES ('publication', 'https://dl.acm.org/doi/10.1145/1022471.1022472');
INSERT INTO resources (category, url)
     VALUES ('publication', 'https://ieeexplore.ieee.org/abstract/document/41871');
INSERT INTO resources (category, url)
     VALUES ('repository', 'https://github.com/ubf/ubf');

----------------------------------------------------------------------
-- let's insert a relation in a dirty way!
----------------------------------------------------------------------
WITH author AS (
       SELECT id FROM resources WHERE category='author' and name = 'Joe Armstrong'
     ),
     resource AS (
       SELECT id FROM resources WHERE category='book' and url = 'https://isbnsearch.org/isbn/9781934356005'
     )
INSERT INTO relations (source_resource_id, target_resource_id)
     SELECT author.id,resource.id
       FROM author, resource;

My idea here is to have an easy way to extract all resources (with all their relations as well). Creating one table per category would be quite time consuming. All the complexity is stored in a JSON object that can easily be updated by an external tool. This object can be versioned later.
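Since Mnesia is the back-end, the relations table above could be approximated as follows. This is only a sketch under assumptions: the module and record name awesome_relation is hypothetical, a bag table is used so that the same source can point to several targets while an identical pair is stored only once, and no check is made that both URLs exist in the resource table.

```erlang
-module(awesome_relation).
-export([init/0, relate/2, targets/1]).

-record(awesome_relation, {source, target}).

%% Create a RAM-only bag table keyed on the source URL.
init() ->
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(
                     awesome_relation,
                     [{type, bag},
                      {attributes, record_info(fields, awesome_relation)}]),
    ok.

%% Mirror of the SQL CHECK: a resource cannot be related to itself.
relate(Url, Url) ->
    {aborted, self_relation};
relate(Source, Target) ->
    mnesia:transaction(fun() ->
        mnesia:write(#awesome_relation{source = Source, target = Target})
    end).

%% Return every target related to Source.
targets(Source) ->
    {atomic, Rows} = mnesia:transaction(fun() ->
        mnesia:read(awesome_relation, Source)
    end),
    [Target || #awesome_relation{target = Target} <- Rows].
```

Relating the Joe Armstrong author URL to the book URL twice still yields a single entry in targets/1, which plays the role of the UNIQUE constraint in the SQL version.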
