Create "data dictionary" for all SCTK fields

We need a comprehensive "data dictionary" of all fields that may be present in a ScanCode output file. The minimum requirement is a list of fields with type (single value, list or ?) and description. This should, of course, be versioned with SCTK releases. We have some of this information for the CSV output at https://scancode-toolkit.readthedocs.io/en/latest/cli-reference/output-format.html#custom-output-format (which may not be current), but we do not have it for the full set of output fields in the JSON output. There is currently no single file in the codebase that defines all fields (the way Django models would), but two files seem to contain most of the field definitions:

./src/scancode/api.py - non-Package fields
./src/packagedcode/models.py - Package fields (see section starting line 361)

A first step should be to investigate whether there are existing automated documentation tools for Python that would help us get started.

This Issue supersedes #112 which is pretty stale at this point.

Some comments from the gitter/discuss channel that seem important:

@pombredanne

the thing is that the output of scancode should be (self) documenting from the models. We are using mostly attrs classes so imho the plan could be:

ensure that we are using @hynek https://github.com/python-attrs/attrs/ attrs classes all the way possibly adopting @Tinche https://github.com/Tinche/cattrs for nested types

define/design a simple way to add a docstring of sorts to each model and attribute ... there is some example on that here https://github.com/nexB/scancode-toolkit/blob/d9ae6e62ebad6a896cf5b58185d833302e95c72d/src/packagedcode/models.py#L143 using this https://github.com/nexB/commoncode/blob/fbe882da6c03352c8043cdd45c72b7ca44239e6d/src/commoncode/datautils.py#L45

There is also a related ticket that I have been tracking: python-attrs/attrs#357

have/create a way to get all that integrated into sphinx (possibly with custom extensions). Publish that as part of the doc publication

Enjoy and relax reading a beautiful doc :P
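To make the "docstring of sorts on each attribute" step more concrete, here is a minimal sketch. It uses the stdlib dataclasses module as a stand-in for attrs (attrs offers the equivalent metadata argument on attr.ib and attr.fields() for introspection). The MatchedRule class, the documented() helper, the "help" metadata key and the help texts are all assumptions made up for illustration, not the actual scancode models.

```python
from dataclasses import dataclass, field, fields


def documented(default, help_text):
    """Hypothetical helper: a field that carries its help text as metadata."""
    return field(default=default, metadata={"help": help_text})


@dataclass
class MatchedRule:
    # Illustrative subset of fields; the help texts are placeholders,
    # not the project's official definitions.
    identifier: str = documented("", "File name of the rule that matched.")
    license_expression: str = documented("", "License expression asserted by the rule.")
    licenses: list = field(
        default_factory=list,
        metadata={"help": "License keys used in the expression."},
    )
    rule_length: int = documented(0, "Length of the rule.")


def data_dictionary(cls):
    """Return one (name, type, help) row per field of a model class."""
    rows = []
    for f in fields(cls):
        type_name = getattr(f.type, "__name__", str(f.type))
        rows.append((f.name, type_name, f.metadata.get("help", "")))
    return rows


rows = data_dictionary(MatchedRule)
for name, type_name, help_text in rows:
    print(f"{name} ({type_name}): {help_text}")
```

Rows like these could then be rendered into a table by a Sphinx extension or a small script run at doc build time, so the data dictionary stays in sync with the models.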

Check also this older approach https://github.com/nexB/scancode-toolkit/tree/d9ae6e62ebad6a896cf5b58185d833302e95c72d/etc/scripts/sch2js

This was taking the generated JSON and reversing a JSON schema from that. That could be OK too as an approach.
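For reference, the idea of reversing a schema from generated output can be sketched in a few lines. This is not the sch2js script itself, only a rough illustration of the approach; the infer_schema name and the type labels are assumptions, and the sample is a trimmed scan result.

```python
import json


def infer_schema(value):
    """Describe the shape of a decoded JSON value as nested dicts and type names."""
    if isinstance(value, dict):
        return {key: infer_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe list items by the first element only; an empty list stays opaque.
        return [infer_schema(value[0])] if value else []
    if value is None:
        return "null"
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


sample = json.loads("""
{
  "path": "samples/zlib/iostream2/zstream.h",
  "type": "file",
  "licenses": [
    {"key": "mit-old-style", "score": 100.0, "is_exception": false,
     "spdx_license_key": null, "start_line": 9}
  ]
}
""")

schema = infer_schema(sample)
print(json.dumps(schema, indent=2))
```

A limitation of this approach is that it only yields names and types: descriptions still have to come from somewhere else, and a field that is null or absent in the sample cannot be typed.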

Tim Hatch @thatch

I don't know about where to document, but sphinx does handle something called "doc comments" -- comments before an assignment that start with #:

documented itself at https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoattribute
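As an illustration of the "#:" syntax, the sketch below shows a class with doc comments and a rough, line-based approximation of how they could be collected. Sphinx does its own (much more robust) parsing, so collect_doc_comments is purely a hypothetical helper, and the Rule attributes and their descriptions here are made up for the example.

```python
import re

SOURCE = '''
class Rule:
    #: Expression of the license keys this rule detects.
    license_expression = None

    #: True if this rule is a full license text.
    is_license_text = False

    # A plain comment: not picked up as documentation.
    internal_cache = None
'''


def collect_doc_comments(source):
    """Map attribute names to the "#:" comment lines directly above them."""
    docs = {}
    pending = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#:"):
            pending.append(stripped[2:].strip())
            continue
        match = re.match(r"(\w+)\s*=", stripped)
        if match and pending:
            docs[match.group(1)] = " ".join(pending)
        # Any other line breaks the link between a comment and an assignment.
        pending = []
    return docs


attribute_docs = collect_doc_comments(SOURCE)
for name, doc in attribute_docs.items():
    print(f"{name}: {doc}")
```

Only the two attributes with "#:" comments are collected, which matches the goal of documenting a selected subset of attributes rather than everything on the class.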

Hey @AyanSinhaMahapatra @pombredanne, I have seen this new project idea about creating docs automatically from scancode data. Earlier I worked on a personal Python package that contains minimal, clean example implementations of popular data structures and the major algorithms. To explain how the module can be used, I also created docs for it, which you can find here. In these docs I import the main source files where all the programs are written. For example, in this DP section the main source code is pulled in along with its docstrings using the automodule directive (which includes documentation from the docstrings present in the source code; for more info refer to this). Coming back to the point: with this directive we can easily fetch docstrings from the main source files and display them along with the code, and in this way we can automate the AboutCode documentation.

autodoc is a Sphinx extension that includes documentation from the docstrings in the source code. It provides several built-in directives, such as automodule, autoclass and autoexception.

.. automodule:: : imports a source file with all of its classes and functions.
.. autoclass:: : imports a single class from the source program.
The key to using these directives is the :members: option:
- If we don't include it at all, only the docstring for the object itself is brought in.
  - For example, as @mjherzog commented, we can import api.py from src/scancode in the docs, and Sphinx will then only parse the module docstring.
- If we just use :members: with no arguments, then all public functions, classes, and methods that have a docstring are brought in.
- If we explicitly list the members, like :members: , , only those explicit members are brought in.

For more, refer to this

In this way, we can parse only the data that we want to show in the docs. The data we want to show is only a small part of the classes, so a plain :members: would also document (a lot of) functions and methods that we don't want. Instead, we take out only the docstrings/class attributes that carry the documentation for a data field and create the docs from those. In this way we can automate the docs for scancode data.

@robin025 the information about autodoc/automodule/autoclass is known, but the issue here is that we want to document attribute members of classes (not all members, only the ones that end up in the result data).

Take an example of this.

Here in licensedcode/models.py, there's a class Rule. We don't want to create documentation based on this class itself. We want to document some of its members, the ones that end up in the result data (after running a license scan with -l). You can look at an example scan result, like the one in the output format docs, to see that the attributes that end up in the result are in the matched_rule section of each detected license.

{
  "path": "samples/zlib/iostream2/zstream.h",
  "type": "file",
  "licenses": [
    {
      "key": "mit-old-style",
      "score": 100.0,
      "name": "MIT Old Style",
      "short_name": "MIT Old Style",
      "category": "Permissive",
      "is_exception": false,
      "owner": "MIT",
      "homepage_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style",
      "text_url": "http://fedoraproject.org/wiki/Licensing:MIT#Old_Style",
      "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit-old-style",
      "spdx_license_key": null,
      "spdx_url": "",
      "start_line": 9,
      "end_line": 15,
      "matched_rule": {
        "identifier": "mit-old-style_cmr-no_1.RULE",
        "license_expression": "mit-old-style",
        "licenses": [
          "mit-old-style"
        ],
        "is_license_text": true,
        "is_license_notice": false,
        "is_license_reference": false,
        "is_license_tag": false,
        "matcher": "2-aho",
        "rule_length": 71,
        "matched_length": 71,
        "match_coverage": 100.0,
        "rule_relevance": 100
      }
    }
  ],

Now, we aren't documenting functions/entire classes as they exist in scancode now. We have to document these attributes of classes selectively, across all of scancode and its plugins.

So, if we have to use the autodoc extension, we have to write one new class for each of the attributes (there are a lot of them) that we want to document, write the docs in its docstring, and then use autoclass to collect all the docs from there. I'm not opposed to this, but we should consider all the options. If you see the suggestion from @thatch above, autoattribute is a much better fit than autoclass here, if this method is preferred at all.

The documentation generation part would be easier this way, because nothing has to be done to collect these. The original suggestion above yours is cleaner code-wise, though, and seems the better way to me, although some exploration is still needed on the collection and docs part.

So consider the method suggested in the comment above by @pombredanne, look into the scancode data by generating some scan results and checking where the data for the attributes is located, and let us know if you have any questions.
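One way to start that exploration is to compare the attribute names known on a model with the keys that actually show up in a scan result. The sketch below assumes a hypothetical compare_fields helper and a hand-written rule_help mapping; "stored_text" is a made-up attribute standing in for model fields that never reach the output.

```python
def compare_fields(model_fields, output):
    """Split names into documented-and-present, undocumented, and not in the output."""
    output_keys = set(output)
    documented = set(model_fields)
    return {
        "in_output": sorted(documented & output_keys),
        "undocumented": sorted(output_keys - documented),
        "not_in_output": sorted(documented - output_keys),
    }


# Keys copied from the "matched_rule" section of the sample scan result above.
matched_rule = {
    "identifier": "mit-old-style_cmr-no_1.RULE",
    "license_expression": "mit-old-style",
    "is_license_text": True,
    "matcher": "2-aho",
    "rule_length": 71,
}

# Hypothetical help texts keyed by attribute name (placeholders only).
rule_help = {
    "identifier": "placeholder help text",
    "license_expression": "placeholder help text",
    "is_license_text": "placeholder help text",
    "stored_text": "placeholder help text",
}

report = compare_fields(rule_help, matched_rule)
for bucket, names in report.items():
    print(f"{bucket}: {', '.join(names)}")
```

The "undocumented" bucket is the to-do list for the data dictionary, and "not_in_output" shows which model attributes should be left out of it.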

aboutcode-org / scancode-toolkit

Create "data dictionary" for all SCTK fields #2008