EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

[meta] pre-build packages, automated testing, unitary tests, integration tests; continuous integration (CI), etc #12

Closed: fititnt closed this issue 2 years ago

fititnt commented 3 years ago

Among the tools on EticaAI/HXL-Data-Science-file-formats, the drafted library temporarily called hxlm.core (see hxlm #11), which is not yet even a proof of concept, is already accumulating essential complexity. Even if this library ends up being used mostly by a few people at @EticaAI/@HXL-CPLP, I believe the bare minimum would be to add tests so new features don't break past implementations or, if they have to break, at least we know when and what broke.

This dedicated issue exists mostly to provide public references in case others need to set up similar features. Also, continuous integration on its own is a separate concern from the code itself.

Context

The current hxlm.core is written in Python. While the concept was born from a single all-in-one file, HXLMeta (see hxlquickmeta (cli tool) + HXLMeta (Usable Class) #9), we're drafting a concept (which may turn out to be too hard to be feasible) of declarative programming (see comment https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/11#issuecomment-788651928 ) and using YAML syntax to, at least:

  1. reference groups of data (often HDataset + HFile)
  2. reference how the data can be used/manipulated (and this part is very important not only to be in the local language, but to be designed to fit legal documents); while the internal name in v0.7.3 is still hcompliance, in English it would be something like "acceptable-use-policy"

In the context of the original idea of HXL-Data-Science-file-formats, having minimum viable products to enforce what is "right" and what is "wrong", plus tooling that deals with the technical parts, could allow exchanging sensitive data at a fast pace while still respecting laws. There is also a need for actors (either semi-automated with a human in the loop, or fully automated HRouting) who cannot themselves see the sensitive data but still need to be able to parse its metadata.

Please note that even if point 2 (hcompliance) does have MVPs and plans from the start to allow even automated auditing, the idea is to make work easier for people who already share sensitive data and need to make decisions quickly on someone else's behalf, while (if necessary) keeping logs of what was done.
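To make this more concrete, here is a toy sketch in Python (it only assumes PyYAML, which the project already uses; the keys and values are illustrative guesses based on the terms in this thread, not the real HDP/hxlm.core schema):

# Toy example only: a declarative YAML document that references a group of
# data and an "acceptable-use-policy" (hcompliance), parsed like any YAML.
# The structure is hypothetical; only the hsilo/hcompliance names come from
# this thread.
import yaml  # PyYAML

HDP_TOY_DOCUMENT = """
- hsilo:
    tag:
      - example
  hdatum:
    - source: data/example.csv
  hcompliance:
    acceptable-use-policy:
      - no-reidentification
      - delete-after-90-days
"""

parsed = yaml.safe_load(HDP_TOY_DOCUMENT)
print(parsed[0]['hsilo']['tag'])                          # ['example']
print(parsed[0]['hcompliance']['acceptable-use-policy'])  # the declared policy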

Yes, in such a context, the idea of automated tests is not overkill compared to the scope of the full project.

Also, everything here is dedicated to the public domain, including the testing setup.

Evaluating continuous integration tools

At this moment I'm not sure if I should use GitHub Actions (which seems to be the new thing) or something more traditional like Travis CI (the open source offering).

I know that Travis has very generous CPU time limits for open source; I'm not sure about GitHub. I know I could set up something like Jenkins, but since I'm also writing the Python code (and I have no money to keep yet another server running for years, which would also require knowing its past issues), I think Jenkins is not an option right now.

fititnt commented 3 years ago

Damn, it worked on the very first commit.

https://travis-ci.com/github/EticaAI/HXL-Data-Science-file-formats/

# .travis.yml
# @from https://github.com/tox-dev/tox-travis
language: python
python:
  - "3.7"
  - "3.8"
  - "3.9"
install: pip install tox-travis
script: tox

[Screenshots taken 2021-03-04: Travis CI build output]


fititnt commented 3 years ago

While for hxlm.core, in particular the Htypes, it makes sense to test the functions directly, for hdpcli, at least in the short term while internals are changing, it seems reasonable to test at a higher level (from this comment: https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/16#issuecomment-802424548 ).

Ok, just discovered this doctest thing https://docs.python.org/3/library/doctest.html.

So python3 -m doctest -v hxlm/core/schema/vocab.py can be used to test what is documented like this:

# (excerpt from hxlm/core/schema/vocab.py; imports shown here for context)
import functools
from typing import Any


class HVocabHelper:

    # (.......)

    def get_value(self, dotted_key: str, default: Any = None) -> Any:
        """Get value by dotted notation key

        Examples:
            >>> from hxlm.core.schema.vocab import HVocabHelper
            >>> HVocabHelper().get_value('datum.POR.i')
            >>> HVocabHelper().get_value('attr.datum.POR.id')
            'dados'

        Args:
            dotted_key (str): Dotted key notation
            default ([Any], optional): Value if not found. Defaults to None.

        Returns:
            [Any]: Return the result. Defaults to default
        """
        keys = dotted_key.split('.')
        return functools.reduce(
            lambda d, key: d.get(
                key) if d else default, keys, self._vocab_values
        )

Output

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ python3 -m doctest -v hxlm/core/schema/vocab.py
Trying:
    from hxlm.core.schema.vocab import HVocabHelper
Expecting nothing
ok
Trying:
    HVocabHelper().get_translation_value('attr.datum.POR.id')
Expecting:
    'dados'
ok
Trying:
    HVocabHelper().get_translation_value('datum.POR.id')
Expecting:
    'dados'
ok
Trying:
    from hxlm.core.schema.vocab import HVocabHelper
Expecting nothing
ok
Trying:
    HVocabHelper().get_value('datum.POR.i')
Expecting nothing
ok
Trying:
    HVocabHelper().get_value('attr.datum.POR.id')
Expecting:
    'dados'
ok
15 items had no tests:
    vocab
    vocab.ConversorHSchema
    vocab.ConversorHSchema.__init__
    vocab.HVocabHelper
    vocab.HVocabHelper.__init__
    vocab.ItemHVocab
    vocab.ItemHVocab.__eq__
    vocab.ItemHVocab.__init__
    vocab.ItemHVocab.__repr__
    vocab.ItemHVocab.diff
    vocab.ItemHVocab.merge
    vocab.ItemHVocab.parse_yaml
    vocab.ItemHVocab.to_dict
    vocab.ItemHVocab.to_json
    vocab.ItemHVocab.to_yaml
2 items passed all tests:
   3 tests in vocab.HVocabHelper.get_translation_value
   3 tests in vocab.HVocabHelper.get_value
6 tests in 17 items.
6 passed and 0 failed.
Test passed.

Note: While I'm not 100% sure these doctests can be added to tox (see https://stackoverflow.com/questions/49254777/how-to-let-pytest-discover-and-run-doctests-in-installed-modules; not tested), it seems that at least it is possible to run them manually with python3 -m doctest -v hxlm/core/schema/vocab.py.
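As a side note, a minimal standard-library alternative (just a generic sketch, not something already in the repository) is to let a module run its own doctests when executed as a script:

# Minimal sketch: appended at the end of a module (for example
# hxlm/core/schema/vocab.py), this runs that module's own doctests when it is
# executed as a script, e.g. `python3 -m hxlm.core.schema.vocab`.
if __name__ == '__main__':
    import doctest
    doctest.testmod(verbose=True)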

fititnt commented 3 years ago

WOW, it worked on the first try! I'm getting good at this; all that past work related to Ansible and testinfra gave me a hint!

Now the automated tests run the equivalent of pytest -vv hxlm/ --doctest-modules.

Context:

It seems that, in theory (see this comment: https://github.com/pytest-dev/pytest/issues/2042#issuecomment-381309723), pytest does not allow explicitly running both the tests/ folder and the Python doctests in one invocation. But using testinfra we simply simulate running the entire pytest -vv hxlm/ --doctest-modules command.

So, it worked!

I will leave here how it's done, since I know it can be reused by others on other projects (or at least help my future self a lot).

tests/test_zzz_doctest.py


# (...)

def test_pytest_doctest_modules_all_may_have_false_positives(host):
    """Run pytest -vv hxlm/ --doctest-modules

    NOTE: the `host` fixture is provided by the testinfra pytest plugin.
    WARNING: test_zzz_doctest.py MAY return false positives (e.g. it may run
    doctest code even outside the hxlm module). Consider temporarily disabling
    this test file and running
        pytest -vv hxlm/ --doctest-modules
    manually.
    """

    cmd = host.run("pytest -vv hxlm/ --doctest-modules")
    # cmd = host.run("pytest --doctest-modules")

    print('cmd.stdout')
    print(cmd.stdout)

    print('cmd.stderr')
    print(cmd.stderr)

    assert cmd.succeeded

# (...)
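For reference, roughly the same check can be written without testinfra, using only the standard library (a sketch, not what this repository uses; the command line is the one quoted above, and the file name is hypothetical):

# tests/test_doctest_subprocess.py  (hypothetical file name)
# Sketch: run `pytest -vv hxlm/ --doctest-modules` as a child process and fail
# the outer test if the inner run fails.
import subprocess


def test_doctest_modules_via_subprocess():
    result = subprocess.run(
        ['pytest', '-vv', 'hxlm/', '--doctest-modules'],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    print(result.stderr)
    assert result.returncode == 0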

hxlm/core/util.py (just an example with a doctest)



# (...)
# Imports needed by this excerpt:
import csv
import json
from functools import lru_cache
from typing import Union

import yaml


@lru_cache(maxsize=128)
def load_file(file_path: str, delimiter: str = ',') -> Union[dict, list]:
    """Generic simple file loader (YAML, JSON, CSV) with cache.

    Args:
        file_path (str): Path or bytes for the file
        delimiter (str): Delimiter. Only applicable if is an CSV/TSV like item

    Returns:
        Union[dict, list]: The loaded file result

    >>> import hxlm.core as HXLm
    >>> file_path = HXLm.HDATUM_UDHR + '/udhr.lat.hdp.yml'
    >>> hsilo_example = load_file(file_path)
    >>> hsilo_example[0]['hsilo']['tag']
    ['udhr']
    """

    with open(file_path, 'r') as stream:
        if file_path.endswith('.json'):
            return json.load(stream)
        if file_path.endswith('.yml'):
            return yaml.safe_load(stream)
        if file_path.endswith('.csv'):
            reader = csv.reader(stream, delimiter=delimiter)
            result = []
            for row in reader:
                result.append(row)
            return result

    raise SystemError('Unknown input [' + str(file_path) + ']')

# (...)
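As a quick usage note (a sketch; it assumes load_file is importable as hxlm.core.util.load_file and reuses the path from the doctest above):

# Sketch: because of @lru_cache, repeated calls with the same arguments are
# served from memory; cache_info() is the standard functools API to verify it.
import hxlm.core as HXLm
from hxlm.core.util import load_file

file_path = HXLm.HDATUM_UDHR + '/udhr.lat.hdp.yml'
first = load_file(file_path)   # reads and parses the file from disk
second = load_file(file_path)  # served from the lru_cache, no disk access
print(first is second)         # True: the exact same cached object
print(load_file.cache_info())  # e.g. CacheInfo(hits=1, misses=1, ...)
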
fititnt commented 3 years ago

To allow tests with JavaScript, I think we could use GitHub Pages. But the Markdown from hdp-conventions/README.md makes Jekyll sad.

[Screenshot taken 2021-04-03]

Since HXL-Data-Science-file-formats is a really huge name, I guess we could use, mostly for the sake of testing, a subdomain from @EticaAI.

fititnt commented 3 years ago

Ok. The #18 HDPLisp prototype on the Racket platform is starting to get complicated (actually it's my first Racket package, so it's complicated because there is a lot of mental context switching).

With the exception of the JavaScript draft, everything else implemented after Tox was set up is covered by automated tests. So, since the Racket prototype is likely to become the reference version, I think it is worth spending some time to set up automated testing from the start.

This is also likely to save time upfront, both for new people and for myself when doing quick updates across several host platforms (Python, JavaScript + Node.js/browser, Racket).

fititnt commented 3 years ago

Wonderful. It worked on the second try, and without errors 😍.

Ok. Now some refactoring. I think we could move hxlm/data/ontologia/ to ontologia/ and then make a symbolic link. The ontologia is becoming the most important part of something that could resemble an hdp-toolchain.
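A sketch of that move (the paths are the ones from the sentence above; this is not a script in the repository, just one way to do it from the repository root):

# Sketch: move hxlm/data/ontologia/ to the repository top level and leave a
# relative symbolic link behind so the old path keeps working.
import shutil
from pathlib import Path

src = Path('hxlm/data/ontologia')
dst = Path('ontologia')

if src.is_dir() and not dst.exists():
    shutil.move(str(src), str(dst))
    # Relative target, resolved from hxlm/data/, so the link survives clones.
    src.symlink_to(Path('../../ontologia'), target_is_directory=True)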

fititnt commented 2 years ago

This was already done some time ago. Some fixes may still be relevant, but closing for now.