Pavell94000 commented 5 years ago

Propose an interface to allow teachers to test a single PL or all the PL in a directory.

nborie commented 5 years ago

For reference, here is the strategy used by the SageMath TestSuite:

sage: S = SymmetricGroup(4)
sage: TestSuite(S).run()                    # RUN THE TESTSUITE
sage: TestSuite(S).run(verbose=True)                     # ENABLE THE INSULTS
running ._test_an_element() . . . pass
running ._test_associativity() . . . pass
running ._test_cardinality() . . . pass
running ._test_category() . . . pass
running ._test_codegrees() . . . pass
running ._test_degrees() . . . pass
running ._test_descents() . . . pass
running ._test_elements() . . .
  Running the test suite of self.an_element()
  running ._test_category() . . . pass
  running ._test_eq() . . . pass
  running ._test_new() . . . pass
  running ._test_not_implemented_methods() . . . pass
  running ._test_pickling() . . . pass
  pass
running ._test_elements_eq_reflexive() . . . pass
running ._test_elements_eq_symmetric() . . . pass
running ._test_elements_eq_transitive() . . . pass
running ._test_elements_neq() . . . pass
running ._test_enumerated_set_contains() . . . pass
running ._test_enumerated_set_iter_cardinality() . . . pass
running ._test_enumerated_set_iter_list() . . . pass
running ._test_eq() . . . pass
running ._test_has_descent() . . . pass
running ._test_inverse() . . . pass
running ._test_new() . . . pass
running ._test_not_implemented_methods() . . . pass
running ._test_one() . . . pass
running ._test_pickling() . . . pass
running ._test_prod() . . . pass
running ._test_reduced_word() . . . pass
running ._test_simple_projections() . . . pass
running ._test_some_elements() . . . pass
running ._test_well_generated() . . . pass
sage:

In the previous code, I instantiated an object S, then asked Sage to run the complete TestSuite adapted to this object. That's a lot of stuff!

Question: how did Sage determine the list of tests to run? Let us look at the source code of the TestSuite:

def run(self, category = None, skip = [], catch = True, raise_on_failure = False, **options):
        """
        ... a lot of doc ...
        """
        if isinstance(skip, str):
            skip = [skip]
        else:
            skip = tuple(skip)

        # The class of exceptions that will be caught and reported;
        # other exceptions will get through. None catches nothing.
        catch_exception = Exception if catch else None

        tester = instance_tester(self._instance, **options)
        failed = []
        for method_name in dir(self._instance):
            if method_name[0:6] == "_test_" and method_name not in skip:
                # TODO: improve pretty printing
                # could use the doc string of the test method?
                tester.info(tester._prefix+"running .%s() . . ."%method_name, newline = False)
                test_method = getattr(self._instance, method_name)
                try:
                    test_method(tester = tester)
                    tester.info(" pass")
                except catch_exception as e:
                    failed.append(method_name)
                    if isinstance(e, TestSuiteFailure):
                        # The failure occured in a nested testsuite
                        # which has already reported the details of
                        # that failure
                        ... MORE MORE CODE ...

The important part is the for loop and the if that follows it. The Sage TestSuite looks for all the methods of the object whose names begin with _test_. Since Python is an introspective language, objects know a lot about themselves...
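The discovery mechanism described above fits in a few lines. This is a minimal sketch and not Sage's actual implementation: `Counter` and `run_test_suite` are hypothetical names used only for illustration.

```python
class Counter:
    """Toy object carrying its own coherence tests (hypothetical example)."""

    def __init__(self, value=0):
        self.value = value

    def _test_non_negative(self):
        assert self.value >= 0

    def _test_is_int(self):
        assert isinstance(self.value, int)


def run_test_suite(instance, skip=()):
    """Run every method whose name starts with '_test_'; return the failed names."""
    failed = []
    for name in dir(instance):  # dir() returns names in alphabetical order
        if name.startswith("_test_") and name not in skip:
            try:
                getattr(instance, name)()
            except Exception:
                failed.append(name)
    return failed


failures_ok = run_test_suite(Counter(3))
failures_bad = run_test_suite(Counter(-1))
```

Any new `_test_*` method added to a class, or inherited from a parent class, is picked up without registering it anywhere.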

In Premier Langage, we could have something similar: a test suite looking for the keys of the exercise dictionary whose names begin with xxxxxx (to be chosen...). The values attached to these keys should respect some specification (for example, each value should be a Python function of the shape f(dic) --> bool, str, with bool the result of the test and str the verbose output of the test).

With such a testing protocol, users and teachers could define, override or inherit from templates a lot of tests... You should think about that...

qcoumes commented 5 years ago

Since the grader only receives a dictionary mapping each input field to the given answers, an exercise would only have to provide a list of tuples (answers, expected_grade).

Example: for an exercise asking for the result of 5 + 6, where the input ID is answer, the test inputs could be:

{"answer": "11"} 100
{"answer": "9"} 0
{"answer": "string"} -1

Now, the main problem is to find a key and the easiest syntax for users to provide these tuples.

My idea is to write JSON such as:

tests =%
{
    "test_name1": {
        "answers": {
            "form_id1": "value1",
            "form_id2": "value2",
            "form_id3": "value3"
        },
        "grade": [VALUE]
     },
    "test_name2": {
        "answers": {
            "form_id1": "value1",
            "form_id2": "value2",
            "form_id3": "value3"
        },
        "grade": [VALUE]
     },
    "test_name3": {
        "answers": {
            "form_id1": "value1",
            "form_id2": "value2",
            "form_id3": "value3"
        },
        "grade": [VALUE]
     }
}
==

Taking the example above, it would be:

tests =%
{
    "good": {
        "answers": {
            "answer": "11"
        },
        "grade": 100
    },
    "bad": {
        "answers": {
            "answer": "9"
        },
        "grade": 0
    },
    "not_int": {
        "answers": {
            "answer": "string"
        },
        "grade": -1
    }
}
==

Where each answer has to be either a string or a list of strings (as these are the only elements returned from an HTML form).
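Once parsed, a dictionary in this format could be consumed by a very small runner. The sketch below is only an illustration: `grade` is a hypothetical stand-in for the real grader of the 5 + 6 exercise, and it assumes that -1 is the grade returned for an answer that cannot be interpreted.

```python
def grade(answers):
    """Hypothetical grader for the '5 + 6' exercise."""
    try:
        value = int(answers["answer"])
    except (KeyError, TypeError, ValueError):
        return -1  # assumption: -1 marks an answer that cannot be graded
    return 100 if value == 11 else 0


def run_tests(tests, grader):
    """Return the names of the tests whose grade differs from the expected one."""
    return [
        name
        for name, test in tests.items()
        if grader(test["answers"]) != test["grade"]
    ]


tests = {
    "good": {"answers": {"answer": "11"}, "grade": 100},
    "bad": {"answers": {"answer": "9"}, "grade": 0},
    "not_int": {"answers": {"answer": "string"}, "grade": -1},
}
failed = run_tests(tests, grade)
```

Naming the tests pays off here: the runner can report exactly which named case regressed.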

Pros
- Not ambiguous (No need to use a specific character to delimits the answers and the expected grade)
- Tests are named
Cons
- Annoying to write

Do you have another syntax to suggest?

@Pavell94000 @nborie @nimdanor @nthiery @mciissee @plgitlogin

qcoumes commented 5 years ago

Another idea, using namespace:

tests.test_name1.answers =%
 {
    "form_id1": "value1",
    "form_id2": "value2",
    "form_id3": "value3"
 }
==
tests.test_name1.grade = [GRADE]

tests.test_name2.answers =%
 {
    "form_id1": "value1",
    "form_id2": "value2",
    "form_id3": "value3"
 }
==
tests.test_name2.grade = [GRADE]

Still using the same example, it would be :

tests.good.answers =%
 {
    "answer": "11"
 }
==
tests.good.grade = 100

tests.bad.answers =%
 {
    "answer": "9"
 }
==
tests.bad.grade = 0

tests.not_int.answers =%
 {
    "answer": "string"
 }
==
tests.not_int.grade = -1

Pros
- Not ambiguous (No need to use a specific character to delimits the answers and the expected grade)
- Tests are named
- Kinda easier to write
Con
- Harder to read as tests are separated

@Pavell94000 @nborie @nimdanor @nthiery @mciissee @plgitlogin

qcoumes commented 5 years ago

Another way to see it: do not enforce any syntax, as long as the parsed result is the same (which is the case for the two syntaxes I gave above).
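To illustrate why the two syntaxes are interchangeable, here is a sketch that expands dotted keys into nested dictionaries and compares the result with the JSON form. It is not the PL parser; `set_by_path` is a hypothetical helper.

```python
import json


def set_by_path(dic, dotted_key, value):
    """Store `value` in nested dictionaries following a dotted key."""
    *parents, last = dotted_key.split(".")
    for part in parents:
        dic = dic.setdefault(part, {})
    dic[last] = value


# JSON syntax: the whole dictionary is given at once.
from_json = {
    "tests": json.loads('{"good": {"answers": {"answer": "11"}, "grade": 100}}')
}

# Namespace syntax: the same dictionary is built key by key.
from_namespace = {}
set_by_path(from_namespace, "tests.good.answers", {"answer": "11"})
set_by_path(from_namespace, "tests.good.grade", 100)
```

Both variables end up holding the same nested dictionary, so whatever consumes the tests never needs to know which syntax was used.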

@Pavell94000 @nborie @nimdanor @nthiery @mciissee @plgitlogin

nthiery commented 5 years ago

I like the idea of multiple independent sections that can be extended / overridden (tests.foo, tests.blah); this would serve the same purposes as the generic tests we have in Sage that @nborie pointed out.

nthiery commented 5 years ago

At such an early stage, I would tend to aim for maximal flexibility, providing a programmatic interface, in the same spirit as nbgrader. I.e. provide an easy way to run an exercise on a given input set, and let the teacher analyse the output as (s)he wishes, and raise an error if something went wrong. Something like

tests.good_grade==
assert grade( {"answer": 11}) == 100
==
tests.bad_grade==
assert grade( {"answer": 9}) == 0
==

There will always be time to introduce syntactic sugar for the most common use cases when they emerge.

Some compatibility with Python's unit testing frameworks could be nice as well.
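One possible way to get that compatibility is to wrap each code section in a `unittest.FunctionTestCase`. This is only a sketch: the `sections` dictionary stands for test sections already extracted from a PL file, and `grade` is a hypothetical grader exposed to the test code.

```python
import io
import unittest


def grade(answers):
    """Hypothetical grader for the '5 + 6' exercise."""
    return 100 if str(answers.get("answer")) == "11" else 0


# Assumption: the test sections have already been parsed into name -> source.
sections = {
    "good_grade": 'assert grade({"answer": 11}) == 100',
    "bad_grade": 'assert grade({"answer": 9}) == 0',
}


def make_case(name, source):
    """Wrap one code section in a unittest test case."""
    def test():
        exec(source, {"grade": grade})  # the section only sees `grade`
    return unittest.FunctionTestCase(test, description=name)


suite = unittest.TestSuite(make_case(n, s) for n, s in sections.items())
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

A failing `assert` in a section is then reported as an ordinary unittest failure, so standard runners and reports work unchanged.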

plgitlogin commented 5 years ago

I think the right way is to save preview runs made by hand, like Python's doctest. You try your exercise and you save the test. If the grade is the one you want, the test is good.

nimdanor commented 5 years ago

Yes, this is a simple way for the teacher. When they test their exercise, we record the answers and the resulting grade. This yields a non-regression test, with good and bad answers. This can be extended to the feedback... or parts of the feedback.

nimdanor commented 5 years ago

This way, the format of the test can be a little more complex and more open to change. A JSON approach to the tests, with a version and all...

nimdanor commented 5 years ago

@qcoumes

If you are working on the preview, you should look at the idea of saving a preview run as a test. This needs a new button. If you are working on exactly that part right now, you should take the opportunity to handle the tests for both at the same time.

If you do this, the test must keep the following elements:

- seed
- answers
- grade
- feedback, and maybe also
- the new context.

The test would then check that the grade and the feedback are equal, and maybe the new context as well.

nimdanor commented 5 years ago

I think what @qcoumes proposed is good:

tests%{ "testname": { "answer": ... , "grade": ...., "seed": ..., "context": ..., "feedback" :..., }, }

The fact that the syntax also permits writing it as follows is good too:

tests.testname%{ "answer": ... , "grade": ...., "seed": ..., "context": ..., "feedback" :..., }
tests.testname.seed=33

For the testing part, it should be a best-effort approach:

- The minimal test is only "answer" and "grade".
- If there is a seed, apply it first.
- If there is a feedback, check the feedback as well.
- If there is a context, check the context as well.
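The best-effort rule can be sketched as follows. `run_exercise` is a hypothetical stand-in for "build with the seed, then grade"; here the seed simply replaces the first operand so that the result stays predictable.

```python
def run_exercise(answers, seed=None):
    """Hypothetical build + grade: asks for seed + 6 (5 + 6 when no seed is given)."""
    a = 5 if seed is None else seed
    ok = answers.get("answer") == str(a + 6)
    grade = 100 if ok else 0
    feedback = "Correct" if ok else "Wrong"
    return grade, feedback, {"a": a, "b": 6}


def check_test(test, run=run_exercise):
    """Compare only the fields the test provides; return the mismatching ones."""
    grade, feedback, context = run(test["answer"], test.get("seed"))
    problems = []
    if grade != test["grade"]:
        problems.append("grade")
    if "feedback" in test and feedback != test["feedback"]:
        problems.append("feedback")
    if "context" in test and context != test["context"]:
        problems.append("context")
    return problems


minimal = {"answer": {"answer": "11"}, "grade": 100}
full = {
    "answer": {"answer": "39"},
    "grade": 100,
    "seed": 33,
    "feedback": "Correct",
    "context": {"a": 33, "b": 6},
}
```

A test recorded with only an answer and a grade keeps working when feedback wording changes, while a fully recorded test catches such changes.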

nborie commented 5 years ago

Sorry, this was written in French... It is long and I did not feel like translating it (I would also have lost 50% of the meaning...).

I state three points that are debatable, but I am fairly convinced of them:

(1) Making a human enter the same information twice is counterproductive and unbearable.
(2) Test as much as possible to get robustness.
(3) Do not underestimate the human review capacity of teachers (review, for example marking papers, is at the heart of our job).

Relying only on (3), with centralized editing handled by a single person, is nonsense, especially for computer scientists. It would feel like disowning ourselves... (2) is only nice if the tests make sense.

Testing what teachers produce:

AUTOMATIC TESTS --> The information needed to build the tests is already available elsewhere.

- A multiple-choice question (QCM) --> nothing to do, or only automatic things, because point (1) applies. The possible choices, right and wrong, are already contained in the grader.
- A programming exercise (language-free) where the teacher supplies a solution --> nothing to do on the author's side, in my opinion. There is a single test to set up: the one checking that the teacher's solution gets 100/100.
- An exercise with hints and remediation (I mean exercises with "if the student answers this, then I give this or that hint") --> same, nothing to add regarding tests. Everything should be extractable automatically, and automatic tests should be implementable generically if the setters/getters of the templates are well designed. If the grader and the feedback can do it, why should a teacher repeat themselves?
- A random exercise. We want to check that the exercise is generated correctly and that its grader also works correctly for the generated instances. Once again, the information is already more or less available elsewhere.

MANUAL TESTS --> The authoring teacher adds things by hand to an exercise because they absolutely want to check that a certain behaviour triggers a certain response.

- An exercise whose answer is an equivalence class (example: 2, 2/1, 4^(1/2), ...). Typically, a math exercise asking for an antiderivative of a function. Here it seems plausible that an authoring teacher would want to check inputs/outputs on their graders. In a way, these tests are likely to be sympy, numpy, sage, ... tests in disguise. Besides, math is not the only field where the right answer is not unique.
- ?????????

I keep racking my brain, asking "When do we have no choice?", "When do we need to be specific?", ... I still have the impression that most tests will live in templates. All my C exercises that have a 'solution' key will be testable without doing anything. I do not want to add code to my exercises, I just want to update my template.

I wonder whether, in addition to the builder and the grader, it would be good to set up a checker on a similar model: a Python executable that takes some arguments and produces feedback for the authoring teacher... I think that, as with builder/grader, the people who write the templates should be responsible for analysing what can be tested automatically. They are in the best position to know.

What PL must provide is support for crawling the database, extracting the contents, including those that have a ==checker==, running the tests and recording the information. This kind of script could both run once a day and be called by an authoring teacher who is tinkering with a template and wants to see whether they broke everything.

At the scale of PL, global tests on an exercise do not seem relevant to me (we are not going to program an eye with opencv to check that browsers display the statements properly...). We would end up producing far-fetched tests with point (2) as dogma. Authoring teachers who write templates are probably in the best position to set up the tests. NOTABLY, some builders produce the exercises and their answers at the same time, so the builder calibrates the grader to come. It seems completely natural to me that a builder could produce a grader and then also calibrate a checker along the way. It is not up to the PL core developers to imagine the exercise tests. I would rather expect from Quentin and Christophe some machinery to churn through exercises, directories and git repos, and then send emails. A clickable button to test all the exercises that inherit from a template, etc... In fact, I want a flexible test API, not targeted exercise tests...

The only thing that might seem generic to me at the scale of PL (subject-free: computer science, math, history, theoretical swimming, ...) is some assert-like hack. Something like: a crawler reads the whole database, looks for all the exercises with 'assert_macheprout' keys, each of which must be a script or a function returning a boolean, and finally we check that it gives True for everyone. It can work, but such an approach is so semantically poor that I do not want it.

PL-server must think in terms of a test API, and teachers (authors or not) must think in terms of automating robustness checks for their exercises. The more I think about it, the more I believe this is the cleanest and most sensible way... I also think we want very few manual tests, except for very specific cases...

PAVÉ CAESAR !

qcoumes commented 5 years ago

I agree with most of the ideas; that's why I proposed a format for the tests (JSON). The platform doesn't care where the JSON comes from (a template, a pl, the builder) as long as it is there after the build step.

This would allow templates for "closed" exercises such as QCM to generate the tests, and templates for more "open" exercises, such as math or programming, to provide some generic tests, while still allowing the pl itself to add some.

On the other hand, I didn't really understand what you meant by the checker: what is it supposed to test?

plgitlogin commented 5 years ago

@nborie Sold. Does the following rewording work for you?

Two types of tests:

generic: defined on a template that allows it, like the Cbank template that uses a "Soluce"; in this case all the exercises defined on this template are "testable" (note that if the Soluce is wrong, the test still passes without complaint). => creation of a test procedure which runs through all the exercises (in one way or another) and which, following the inheritance links (extends / template), launches the right test on each exercise.
Advantage: a single test to write for all the exercises of this template.
Difficulty: the type of exercise does not necessarily allow writing a generic test.

specific: the "tests" property defined above (by @qcoumes), which checks just one precise exercise. This makes it possible to check that modifying a template does not break the exercises defined on it. => creation of a test procedure that runs through all the exercises that depend on the template being modified and performs the tests on each exercise.
Advantage: a generic procedure for all cases.
Drawback: a "tests" property must be provided by hand for each exercise. But with the test recording system in preview mode, this should not be too painful. We can make it mandatory to save a "tests" property in each exercise whose template has no generic test.

(1) Making a human enter the same information twice is counterproductive and unbearable.
Yes, that's a must.
(2) Test as much as possible to get robustness.
Yes to that also.
(3) Do not underestimate the human review capacity of teachers (review, for example marking papers, is at the heart of our job).
Yes, so let's use two things:

- the preview (this is a review ;)
- post-use analysis, meaning tests based on the students' answers.

Testing what teachers produce:

AUTOMATIC TESTS --> The information needed to build the tests is already available elsewhere.
generic

A multiple-choice question (QCM) --> nothing to do, or only automatic things, because point (1) applies. The possible choices, right and wrong, are already contained in the grader.
There can still be errors, but they are detected in the builder, so no worry for the student. We may possibly want to test whether the graphical result is as expected ..... :(

A programming exercise (language-free) where the teacher supplies a solution --> nothing to do on the author's side, in my opinion. There is a single test to set up: the one checking that the teacher's solution gets 100/100.
generic

An exercise with hints and remediation (I mean exercises with "if the student answers this, then I give this or that hint")
I have no reference exercise for this !!!!!! Give me examples !!
Same, nothing to add regarding tests. Everything should be extractable automatically, and automatic tests should be implementable generically if the setters/getters of the templates are well designed. If the grader and the feedback can do it, why should a teacher repeat themselves?
My question is: can we write a generic test for this type of exercise? If so, good news !!

A random exercise. We want to check that the exercise is generated correctly and that its grader also works correctly for the generated instances. Once again, the information is already more or less available elsewhere.

MANUAL TESTS --> The authoring teacher adds things by hand to an exercise because they absolutely want to check that a certain behaviour triggers a certain response.

An exercise whose answer is an equivalence class (example: 2, 2/1, 4^(1/2), ...). Typically, a math exercise asking for an antiderivative of a function. Here it seems plausible that an authoring teacher would want to check inputs/outputs on their graders. In a way, these tests are likely to be sympy, numpy, sage, ... tests in disguise. Besides, math is not the only field where the right answer is not unique.

I keep racking my brain, asking "When do we have no choice?", "When do we need to be specific?", ... I still have the impression that most tests will live in templates. All my C exercises that have a 'solution' key will be testable without doing anything. I do not want to add code to my exercises, I just want to update my template.
YES, it is cool that they are testable, but that is not always the case.
I wonder whether, in addition to the builder and the grader, it would be good to set up a checker on a similar model: a Python executable that takes some arguments and produces feedback for the authoring teacher... I think that, as with builder/grader, the people who write the templates should be responsible for analysing what can be tested automatically. They are in the best position to know.

Yes, it is a good idea to be able to run the generic tests on the exercise being created.

What PL must provide is support for crawling the database, extracting the contents, including those that have a ==checker==, running the tests and recording the information. This kind of script could both run once a day and be called by an authoring teacher who is tinkering with a template and wants to see whether they broke everything.

At the scale of PL, global tests on an exercise do not seem relevant to me (we are not going to program an eye with opencv to check that browsers display the statements properly...). We would end up producing far-fetched tests with point (2) as dogma. Authoring teachers who write templates are probably in the best position to set up the tests. NOTABLY, some builders produce the exercises and their answers at the same time, so the builder calibrates the grader to come. It seems completely natural to me that a builder could produce a grader and then also calibrate a checker along the way. It is not up to the PL core developers to imagine the exercise tests. I would rather expect from Quentin and Christophe some machinery to churn through exercises, directories and git repos, and then send emails. A clickable button to test all the exercises that inherit from a template, etc... In fact, I want a flexible test API, not targeted exercise tests...

We need to add test-related elements to the interfaces.
A tests button in the editing interface that runs the generic test of the current template and, if the exercise is a template, runs the tests on its descendants.

We need a way to organize the dependency links between exercises (IN BOTH DIRECTIONS).

The only thing that might seem generic to me at the scale of PL (subject-free: computer science, math, history, theoretical swimming, ...) is some assert-like hack. Something like: a crawler reads the whole database, looks for all the exercises with 'assert_macheprout' keys, each of which must be a script or a function returning a boolean, and finally we check that it gives True for everyone. It can work, but such an approach is so semantically poor that I do not want it.
Yeah, I do not want it either.

PL-server must think in terms of a test API, and teachers (authors or not) must think in terms of automating robustness checks for their exercises. The more I think about it, the more I believe this is the cleanest and most sensible way... I also think we want very few manual tests, except for very specific cases...

Avé pavé MORITURI TE SALUTANT

nimdanor commented 5 years ago

I agree with @plgitlogin ;)

qcoumes commented 5 years ago

To summarize:

Tests must be present in the key tests and must be a dictionary respecting this format:

{
    "[test name]": {
        "answers": {
            "[form id]": "[value]",
            "[form id]": "[value]",
            .
            .
            .
        },
        "grade": [expected grade],
        "feedback": "[expected feedback]",
        "seed": [expected seed]
    },
    "[test name]": {
        "answers": {
            "[form id]": "[value]",
            "[form id]": "[value]",
            .
            .
            .
        },
        "grade": [expected grade]
    },
    .
    .
    .
}

Where:

Every key ([test name], "answers", ...) must be a string "string".
value must be either a string "string" or a list of strings ["a", "list", "of", "string"].
expected grade must be an integer in [-1, 100].
feedback is optional and must be a string "string".
seed is optional and must be an integer or a float.
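These rules are mechanical enough to be checked automatically. The validator below is a sketch written from the rules listed above; `validate_tests` is a hypothetical name and not part of the platform.

```python
def validate_tests(tests):
    """Return a list of error messages; an empty list means the format is respected."""
    errors = []
    for name, test in tests.items():
        if not isinstance(name, str):
            errors.append(f"{name!r}: test name must be a string")
        answers = test.get("answers")
        if not isinstance(answers, dict):
            errors.append(f"{name}: 'answers' must be a dictionary")
        else:
            for form_id, value in answers.items():
                is_str_list = isinstance(value, list) and all(
                    isinstance(v, str) for v in value
                )
                if not (isinstance(value, str) or is_str_list):
                    errors.append(
                        f"{name}: value of {form_id!r} must be a string or a list of strings"
                    )
        grade = test.get("grade")
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(grade, bool) or not isinstance(grade, int) or not -1 <= grade <= 100:
            errors.append(f"{name}: 'grade' must be an integer in [-1, 100]")
        if "feedback" in test and not isinstance(test["feedback"], str):
            errors.append(f"{name}: 'feedback' must be a string")
        seed = test.get("seed")
        if "seed" in test and (isinstance(seed, bool) or not isinstance(seed, (int, float))):
            errors.append(f"{name}: 'seed' must be an integer or a float")
    return errors


valid = validate_tests({"good": {"answers": {"answer": "11"}, "grade": 100, "seed": 33}})
invalid = validate_tests({"bad": {"answers": {"answer": 11}, "grade": 101}})
```

Running such a check right after the build step would report malformed tests to the author before any grader is called.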

This dictionary can be created in different ways:

In the PL

One can use any of the operators currently provided by the PL syntax: JSON syntax

dic = %
[JSON]
==

or namespace syntax

dic.sub_dic1.sub_dic2.sub_dicX.key = value

Both of these methods can be used together: one can create a dictionary with the =% operator and edit / expand it with the namespace syntax:

tests =%
{
    "test_1": {
        "answers": {
            "answer": "my answer"
        }
    }
}
==

tests.test_1.answers.answer = "My new answer" # Edit the dictionary
tests.test_1.grade = 100 # Add a new key to the dictionary

tests.test_2 =% # Adding a new test
{
    "answers": {
        "answer": "my answer"
    },
    "grade": 10
}
==

This allows different components of tests to be extended / overridden, as noted by @nthiery.

In the builder

The dictionary can also be created programmatically inside the builder since, most of the time, the answers are already present somewhere inside the PL, as noted by @nborie. This allows tests to be written inside the template so that the teacher using it doesn't have to bother with tests.

Caution: since the builder is executed after the PL is parsed (obviously), a user wishing to extend the generic tests of the template would in fact create the dictionary first. Therefore, the template author has to first check whether some tests already exist and, if so, append the generic tests to them.
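Inside a builder, the merge described in the caution could look like this sketch. `add_generic_tests` is a hypothetical helper, and letting the test written in the PL win on a name clash is an assumption, not a decided rule.

```python
def add_generic_tests(context, generic_tests):
    """Append the template's generic tests without overwriting tests set in the PL."""
    tests = context.setdefault("tests", {})  # keep tests already written by the user
    for name, test in generic_tests.items():
        tests.setdefault(name, test)  # assumption: the user's test wins on a clash
    return context


# Context as it might look after parsing a PL that already defines one test.
context = {
    "solution": "11",
    "tests": {"custom": {"answers": {"answer": "9"}, "grade": 0}},
}
generic = {
    "solution_is_100": {"answers": {"answer": context["solution"]}, "grade": 100},
}
add_generic_tests(context, generic)
```

After the call, `context["tests"]` holds both the hand-written `custom` test and the generated `solution_is_100` test.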

Inside the preview

As noted by @nimdanor, tests could also be created through the preview interface, by recording what the user types in the forms and which grade they receive.

nthiery commented 5 years ago

For whatever it's worth: the whole discussion makes sense to me; I still recommend enabling tests written as code -- and not just as data. This is much more flexible in particular for authoring generic tests.

nborie commented 5 years ago

Just as there exists a None builder (the builder doing nothing), I can imagine that a generic unknown exercise has the None checker (the checker checking nothing).

Since I implemented the template stdsandboxC.pl (standard C exercise using a sandbox), I made a lot of choices on my own. I chose to sometimes define a 'solution' key or, if I do not provide a 'solution', an 'expected_output' key. My grader does a specific job, calls a compiler, etc... My template should fit all standard C exercises with small pieces of code, but it is very specific compared to everything that can be found in PremierLangage.

These choices are mine and belong to my template. Therefore, here is the proper way to check it:
(1) parse the exercise and establish the dictionary
(2) search for the key 'solution' in the dict
(3) build my exercise with its builder
(4) since I use editor.code (customized with C coloration), place the content of the key 'solution' inside the form editor.code
(5) call my grader
(6) check that all tests pass and the grade is 100 / 100, and return True if this is the case
(7) if it returns False, write something to stderr so that the teacher can debug further
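The procedure above can be sketched as a single function. Everything here is illustrative: `build` and `grade` are passed in as callables because the real builder and grader are external scripts, `fake_build` and `fake_grade` are trivial stand-ins, and returning False when no 'solution' key exists is an assumption (the template may rely on 'expected_output' instead).

```python
import io
import sys


def check_solution(exercise, build, grade, log=sys.stderr):
    """Return True when the teacher's own solution is graded 100/100."""
    if "solution" not in exercise:                      # step (2)
        print("no 'solution' key: nothing to check", file=log)
        return False
    context = build(exercise)                           # step (3)
    answers = {"editor.code": context["solution"]}      # step (4)
    result, feedback = grade(context, answers)          # step (5)
    if result == 100:                                   # step (6)
        return True
    print(f"solution graded {result}/100: {feedback}", file=log)  # step (7)
    return False


def fake_build(exercise):
    """Stand-in builder: returns a copy of the parsed exercise."""
    return dict(exercise)


def fake_grade(context, answers):
    """Stand-in grader: accepts only code identical to the solution."""
    ok = answers["editor.code"] == context["solution"]
    return (100, "all tests passed") if ok else (0, "output differs")


log = io.StringIO()
passed = check_solution({"solution": "int main(void) { return 0; }"}, fake_build, fake_grade, log)
skipped = check_solution({"expected_output": "42"}, fake_build, fake_grade, log)
```

Since the function only depends on the keys the template itself defines, it can live in the template and be inherited by every exercise built on it.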

This procedure is completely template dependent. It is consistent with my choices (all of which already live in my template). This is why I really think that coherence checks should be established by template writers inside their templates.

From that, anyone can use my stdsandboxC.pl template and thus get tests for free. Or, when implementing a new template randomsandboxC.pl, someone can inherit from stdsandboxC.pl but override my 'checker' key with new code doing the same job with 4 different seeds because he loves the number 4.

Currently we have:
builder --> python3 builder.py CONTEXT_FILE MODIFIED_CONTEXT_FILE
grader --> python3 grader.py CONTEXT_FILE ANSWERS_FILE MODIFIED_CONTEXT_FILE FEEDBACK_FILE

So, I imagine a ==checker== code environment such that:
checker --> python3 checker.py CONTEXT_FILE EXTRA_CONTEXT_INFORMATION LOG_FILE

The first argument would be the exercise dict; then the user can provide an extra JSON dict for specific or manual tests; and finally a log file that can be sent by mail if the checker runs during the night...
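A possible skeleton for such a checker.py, following the three-argument convention proposed above. The check itself (looking for a 'solution' or 'expected_output' key) and the log format are placeholders chosen for this sketch.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse: checker.py CONTEXT_FILE EXTRA_CONTEXT_INFORMATION LOG_FILE."""
    parser = argparse.ArgumentParser(prog="checker.py")
    parser.add_argument("context_file", help="JSON file holding the exercise dict")
    parser.add_argument("extra_context_file", help="JSON file with extra or manual tests")
    parser.add_argument("log_file", help="where the report is written")
    return parser.parse_args(argv)


def main(argv=None):
    """Run the check and write a one-line report; return 0 on success, 1 otherwise."""
    args = parse_args(argv)
    with open(args.context_file) as f:
        context = json.load(f)
    with open(args.extra_context_file) as f:
        extra = json.load(f)
    # Placeholder check: the template needs one of these two keys to be testable.
    ok = "solution" in context or "expected_output" in context
    with open(args.log_file, "w") as log:
        status = "OK" if ok else "FAIL"
        log.write(f"{status}: {len(extra.get('tests', {}))} extra test(s) provided\n")
    return 0 if ok else 1
```

As a script, the file would end with `sys.exit(main())` so that the exit code tells the server whether the check passed.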

The core PL server could now provide a nice API to automate the massive launching of these tests.

I am trying to imagine something coherent with everything that has been done and our identified use cases...

qcoumes commented 5 years ago

As there exists a None builder (the builder doing nothing), I can imagine that a generic unknown exercise has the None checker (the checker checking nothing).

This isn't really intended: the builder should be optional, but right now PL does require a grader, so we use none.py as a workaround. Indeed, the checker would also be optional.

nborie commented 5 years ago

This is perhaps the time to design an API for that: an API allowing teachers to test their PL...

Perhaps a new key inside templates and/or some pl exercises! Anyway, teachers should specify how to test their exercises, not the core devs...

check_senario==
[
    [(json_fictive_student_answer1, expected_grade1, expected_feedback1)],
    [(json_fictive_student_answer2, expected_grade2, expected_feedback2)],
    [(json_fictive_student_answer3, expected_grade3, expected_feedback3)],
    [(json_fictive_student_answer_first_try, expected_grade_first_try, expected_feedback_first_try),
     (json_fictive_student_answer_second_try, expected_grade_second_try, expected_feedback_second_try)]
]
==

The above proposal is just horrible... But on one side, I see DR shouting "WRITE A TEST, DAMN IT!", and I really think that teachers should provide scenario tests inside their templates. I am currently the one who knows how to test a QCM or a C exercise. And at each new version (or new commit), these massive tests could be run on all exercises.
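A scenario of this kind, a list of successive tries with the expected grade and feedback for each, could be replayed as in the sketch below. The stateful `grade` function is a hypothetical grader that counts attempts; `run_scenario` is an illustrative name.

```python
def run_scenario(scenario, grade):
    """Play the tries in order; return the index of the first mismatch, or None."""
    state = {}  # shared between the tries of one scenario
    for index, (answers, expected_grade, expected_feedback) in enumerate(scenario):
        result, feedback = grade(answers, state)
        if (result, feedback) != (expected_grade, expected_feedback):
            return index
    return None


def grade(answers, state):
    """Hypothetical grader for '5 + 6' that counts attempts in `state`."""
    state["tries"] = state.get("tries", 0) + 1
    if answers.get("answer") == "11":
        return 100, "Correct"
    return 0, f"Wrong, attempt {state['tries']}"


scenarios = [
    [({"answer": "11"}, 100, "Correct")],
    [({"answer": "9"}, 0, "Wrong, attempt 1"), ({"answer": "11"}, 100, "Correct")],
]
results = [run_scenario(scenario, grade) for scenario in scenarios]
```

Returning the index of the failing try tells the author at which step of a multi-try scenario the exercise diverged.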

Please think about a design for that, DR! Allowing a teacher to provide some test scenarios is the right way to make PL more robust.

nborie commented 5 years ago

Currently, both prod and preprod are broken by the update of some keys (text, form, ...)... This would not happen with such tests... Also, as an exercise writer, I would be very happy to be able to test my new exercises in this way...

nborie commented 5 years ago

Also, with such scenario tests embedded in exercises, we could gain a lot on the "MCO" of resources...

Currently, when you change a template, how can you be sure that you don't break something inheriting from the template you are editing?

Typically, in Sage, every new object must have coherence tests, and whoever implements the new object is responsible for providing good tests that allow Sage to keep its global coherence. It is easy to break something you do not know exists, and Sage has more than 2 million lines of Python code (today no one knows all of Sage...). But when one thing breaks another, it is not the release manager who runs into the problem first.

nthiery commented 5 years ago

This is perhaps the time to design an API for that: an API allowing teachers to test their PL... teachers should specify how to test their exercises, not the core devs...

+1!

Just to add one data point: in my use case (cpp-info111), I should be able most of the time to build the expected answer automatically from the exercise data. And probably a wrong answer as well (though that's trickier). So it would be good if the API enabled calling functions from the builder to construct the scenario.

Of course this would just make for self-consistency tests. They should be complemented by some hand-crafted tests for selected exercises.
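A minimal Python sketch of that idea, under stated assumptions: `build`, `grade`, `make_scenarios` and `self_consistency_check` are hypothetical names, not part of the PL API, and the 0/100 grade scale is assumed for illustration only.

```python
import random


def build(seed):
    """Hypothetical builder: generates the exercise data from a seed."""
    rng = random.Random(seed)
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return {"a": a, "b": b, "text": f"What is {a} + {b}?"}


def grade(data, answer):
    """Hypothetical grader: returns (grade, feedback) for a student answer."""
    if answer == data["a"] + data["b"]:
        return 100, "Correct"
    return 0, "Wrong"


def make_scenarios(data):
    """Derive (answer, expected_grade) pairs from the builder's own data."""
    right = data["a"] + data["b"]
    wrong = right + 1  # any value different from the right answer
    return [(right, 100), (wrong, 0)]


def self_consistency_check(seed):
    """Replay the generated scenarios through the grader."""
    data = build(seed)
    return all(grade(data, answer)[0] == expected
               for answer, expected in make_scenarios(data))
```

Because the scenarios come from the same data as the grader, this only catches inconsistencies between the two; it does not replace hand-crafted tests.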

nimdanor commented 5 years ago

Two different things:

Testing the server:

"They should be complemented by some hand-crafted tests for selected exercises." The selected exercises are chosen for the server elements needed to run them.

An API for testing the exercises: I like @nborie's checker idea.
One of the standard checkers can use the test API proposed by @qcoumes, with a declarative syntax in which we define:

tests.name.seed=
tests.name.response%= # anything goes here; it depends on the exercise's JSON of student answers
tests.name.grade=
tests.name.feedback=
tests.name.feedbackcheck= def check(feedback): return True # checks the content of the feedback

etc. See previous comment.
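One way such declarations could be read and replayed, as a rough Python sketch. `parse_tests` and `run_test` are hypothetical helpers, not an existing PL or sandbox API; the `grader(seed, response)` signature is an assumption, and the `%` suffix and the `feedbackcheck` function are deliberately left out.

```python
def parse_tests(lines):
    """Group 'tests.<name>.<key>=<value>' lines into {name: {key: value}}."""
    tests = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("tests.") or "=" not in line:
            continue  # not a test declaration
        path, value = line.split("=", 1)
        parts = path.strip().split(".")
        if len(parts) != 3:
            continue  # expected exactly tests.<name>.<key>
        _, name, key = parts
        tests.setdefault(name, {})[key] = value.strip()
    return tests


def run_test(spec, grader):
    """Run one parsed test against grader(seed, response) -> (grade, feedback)."""
    grade, feedback = grader(int(spec["seed"]), spec["response"])
    ok = grade == int(spec["grade"])
    if "feedback" in spec:
        # The expected feedback only needs to appear in the actual feedback.
        ok = ok and spec["feedback"] in feedback
    return ok
```

Each named test is then a small dictionary, so a checker could loop over `parse_tests(...)` and report the names for which `run_test` returns `False`.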

nimdanor commented 5 years ago

Step 1: We need to add the checker functionality to the sandbox.

Step 2: The editor test button must do two things:

run the checker, and then
run the preview in test mode.

Step 3: When creating an activity based on exercises (for now, only pltp), all the exercises must be checked before the activity is created. If one of the tests fails: error messages, and the activity is not created. Maybe add a failsafe mode (only the non-failing exercises are added to the activity, with warning messages for the others).

Step 4: (I am not sure of the usefulness of this.) A testing run that checks all the exercises of a part of the platform (a sub-directory) and creates a summary of all problems in HTML format, for Jenkins, or something else.
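Step 3, including the optional failsafe mode, could look roughly like the Python sketch below. `select_exercises` and the `check` callable are hypothetical and only illustrate the control flow; they are not existing PL functions.

```python
def select_exercises(exercises, check, failsafe=False):
    """Return (accepted, report) for a list of exercise names.

    check(name) is a hypothetical callable returning a list of error
    messages (empty when every test of the exercise passes).
    """
    accepted, report = [], {}
    for name in exercises:
        errors = check(name)
        if errors:
            report[name] = errors
        else:
            accepted.append(name)
    if report and not failsafe:
        # Strict mode: one failing exercise blocks the whole activity.
        raise ValueError(
            f"activity not created, failing exercises: {sorted(report)}"
        )
    return accepted, report
```

The same `report` dictionary could feed the summary of Step 4 when the function is run over every exercise of a sub-directory with `failsafe=True`.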

nimdanor commented 5 years ago

Ping

Pavell94000 commented 4 years ago

The PL should be reloaded when clicking "run tests" (rather than using the PL compiled for the file browser preview).

We still need a button to add the last run of a preview as a test

PremierLangage / premierlangage

Allow teachers to test their PL #99

To make a little summary.

In the PL

In the builder

Inside the preview

[(json_fictive_student_answer_first_try, expected_grade_first_try, expected_feedback_first_try), (json_fictive_student_answer_second_try, expected_grade_second_try, expected_feedback_second_try)] ]
