Outage Report: Grading

Grading of student submissions was stuck in the "sent" status for several days, until 2022/09/28.

The cause was an incompatibility between the kubectl version and the kubeconfig, which still used the client.authentication.k8s.io/v1alpha1 exec API, as described here: https://github.com/aws/aws-cli/issues/6920

The solution

Work around the kubectl version issue as described here https://github.com/aws/aws-cli/issues/6920#issuecomment-1193399988, by changing the kubeconfig in the portal deployment repo (replacing v1alpha1 with v1beta1).
Change the value of STUDENT_REPO_NAME in the django configmap.
Add the variable BATCH_NAME to the django configmap.
All changes: https://github.com/LDSSA/portal-deployment/compare/master...fix-grading

Some code and notes from the debugging process

Look at the logs.
Look at the code and check what happens when a student submits SLU00 (write notes about this).
The student submits the LU here: https://portal.lisbondatascience.org/academy/student/units/SLU00/
Code path: portal/academy/views.py -> BaseUnitDetailView -> settings.GRADING_CLASS -> portal.grading.services.AcademyKubernetesGrading

Submit SLU00.
Can I run the portal locally and test submissions that way?
Check whether the container image I tell it to use for grading actually exists: https://hub.docker.com/repository/docker/ldssa/batch5-slu00
I'm now running the portal locally and will test submissions locally before I deploy to prod.
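The image check from the notes above can also be scripted. This is a minimal sketch, assuming Docker Hub's public v2 repositories endpoint and anonymous access; `hub_repo_url` and `image_exists` are hypothetical helpers, not part of the portal. A private repository also answers 404 to anonymous requests, so a False result is only a hint.

```python
import urllib.error
import urllib.request

# Assumption: Docker Hub's public v2 API. The portal itself does not call this.
HUB_API = "https://hub.docker.com/v2/repositories"


def hub_repo_url(image: str) -> str:
    """Build the Docker Hub API URL for an image such as 'ldssa/batch5-slu00'."""
    repo = image.split(":", 1)[0]  # drop an optional ':tag' suffix
    namespace, _, name = repo.partition("/")
    if not name:
        # Images without a namespace are official images under 'library'.
        namespace, name = "library", namespace
    return f"{HUB_API}/{namespace}/{name}/"


def image_exists(image: str, timeout: float = 10.0) -> bool:
    """Return True if Docker Hub answers 200 for the repository, False on 404."""
    try:
        with urllib.request.urlopen(hub_repo_url(image), timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits or auth errors should not read as "missing"
```

Calling `image_exists("ldssa/batch5-slu00")` from a machine with network access gives a quick answer before a grading pod is started.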

Next step: run grading in a django shell

Tried changing the kubeconfig in the Dockerfile, but the file does not exist at build time, so maybe Django creates it at some later point? When? Where? How can I modify it at that time?

I can change the kubeconfig in the portal deployment repo, then apply it to the cluster. Let's see if that works.

Also, I should not have the batch name in constance; instead I should use the configmap and an environment variable.
There are some outdated environment variables in the configmaps (like bucket names and such).
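Reading the batch name from the environment could look roughly like the sketch below. `grading_image` is a hypothetical helper, and the `ldssa/<batch>-<unit>` pattern is only inferred from the image name `ldssa/batch5-slu00` in these notes; the portal's real code may build the name differently.

```python
import os


def grading_image(unit_code: str, environ=os.environ) -> str:
    """Build the grading image name for a unit, e.g. 'ldssa/batch5-slu00'.

    BATCH_NAME is the variable added to the django configmap. The naming
    pattern is an assumption based on the image seen during debugging.
    """
    batch = environ.get("BATCH_NAME")
    if not batch:
        # Fail loudly instead of silently grading against the wrong batch.
        raise RuntimeError("BATCH_NAME is not set in the environment")
    return f"ldssa/{batch}-{unit_code.lower()}"
```

Passing `environ` explicitly keeps the function easy to exercise in a Django shell without touching the pod's real environment.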


kubectl exec -ti $(kubectl get pods -l app=django -o custom-columns=:metadata.name | tail -n +2 | head -1) -- bash

source docker/production/django/entrypoint

python manage.py shell

from datetime import datetime, timezone
from portal.academy import models, serializers
from django.conf import settings
from rest_framework.settings import import_string

# get user

from portal.users.models import User
user = User.objects.filter(username='mig_student').first()

# get unit

unit = models.Unit.objects.filter(code='SLU00').first()

# get grade object

grade = models.Grade(user=user, unit=unit)
grade.save()

Grading = import_string(settings.GRADING_CLASS)
grading = Grading(grade=grade)

image = grading.get_image()
name = grading.get_name()
env = grading.get_env()
cmd = grading.get_command(image, name, env)
grading.start_message()
cmd_str = ' '.join(cmd)
cmd_str
grading.run_grading()

def post(self, request, *args, **kwargs):
    unit, _, _ = self.get_object()
    grade = models.Grade(user=self.request.user, unit=unit)

    if not unit.checksum:
        raise RuntimeError("Not checksum present for this unit")

    # Grade sent on time?
    due_date = datetime.combine(
        unit.due_date, datetime.max.time(), tzinfo=timezone.utc
    )
    grade.on_time = datetime.now(timezone.utc) <= due_date

    # Clear grade
    grade.status = "sent"
    grade.score = None
    grade.notebook = None
    grade.message = ""
    grade.save()

    # Send to grading
    Grading = import_string(settings.GRADING_CLASS)
    Grading(grade=grade).run_grading()

    return HttpResponseRedirect(request.path_info)
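
The on-time check in the view above treats the due date as inclusive, up to the last microsecond of that day in UTC. A minimal standalone sketch of the same comparison, where `is_on_time` is a hypothetical helper and not a function in the portal:

```python
from datetime import date, datetime, timezone


def is_on_time(due: date, submitted_at: datetime) -> bool:
    """Return True if submitted_at falls on or before the end of the due date (UTC)."""
    # datetime.max.time() is 23:59:59.999999, so the whole due day counts.
    deadline = datetime.combine(due, datetime.max.time(), tzinfo=timezone.utc)
    return submitted_at <= deadline
```

`submitted_at` has to be timezone-aware, as `datetime.now(timezone.utc)` is in the view; comparing a naive datetime against the aware deadline raises a TypeError.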

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: example
    server: https://A78856EFA2476BE5993EC48CFA975782.yl4.eu-west-1.eks.amazonaws.com
  name: arn:aws:eks:eu-west-1:036806565123:cluster/portal-batch4
contexts:
- context:
    cluster: arn:aws:eks:eu-west-1:036806565123:cluster/portal-batch4
    user: arn:aws:eks:eu-west-1:036806565123:cluster/portal-batch4
  name: portal-batch4
current-context: portal-batch4
kind: Config
preferences: {}
users:
- name: arn:aws:eks:eu-west-1:036806565123:cluster/portal-batch4
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      args:
      - --region
      - eu-west-1
      - eks
      - get-token
      - --cluster-name
      - portal-batch4
      command: aws

kubectl run slu00-migstudent-tmrcvgku --restart=Never --rm -i --image=ldssa/batch5-slu00 --env LDSA_TOKEN=example --env PORTAL_TOKEN=example --env PORTAL_GRADING_URL=https://portal.lisbondatascience.org/grading/academy/grade/192/ --env PORTAL_CHECKSUM_URL=https://portal.lisbondatascience.org/grading/academy/checksums/SLU00/ --env DEPLOY_KEY="-----BEGIN PRIVATE KEY-----|example|-----END PRIVATE KEY-----" --env CODENAME=SLU00 --env USERNAME=buedaswag --env REPO_NAME=batch6-workspace
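
When printing a command like this from the Django shell for manual replay, joining the parts with a plain space loses quoting for values that contain spaces or shell metacharacters, such as DEPLOY_KEY. A sketch using `shlex.join` (Python 3.8+) is below. `kubectl_run_command` is a hypothetical stand-in that only mimics the shape of the command above; it is not how `AcademyKubernetesGrading.get_command` is implemented.

```python
import shlex


def kubectl_run_command(name: str, image: str, env: dict) -> list:
    """Assemble a 'kubectl run' argument list in the shape of the command above."""
    cmd = ["kubectl", "run", name, "--restart=Never", "--rm", "-i", f"--image={image}"]
    for key, value in env.items():
        cmd += ["--env", f"{key}={value}"]
    return cmd


cmd = kubectl_run_command(
    "slu00-example",  # placeholder pod name
    "ldssa/batch5-slu00",
    {
        "CODENAME": "SLU00",
        "DEPLOY_KEY": "-----BEGIN PRIVATE KEY-----|example|-----END PRIVATE KEY-----",
    },
)

# shlex.join quotes each argument so the string can be pasted into a shell.
cmd_str = shlex.join(cmd)
print(cmd_str)
```

Splitting `cmd_str` with `shlex.split` gives back the original list, which is a quick way to confirm nothing was mangled.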

kubectl describe configmaps django-configmap
kubectl apply -f configmaps/kubeconfig-configmap.yaml

# In the end I just needed to update the kubeconfig by applying the configmap in the portal deployment repo, replacing v1alpha1 with v1beta1
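
The v1alpha1 to v1beta1 change is a one-line edit in the kubeconfig. A minimal sketch of it as a text substitution is below; `upgrade_exec_api_version` is a hypothetical helper, and the actual fix was made by hand in the configmap and applied with kubectl.

```python
OLD_API = "client.authentication.k8s.io/v1alpha1"
NEW_API = "client.authentication.k8s.io/v1beta1"


def upgrade_exec_api_version(kubeconfig_text: str) -> str:
    """Switch the exec credential apiVersion from v1alpha1 to v1beta1.

    Only the fully qualified exec API string is replaced, so the top-level
    'apiVersion: v1' of the kubeconfig is left untouched.
    """
    return kubeconfig_text.replace(OLD_API, NEW_API)


sample = """apiVersion: v1
kind: Config
users:
- name: example
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: aws
"""

fixed = upgrade_exec_api_version(sample)
print(fixed)
```

Working on the raw text instead of parsing the YAML keeps the rest of the file, including key order and indentation, byte-for-byte identical.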

