testing: How should we handle testing for v1.0

georgemarselis-nvi commented 1 year ago

possible solutions:

re-demultiplex all current runs in a separate directory and compare sha512 file-for-file between the current and the re-demultiplexed output (the most thorough check we can get, but also the most time-consuming: demultiplexing everything will take a day)

ask Catherine for 20? 30? N? significant runs and run them

don't handle it, yeet it live

in the case of the two, we can also use those for testing
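A minimal sketch of the first option (file-for-file sha512 comparison), assuming both outputs are ordinary directory trees. The function names and directory arguments are hypothetical and not part of the demultiplexing script. Note that compressed outputs may embed timestamps, in which case hashes can differ even when the sequence content is identical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha512_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large fastq files do not need to fit in memory."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its sha512 digest."""
    return {
        p.relative_to(root).as_posix(): sha512_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(current: Path, redone: Path) -> dict[str, list[str]]:
    """Report files missing from, extra in, or changed in the re-demultiplexed tree."""
    a, b = hash_tree(current), hash_tree(redone)
    return {
        "missing": sorted(a.keys() - b.keys()),
        "extra": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

An empty list under all three keys would mean the two trees are identical, both in names and in content.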

@karinlag @magnulei @CathrineAB

karinlag commented 1 year ago

Pick max 10, with the help of @CathrineAB. Make sure they are spread in time, different types of projects (or missing projects) etc.

I am not really that concerned about the file contents, I am more concerned about resulting file names and directory structures etc. Thus add a regression test on the naming of files, directories etc.
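A rough sketch of what such a naming regression test could look like: record the relative file and directory names of a known-good output once, then compare later outputs against that snapshot. All function names and the snapshot file are hypothetical; file contents are deliberately ignored.

```python
from __future__ import annotations

from pathlib import Path


def layout(root: Path) -> set[str]:
    """Relative names of everything under root; directories get a trailing slash."""
    return {
        p.relative_to(root).as_posix() + ("/" if p.is_dir() else "")
        for p in root.rglob("*")
    }


def write_snapshot(root: Path, snapshot_file: Path) -> None:
    """Store the known-good layout, one name per line."""
    snapshot_file.write_text("\n".join(sorted(layout(root))) + "\n")


def read_snapshot(snapshot_file: Path) -> set[str]:
    """Load a layout previously stored by write_snapshot."""
    return {line for line in snapshot_file.read_text().splitlines() if line}


def layout_diff(expected: set[str], actual: set[str]) -> tuple[list[str], list[str]]:
    """Return (names that disappeared, names that are new)."""
    return sorted(expected - actual), sorted(actual - expected)
```

The test passes for a run when `layout_diff` returns two empty lists.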

georgemarselis-nvi commented 1 year ago

sounds like a plan.

karinlag commented 1 year ago

Make it so! (sorry!)

georgemarselis-nvi commented 1 year ago

Today is a good day to release code.

Best Regards,

George Marselis

Linux Engineer

HPC and Bioinformatics @NVI


georgemarselis-nvi commented 1 year ago

[gmarselis@molly ~]$ cat /data/rawdata/listofruns
/data/rawdata/201120_M06578_0037_000000000-JC72F/
/data/rawdata/201202_M06578_0038_000000000-DB7BY/
/data/rawdata/201204_M06578_0039_000000000-JC24K/
/data/rawdata/201211_M06578_0040_000000000-JF8H4/
/data/rawdata/201218_M06578_0041_000000000-JF7TM/
/data/rawdata/210512_M06578_0051_000000000-JL3FM/
/data/rawdata/210521_M06578_0052_000000000-JKW2V/
/data/rawdata/211105_M06578_0072_000000000-JW2W6/
/data/rawdata/211129_M06578_0076_000000000-DDJ44/
/data/rawdata/211203_M06578_0077_000000000-K2Y5F/
/data/rawdata/211208_M06578_0078_000000000-JVWLG/
/data/rawdata/211214_M06578_0079_000000000-K5JWP/
/data/rawdata/211217_M06578_0080_000000000-K5F58/
/data/rawdata/220127_M06578_0084_000000000-K6MPK/
/data/rawdata/220131_M06578_0085_000000000-DF5PW/
/data/rawdata/220401_M06578_0096_000000000-DG39K/
/data/rawdata/220425_M06578_0099_000000000-K96W8/
/data/rawdata/220603_M06578_0105_000000000-KB7MY/
/data/rawdata/221031_M06578_0120_000000000-DGJ8C/
/data/rawdata/221103_M06578_0121_000000000-KN9M2/
/data/rawdata/221121_M06578_0123_000000000-KP366/
/data/rawdata/221128_M06578_0125_000000000-KPGRW/
/data/rawdata/221201_M06578_0126_000000000-KPMLC/
/data/rawdata/221201_M06578_0126_000000000-KPMLC/
/data/rawdata/221209_M06578_0127_000000000-KR85W/
/data/rawdata/221209_M06578_0127_000000000-KR85W/
/data/rawdata/221220_M06578_0128_000000000-KPJGC/
/data/rawdata/230113_M06578_0129_000000000-KT86C/
/data/rawdata/230120_NB552450_0013_AH5JCKAFX5/
/data/rawdata/230201_M06578_0133_000000000-KTG27/
/data/rawdata/230222_M06578_0135_000000000-KTTN9/
/data/rawdata/230310_M06578_0137_000000000-KV26Y/
/data/rawdata/230324_M06578_0138_000000000-KTTNW/
/data/rawdata/230328_M06578_0139_000000000-DHCMD/
/data/rawdata/230404_M06578_0140_000000000-KV3JW/
/data/rawdata/230413_M06578_0141_000000000-KV4JL/
/data/rawdata/230427_M06578_0144_000000000-GF5BT/

These are the test runs. Unfortunately, they take up all the space on my hard drive at work.

I asked Sten to buy me a 4TB nvme so I can test.

waiting on control

karinlag commented 1 year ago

Currently blocked on resources to do the testing.

georgemarselis-nvi commented 1 year ago

great news @everyone!

The tests succeeded. I have a couple of questions for @magnulei about some project naming (clean-chickenkilling?) and a small bug which I can fix on the weekend. Otherwise I am ready for version 1.0

Let me know when you want to release.

karinlag commented 1 year ago

Let's put that on the plan for Monday.

karin

-- Karin Lagesen, PhD Bioinformatician, Section for Epidemiology Norwegian Veterinary Institute


karinlag commented 1 year ago

@georgemarselis-nvi will write up docs on how the testing was done and why it was ok.

@georgemarselis-nvi and @magnulei and @CathrineAB will make a plan for how to deploy.

georgemarselis-nvi commented 1 year ago

How the testing was done:

Inside the /data/rawdata directory there is a file called "testruns" (/data/rawdata/testruns). That is the list of runs agreed upon with @magnulei

This list includes

multiple correct runs with both miseq and nextseq
runs with control projects
failed runs

The script below takes all the above runs and processes them through the demultiplexing script

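#!/bin/bash

clear
for run in $( < /data/rawdata/testruns ); do
    # basename makes this work whether testruns lists bare run IDs or full paths
    run_id=$( /usr/bin/basename "${run}" )
    # remove the output of any previous test of this run
    rm -rf "/data/demultiplex/${run_id}_demultiplex"
    rm -rf "/data/for_transfer/${run_id}"
    /usr/bin/python3 /data/bin/demultiplex_script.py "${run_id}"
done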

script is in /data/bin/test.sh

Why was it ok:

Because the visual/manual inspection of the /data/log/*.log files looks ok. All bugs seem fixed and the output seems normal and regular.

This should be automated in the future as part of CI/CD, but until then, manual inspection is what decides whether it is ok or not.
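As a first step towards automating that inspection, something like the sketch below could flag log lines that deserve a human look. It assumes the logs are plain text and that problems show up as lines containing "ERROR", "CRITICAL" or "Traceback"; those markers are an assumption, not something taken from the actual log format, and the function name is hypothetical.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumed markers of trouble; adjust to whatever the real logs contain.
SUSPICIOUS = re.compile(r"\b(ERROR|CRITICAL|Traceback)\b")


def suspicious_lines(log_dir: Path) -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for every suspicious line in *.log files."""
    hits = []
    for log_file in sorted(log_dir.glob("*.log")):
        with log_file.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if SUSPICIOUS.search(line):
                    hits.append((log_file.name, number, line.rstrip("\n")))
    return hits
```

An empty result does not prove the run was fine; it only narrows down what still has to be read by hand.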

georgemarselis-nvi commented 1 year ago

Plan for how to deploy:

Monday ( 2023-10-04 ) is a good day as there is no sequencing due to miseq being down.

I will copy the script by hand and run tests on

220131_M06578_0085_000000000-DF5PW
201120_M06578_0037_000000000-JC72F
220603_M06578_0105_000000000-KB7MY

if they behave like they did on my computer, we christen deployment of version 1.0 complete.

karinlag commented 1 year ago

I want a detailed list with steps of what to do, i.e. down to "cp filex location y", with an agreed upon time with M and C of when this should happen.

Also, I know this is kicked off by a cron job, will we edit this cron job to use a new name for this script, or will we do a slide-in replacement?

Also, is the cron script version controlled?

Also, the point is not in tests of what happens with existing runs; that should have been (and has been, afaik) tested before. The point is whether it picks up and runs a new run without any issues. That is the success criterion we are seeking.

georgemarselis-nvi commented 1 year ago

> I want a detailed list with steps of what to do, i.e. down to "cp filex location y"

> Also, I know this is kicked off by a cron job, will we edit this cron job to use a new name for this script, or will we do a slide-in replacement?

Kind of both. The original script was a symlink "current_demultiplexing_script.py" to the appropriate version, e.g. current_demultiplexing_script_v5.py . All versions were in the same directory, instead of being in version control.

The demultiplexing script, both original and to-be-released, is in two parts:

Part A detects new completed rawdata drops by reading the rawdata dir for the names of all directories that have a registered sequencer serial number^[1]. It then compares those directories against the contents of the /data/demultiplex directory, again restricted to names that have a registered sequencer serial number. The first difference it finds is sent to Part B as a new RunID. Arvind calculated that demultiplexing should not take more than half an hour and that NVI does not have multiple concurrent rawdata drops^[2]. The process repeats every half hour, so the lack of process control and of concurrency is balanced out by timing: the demultiplexing script gets half an hour between the start of one run and the next.
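The detection logic of Part A can be sketched roughly as below. This is an illustration under assumptions, not the actual code: the serial numbers are simply the two that appear in run names in this thread, the "_demultiplex" suffix is taken from the test script, RTAComplete.txt is used as the completion marker, and "first difference" is interpreted as the first run ID in sorted order. All function names are hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Assumed registry; the real list of registered sequencers lives elsewhere.
REGISTERED_SERIALS = {"M06578", "NB552450"}
DEMULTIPLEX_SUFFIX = "_demultiplex"


def serial_of(run_id: str) -> str | None:
    """Run IDs look like <date>_<serial>_<counter>_<flowcell>; return the serial."""
    parts = run_id.split("_")
    return parts[1] if len(parts) >= 4 else None


def registered_runs(directory: Path, suffix: str = "") -> set[str]:
    """Run IDs of subdirectories that carry a registered sequencer serial number."""
    runs = set()
    for entry in directory.iterdir():
        if not entry.is_dir() or not entry.name.endswith(suffix):
            continue
        run_id = entry.name[: len(entry.name) - len(suffix)]
        if serial_of(run_id) in REGISTERED_SERIALS:
            runs.add(run_id)
    return runs


def next_new_run(rawdata: Path, demultiplex: Path) -> str | None:
    """First completed run in rawdata that has no demultiplexed counterpart yet."""
    pending = registered_runs(rawdata) - registered_runs(demultiplex, DEMULTIPLEX_SUFFIX)
    complete = sorted(r for r in pending if (rawdata / r / "RTAComplete.txt").is_file())
    return complete[0] if complete else None
```

Directories without a registered serial number, such as "badruns", never enter the comparison, so they cannot make the detection abort early.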

I renamed Part B of the script to "demultiplexing_script.py", both in Arvind's and my own version, and added /data/bin to github. That was early last year.

The moment I type "git pull", the changes will be pulled in and the new demultiplexing script (Part B) will be in place.

There is no need to change anything in the cron script right now, or in Part A. All necessary changes in Part A have already been done, though Part A needs more looking after (later on) and should be incorporated into a single callable module (later on)^[2].

> Also, is the cron script version controlled?

/me points at url at top of browser . There is a copy of Arvind's _v{1-5}.py scripts, if you care to go back through the log. But I have cleaned up the directory structure since then.

> with an agreed upon time with M and C of when this should happen.

The exact time, down to the minute, has not yet been agreed upon. We just said "Monday after the sibyl meeting"

[1] This means you can have a billion other directories in there, but unless they have a registered serial number, they are ignored. This used to be an issue: directories such as "badruns" were read in, the script tried to find the RTAComplete.txt file in them, failed, and aborted since it thought it was looking at a non-completed sequencing run. Which means that the script never got to see any new rawdata drops, as it aborted too early.

[2] I would like to change Part A (and Part B) to being a daemon that watches the filesystem via dbus, but that's for later. That will allow for concurrent demultiplexing attempts and maybe a paper, later on.

georgemarselis-nvi commented 1 year ago

Deployment plan:

[x] we find a time where the sequencers are not running
[ ] cd into seqtech00:/data/bin
[ ] git pull

Positive control: the next time the sequencers run, the demultiplexing script runs on its own without human interference.

Regression tests on the script have already been done; they are described in the middle of this thread.

The demultiplexing script detects new runs to be processed by diffing the existing demultiplexed runs against the ones currently in /data/rawdata

The cron script/crontab is not changing, so new data is picked up automatically, the same way as it has been so far.

georgemarselis-nvi commented 1 week ago

Well, 1.0 is now delivered.

NorwegianVeterinaryInstitute / DemultiplexRawSequenceData

testing: How should we handle testing for v1.0 #57