Closed georgemarselis-nvi closed 1 week ago
Pick max 10, with the help of @CathrineAB. Make sure they are spread in time, different types of projects (or missing projects) etc.
I am not really that concerned about the file contents, I am more concerned about resulting file names and directory structures etc. Thus add a regression test on the naming of files, directories etc.
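A regression test on names only could be sketched like this. It is a minimal sketch, assuming a known-good output tree is kept next to the freshly produced one; the function names `tree_names` and `naming_diff` are hypothetical and not part of the demultiplexing script.

```python
from pathlib import Path


def tree_names(root):
    """Return the sorted relative paths of everything under root.

    Directories get a trailing "/" so a file and a directory with the
    same name are not confused. File contents are never read.
    """
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() + ("/" if p.is_dir() else "")
        for p in root.rglob("*")
    )


def naming_diff(expected_root, actual_root):
    """Compare two output trees by name only.

    Returns (missing, unexpected): names present only in the expected
    tree, and names present only in the actual tree.
    """
    expected = set(tree_names(expected_root))
    actual = set(tree_names(actual_root))
    return sorted(expected - actual), sorted(actual - expected)
```

A run would pass the naming check when both returned lists are empty.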
sounds like a plan.
Make it so! (sorry!)
Today is a good day to release code.
Best Regards,
George Marselis
Linux Engineer
HPC and Bioinformatics @NVI
From: Karin Lagesen @.***> Sent: Wednesday, April 26, 2023 7:30:56 PM To: NorwegianVeterinaryInstitute/DemultiplexRawSequenceData Cc: Marselis, George; Author Subject: Re: [NorwegianVeterinaryInstitute/DemultiplexRawSequenceData] testing: How should we handle testing for v1.0 (Issue #57)
Make it so! (sorry!)
[gmarselis@molly ~]$ cat /data/rawdata/listofruns
/data/rawdata/201120_M06578_0037_000000000-JC72F/
/data/rawdata/201202_M06578_0038_000000000-DB7BY/
/data/rawdata/201204_M06578_0039_000000000-JC24K/
/data/rawdata/201211_M06578_0040_000000000-JF8H4/
/data/rawdata/201218_M06578_0041_000000000-JF7TM/
/data/rawdata/210512_M06578_0051_000000000-JL3FM/
/data/rawdata/210521_M06578_0052_000000000-JKW2V/
/data/rawdata/211105_M06578_0072_000000000-JW2W6/
/data/rawdata/211129_M06578_0076_000000000-DDJ44/
/data/rawdata/211203_M06578_0077_000000000-K2Y5F/
/data/rawdata/211208_M06578_0078_000000000-JVWLG/
/data/rawdata/211214_M06578_0079_000000000-K5JWP/
/data/rawdata/211217_M06578_0080_000000000-K5F58/
/data/rawdata/220127_M06578_0084_000000000-K6MPK/
/data/rawdata/220131_M06578_0085_000000000-DF5PW/
/data/rawdata/220401_M06578_0096_000000000-DG39K/
/data/rawdata/220425_M06578_0099_000000000-K96W8/
/data/rawdata/220603_M06578_0105_000000000-KB7MY/
/data/rawdata/221031_M06578_0120_000000000-DGJ8C/
/data/rawdata/221103_M06578_0121_000000000-KN9M2/
/data/rawdata/221121_M06578_0123_000000000-KP366/
/data/rawdata/221128_M06578_0125_000000000-KPGRW/
/data/rawdata/221201_M06578_0126_000000000-KPMLC/
/data/rawdata/221201_M06578_0126_000000000-KPMLC/
/data/rawdata/221209_M06578_0127_000000000-KR85W/
/data/rawdata/221209_M06578_0127_000000000-KR85W/
/data/rawdata/221220_M06578_0128_000000000-KPJGC/
/data/rawdata/230113_M06578_0129_000000000-KT86C/
/data/rawdata/230120_NB552450_0013_AH5JCKAFX5/
/data/rawdata/230201_M06578_0133_000000000-KTG27/
/data/rawdata/230222_M06578_0135_000000000-KTTN9/
/data/rawdata/230310_M06578_0137_000000000-KV26Y/
/data/rawdata/230324_M06578_0138_000000000-KTTNW/
/data/rawdata/230328_M06578_0139_000000000-DHCMD/
/data/rawdata/230404_M06578_0140_000000000-KV3JW/
/data/rawdata/230413_M06578_0141_000000000-KV4JL/
/data/rawdata/230427_M06578_0144_000000000-GF5BT/
these are the test runs. Unfortunately, they take up all the space on my hard drive at work.
I asked Sten to buy me a 4TB nvme so I can test.
waiting on control
Currently blocked on resources to do the testing.
great news @everyone!
The tests succeeded. I have a couple of questions for @magnulei about some project naming (clean-chickenkilling?) and a small bug which I can fix on the weekend. Otherwise I am ready for version 1.0
Let me know when you want to release.
let's put that on the plan for Monday.
karin
-- Karin Lagesen, PhD Bioinformatician, Section for Epidemiology Norwegian Veterinary Institute
On 17.08.2023 18:58, George Marselis @NVI wrote:
great news @everyone!
The tests succeeded. I got a couple of questions for @magnulei about some project naming (clean-chickenkilling?) and a small bug which i can fix on the weekend. Otherwise I am ready for version 1.0
Let me know when you want to release.
@georgemarselis-nvi will write up docs on how the testing was done and why it was ok.
@georgemarselis-nvi and @magnulei and @CathrineAB will make a plan for how to deploy.
How the testing was done:
Inside the /data/rawdata directory there is a file called "testruns" (/data/rawdata/testruns). That is the list of runs agreed upon with @magnulei
This list includes
the script below takes all the above runs and processes them through the demultiplexing script
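#!/bin/bash
clear
# Re-run demultiplexing for every run listed in the agreed test list.
for run in $( < /data/rawdata/testruns ); do
    # Remove output left over from a previous test so each run starts clean.
    rm -rf "/data/demultiplex/${run}_demultiplex"
    rm -rf "/data/for_transfer/${run}"
    /usr/bin/python3 /data/bin/demultiplex_script.py "$( /usr/bin/basename "${run}" )"
done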
The script is in /data/bin/test.sh
Why was it ok:
Because the visual/manual inspection of the /data/log/*.log looks ok. All bugs seem fixed and the output seems normal and regular.
This should be automated in the future as part of CI/CD, but until then manual inspection is what decides whether it is ok or not.
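As a first step toward automating that inspection, something like the following could flag log lines worth a closer look. This is only a sketch under assumptions: the marker strings are hypothetical, since the actual format of the files in /data/log is not described here, and `suspicious_lines` is an illustrative name, not an existing function.

```python
from pathlib import Path

# Hypothetical markers; adjust to whatever the real logs contain.
# "Traceback (most recent call last)" is what Python prints for an
# uncaught exception, assuming stderr ends up in the log file.
SUSPICIOUS_MARKERS = ("Traceback (most recent call last)", "ERROR", "CRITICAL")


def suspicious_lines(log_dir, markers=SUSPICIOUS_MARKERS):
    """Return (file name, line number, line) for every flagged log line."""
    findings = []
    for log_file in sorted(Path(log_dir).glob("*.log")):
        with open(log_file, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    findings.append((log_file.name, number, line.rstrip("\n")))
    return findings
```

An empty result would not prove a run is fine; it only narrows down what still needs a human look.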
Plan for how to deploy:
Monday ( 2023-10-04 ) is a good day as there is no sequencing due to the MiSeq being down.
I will copy the script by hand and run tests on
220131_M06578_0085_000000000-DF5PW
201120_M06578_0037_000000000-JC72F
220603_M06578_0105_000000000-KB7MY
if they behave like they did on my computer, we christen deployment of version 1.0 complete.
I want a detailed list with steps of what to do, i.e. down to "cp filex location y", with an agreed upon time with M and C of when this should happen.
Also, I know this is kicked off by a cron job, will we edit this cron job to use a new name for this script, or will we do a slide-in replacement?
Also, is the cron script version controlled?
Also, the clue is not in tests of what happens with existing runs; that should be (and has been, afaik) tested before. The clue is whether it picks up and runs a new run without any issues. That is the success criterion we are seeking.
> I want a detailed list with steps of what to do, i.e. down to "cp filex location y"
> Also, I know this is kicked off by a cron job, will we edit this cron job to use a new name for this script, or will we do a slide-in replacement?
Kind of both. The original script was a symlink "current_demultiplexing_script.py" to the appropriate version, e.g. current_demultiplexing_script_v5.py. All versions were in the same directory, instead of being in version control.
The demultiplexing script, both original and to-be-released, is in two parts:
Part A detects newly completed rawdata drops by reading the rawdata dir for the names of all the directories that have a registered sequencer serial number^[1]. It then compares those directories against the contents of the /data/demultiplex directory, again restricted to names that have a registered sequencer serial number. The first difference it finds is sent to Part B as a new RunID. Arvind calculated that demultiplexing should not take more than half an hour and that NVI does not have multiple concurrent rawdata drops^[2]. The process repeats every half hour, so the lack of process control and of concurrency is balanced out by timing: each demultiplexing run gets half an hour before the next one can start.
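The detection logic of Part A could be sketched roughly as below. This is not the actual Part A code, only an illustration under assumptions: the serial numbers are the two that appear in the run names in this thread, the "_demultiplex" suffix is taken from test.sh, the RTAComplete.txt check is inferred from footnote [1], and all function names are hypothetical.

```python
from pathlib import Path

# Assumed values, taken from run names and test.sh in this thread.
REGISTERED_SERIALS = {"M06578", "NB552450"}
DEMUX_SUFFIX = "_demultiplex"


def serial_of(run_id):
    """Return the second "_"-separated field of a run name, or None."""
    parts = run_id.split("_")
    return parts[1] if len(parts) >= 2 else None


def registered_runs(directory, suffix=""):
    """Return run IDs of subdirectories whose name carries a registered serial.

    Anything else (e.g. a "badruns" directory) is ignored.
    """
    runs = set()
    for entry in Path(directory).iterdir():
        if not entry.is_dir() or not entry.name.endswith(suffix):
            continue
        run_id = entry.name[: len(entry.name) - len(suffix)]
        if serial_of(run_id) in REGISTERED_SERIALS:
            runs.add(run_id)
    return runs


def next_new_run(rawdata_dir, demultiplex_dir):
    """Return the first completed run that has no demultiplex output yet."""
    done = registered_runs(demultiplex_dir, DEMUX_SUFFIX)
    for run_id in sorted(registered_runs(rawdata_dir) - done):
        # Assumption: a finished sequencing run contains RTAComplete.txt.
        if (Path(rawdata_dir) / run_id / "RTAComplete.txt").is_file():
            return run_id
    return None
```

Returning a single RunID per invocation mirrors the "first difference" behaviour described above; a half-hourly cron job would then call this once per cycle.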
I renamed Part B of the script to "demultiplexing_script.py", both in Arvind's and my own version, and added /data/bin to GitHub. That was early last year.
The moment I type "git pull", the changes will be pulled in and the new demultiplexing script (Part B) will be in place.
There is no need to change anything in the cron script or Part A right now. All necessary changes in Part A have already been done, though Part A needs more looking after (later on) and should be incorporated into a single callable module (later on)^[2].
> Also, is the cron script version controlled?
/me points at url at top of browser . There is a copy of Arvind's _v{1-5}.py scripts, if you care to go back through the log. But I have cleaned up the directory structure since then.
> with an agreed upon time with M and C of when this should happen.
The exact time down to the minute has not yet been agreed upon. We just said "Monday after the sibyl meeting"
[1] This means you can have a billion other directories in there, but unless they have a registered serial number, they are ignored. This used to be an issue: directories such as "badruns" were read in, the script tried to find the RTAComplete.txt file in them, failed, and aborted since it thought it was looking at a non-completed sequencing run. That meant the script never got to see any new rawdata drops, as it aborted too early.
[2] I would like to change Part A (and Part B) to being a daemon that watches the filesystem via dbus, but that's for later. That will allow for concurrent demultiplexing attempts and maybe a paper, later on.
Deployment plan:
The positive control is that, the next time the sequencers run, the demultiplexing script runs on its own without human intervention.
Regression tests on the script have already been done; they can be found in the middle of this thread.
The demultiplexing script detects new runs to be processed by diffing the existing demultiplexed runs against the ones currently in /data/rawdata
Cron script/crontab is not changing, hence the automatic picking up of data, same way as we have done so far.
Well, 1.0 is now delivered.
possible solutions:
re-demultiplex all current runs in a separate directory, compare sha512 file-for-file between current and re-demultiplexed (most optimal about what we will get, but most time-consuming, demultiplexing everything will take a day)
ask Catherine for 20? 30? N? significant runs and run them
don't handle it, yeet it live
in the case of the two, we can also use those for testing
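The file-for-file sha512 comparison from the first option could be sketched as follows. This is a minimal sketch, assuming the current and the re-demultiplexed output live in two separate directory trees with the same layout; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex sha512 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path relative to root to its sha512 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha512_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_trees(current_root, redone_root):
    """Report files missing on either side and files whose content differs."""
    current = tree_digests(current_root)
    redone = tree_digests(redone_root)
    shared = current.keys() & redone.keys()
    return {
        "only_in_current": sorted(current.keys() - redone.keys()),
        "only_in_redone": sorted(redone.keys() - current.keys()),
        "content_differs": sorted(n for n in shared if current[n] != redone[n]),
    }
```

The two "only_in" lists also cover the naming concern, since a renamed file shows up once on each side.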
@karinlag @magnulei @CathrineAB