ctg-lund / Yggdrasil

the backbone of the CTG pipelines

Yggdrasil crashes during demux #18

Closed Fattigman closed 4 months ago

Fattigman commented 1 year ago

What I tried to do: Run Yggdrasil on a functional Illumina v2 samplesheet. The command I ran:

/projects/fs1/shared/external-tools/nextflow/latest/nextflow -bg run /projects/fs1/shared/Yggdrasil/main.nf --samplesheet /projects/fs1/nas-sync/upload/230616_A00681_0881_BH3MV7DSX7/CTG_SampleSheet_S4_230615.csv --rawdata /projects/fs1/nas-sync/upload/230616_A00681_0881_BH3MV7DSX7 --outdir /projects/fs1/shared/Jobs/ -profile ctg > test.log

What happened: The demultiplex process crashed with the message:

Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 458752 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
Dumping diagnostics....
Initial crash reason: Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 458752 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 65536 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
WARNING: Could not write replay file /var/log/bcl-convert/dragen_replay_1687269966492_98223.json: /var/log/bcl-convert/dragen_replay_1687269966492_98223.json: cannot open file
DRAGEN replay file saved to /var/log/bcl-convert/dragen_replay_1687269966492_98223.json
Initial crash reason: Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 458752 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 196608 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
sh: 1: cannot create /var/log/bcl-convert/dragen_info_1687269966492_98223.log: Directory nonexistent
DRAGEN registers saved to /var/log/bcl-convert/dragen_info_1687269966492_98223.log
Initial crash reason: Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 458752 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 196608 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
Hang diagnostic saved to /var/log/bcl-convert/hang_diag_1687269966492_98223.txt
sh: 1: cannot create /var/log/bcl-convert/pstack_1687269966684_98223.log: Directory nonexistent
/bin/sh: 1: cannot create /var/log/bcl-convert/pstack_1687269966684_98223.log: Directory nonexistent
pstack saved to /var/log/bcl-convert/pstack_1687269966684_98223.log
terminate called after throwing an instance of 'AioReturnFailed'
terminate called recursively
terminate called recursively
terminate called recursively
/projects/fs1/shared/Nextflow/fc/c6d2ec9de63971b4c154132644c03e/.command.sh: line 2: 98223 Aborted                 (core dumped) bcl-convert --bcl-input-directory 230616_A00681_0881_BH3MV7DSX7 --output-directory . --force --sample-sheet CTG_SampleSheet_S4_230615.csv --bcl-sampleproject-subdirectories true --strict-mode true --bcl-only-matched-reads true --bcl-num-parallel-tiles 16

It looks like bcl-convert thinks there is too little disk space for a demux, but there is more than enough space available!
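One way to double-check the "Check disk space" hint is to measure free space on the filesystem that actually holds the Nextflow workdir (bcl-convert writes to `.`, not to `--outdir`), and to confirm that the directories it writes to exist and are writable. The log above shows that /var/log/bcl-convert does not exist where bcl-convert runs. This is a minimal sketch, assuming a POSIX-style `df`. The function names and any threshold passed in are hypothetical and not part of Yggdrasil.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and thresholds are hypothetical, not part of Yggdrasil.

# Print the free space (in KiB) on the filesystem that holds the given directory.
free_kib() {
    df -Pk -- "$1" 2>/dev/null | awk 'NR == 2 { print $4 }'
}

# Succeed if the path is an existing directory the current user can write to.
is_writable_dir() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Report whether a directory is writable and has at least need_kib KiB free.
check_dir() {
    local dir=$1 need_kib=$2 free
    if ! is_writable_dir "$dir"; then
        echo "NOT WRITABLE OR MISSING: $dir"
        return 1
    fi
    free=$(free_kib "$dir")
    if [ -z "$free" ] || [ "$free" -lt "$need_kib" ]; then
        echo "LOW: $dir has ${free:-unknown} KiB free, need ${need_kib} KiB"
        return 1
    fi
    echo "OK: $dir has ${free} KiB free"
}
```

For example, `check_dir /projects/fs1/shared/Nextflow 1073741824` would ask for 1 TiB free on the workdir filesystem, and `is_writable_dir /var/log/bcl-convert`, run in the same environment as bcl-convert, would show whether the diagnostics directory is usable.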

Fattigman commented 1 year ago

This is the command used by Yggdrasil:

bcl-convert     --bcl-input-directory 230616_A00681_0881_BH3MV7DSX7     --output-directory .     --force     --sample-sheet CTG_SampleSheet_S4_230615.csv     --bcl-sampleproject-subdirectories true     --strict-mode true     --bcl-only-matched-reads true     --bcl-num-parallel-tiles 16

For some reason, neither the samplesheet nor the runfolder gets symlinked into the workdir. Furthermore, no output can be found in the workdir either.

Fattigman commented 1 year ago

workdir: /projects/fs1/shared/Nextflow/b0/492d59b5ce54a5c014ebae9e979dc2

Fattigman commented 1 year ago

Running with stub just copies the whole flowcell into the workdir?!

/projects/fs1/shared/external-tools/nextflow/latest/nextflow -bg run /projects/fs1/shared/Development_Github/Yggdrasil/main.nf --samplesheet /projects/fs1/nas-sync/upload/230616_A00681_0881_BH3MV7DSX7/CTG_SampleSheet_S4_230615.csv --rawdata /projects/fs1/nas-sync/upload/230616_A00681_0881_BH3MV7DSX7 --outdir /projects/fs1/shared/Jobs/ -profile ctg -stub > test.log

Fattigman commented 1 year ago

I won't continue debugging this until @lokeshbio or @chaetognatha can provide a working example.

chaetognatha commented 1 year ago

I think the first thing to try would be to rebuild the bcl-convert image, put it in the correct location under shared/containers, and then test that directly. I'll see if I have time to do that today.

Fattigman commented 1 year ago

I think the problem lies more with the nextflow code. As I stated earlier:

For some reason neither the path of the samplesheet or runfolder gets symlinked to the workdir.

The bclconvert process dumps the data elsewhere than the workdir, where it should land.

Unless I can get a working example, it seems to me there is a weird interaction between nextflow and bclconvert.

However, I am all for cleaning up the container directory!

chaetognatha commented 1 year ago

You may still be right, but I found, for example, that we are using Test_Jobs as the default root for new jobs and for containers, which is incorrect. I would like our future container path to be Shared/Containers, but the legacy location for all the containers is shared/ctg-containers, and changing that would break legacy. We might as well wait until we get COSMOS-SENS and can plan everything from the ground up; the expectation is that legacy won't work there anyway.

I am working on correcting these paths and working through the config atm and am hoping to be done in a little bit, then I will move on and update here as I go!

Fattigman commented 1 year ago

Sounds good!

chaetognatha commented 1 year ago

The project you tried initially has now been running for a while, but I realize it is rather big so I also started a run using only the minimal test data project.

chaetognatha commented 1 year ago

The image seems to work fine. It looks like the multiqc image that I added is broken, so I will update that one and retest, @Fattigman.

Fattigman commented 1 year ago

Now it crashed again for the larger 230616 run:


Initial crash reason: Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 524288 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
Exception thrown in ../src/host/dragen_api/file_io/async_io.cpp line 765 -- Wrote 581632 bytes instead of expected 1048576 bytes.  Check disk space. Aborted on 6th partially-complete I/O.  i=0
Dumping diagnostics....
Sample sheet being processed by common lib? Yes
SampleSheet Settings:
  CreateFastqForIndexReads = 0

shared-thread-linux-native-asio output is disabled
bcl-convert Version 00.000.000.4.0.3
Copyright (c) 2014-2022 Illumina, Inc.
Command Line: --bcl-input-directory 230616_A00681_0881_BH3MV7DSX7 --output-directory . --force --sample-sheet CTG_SampleSheet_S4_230615.csv --bcl-sampleproject-subdirectories true --strict-mode true --bcl-only-matched-reads true --bcl-num-parallel-tiles 16
Conversion Begins.
# CPU hw threads available: 20
Parallel Tiles: 16. Threads Per Tile: 1
SW compressors: 20
SW decompressors: 10
SW FASTQ compression level: 1
WARNING: Could not write replay file /var/log/bcl-convert/dragen_replay_1687350251432_171770.json: /var/log/bcl-convert/dragen_replay_1687350251432_171770.json: cannot open file
DRAGEN replay file saved to /var/log/bcl-convert/dragen_replay_1687350251432_171770.json
sh: 1: cannot create /var/log/bcl-convert/dragen_info_1687350251432_171770.log: Directory nonexistent
DRAGEN registers saved to /var/log/bcl-convert/dragen_info_1687350251432_171770.log
Hang diagnostic saved to /var/log/bcl-convert/hang_diag_1687350251432_171770.txt
sh: 1: cannot create /var/log/bcl-convert/pstack_1687350251508_171770.log: Directory nonexistent
pstack saved to /var/log/bcl-convert/pstack_1687350251508_171770.log
/bin/sh: 1: terminate called after throwing an instance of 'cannot create /var/log/bcl-convert/pstack_1687350251508_171770.log: Directory nonexistent
AioReturnFailed'
terminate called recursively
/projects/fs1/shared/Nextflow/a6/1e0022ebbffa8b10cc860a6b50ae80/.command.sh: line 2: 171770 Aborted                 (core dumped) bcl-convert --bcl-input-directory 230616_A00681_0881_BH3MV7DSX7 --output-directory . --force --sample-sheet CTG_SampleSheet_S4_230615.csv --bcl-sampleproject-subdirectories true --strict-mode true --bcl-only-matched-reads true --bcl-num-parallel-tiles 16

Fattigman commented 1 year ago

Again, comparing with the 221111 run, we can see that the data has been correctly set up with symlinks.

But there are no symlinks for the 230616 run.

I can't make any sense out of it.
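A quick way to compare the two workdirs is to list what was actually staged in each. This is a minimal sketch, assuming `find` with `-maxdepth` and `readlink` are available (as on typical Linux systems). The function name is hypothetical, and the argument is whichever task workdir is being inspected.

```shell
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical, not part of Yggdrasil.

# List every symlink directly inside a task workdir together with its target,
# and flag links whose target does not exist.
list_staged_links() {
    local workdir=$1 link target
    find "$workdir" -maxdepth 1 -type l -print | sort | while IFS= read -r link; do
        target=$(readlink -- "$link")
        if [ -e "$link" ]; then
            printf '%s -> %s\n' "${link##*/}" "$target"
        else
            printf '%s -> %s (BROKEN)\n' "${link##*/}" "$target"
        fi
    done
}
```

Running it on the 221111 workdir and on the 230616 workdir side by side would show whether the samplesheet and runfolder links are missing entirely (no output) or present but pointing at a path that does not resolve (marked BROKEN).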

chaetognatha commented 1 year ago

image

Very strange.

To replicate this working result, just run:

nextflow run Yggdrasil/ --rawdata Test_Jobs/Test_Data/SeqOnly/221111_VH00947_17_AACGHWHM5/ --samplesheet Test_Jobs/Test_Data/SeqOnly/CTG_SampleSheet.csv --output Test_Jobs/Test_Out

While testing, I did discover that for some reason the --output parameter is not working properly.

chaetognatha commented 1 year ago

I am wondering if maybe there is a problem with your nextflow binary or the packages you need to load to use it.

I am using nextflow version 22.10.6.5844

Fattigman commented 1 year ago

Mine is 22.04.5.5709. Can you send me the path to your binary?

chaetognatha commented 1 year ago

It was from my ~/Scripts so I copied it to shared/shared-scripts

chaetognatha commented 1 year ago

I think I also run Java 11.0.2 via lmod

chaetognatha commented 1 year ago

we definitely need an sbatch script for Yggdrasil that loads the right modules and ensures we have the right binaries
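A minimal sketch of the version check such a wrapper could perform, assuming GNU `sort -V` is available. The function names are hypothetical, and the minimum version used in the usage note is simply the one reported as working earlier in this thread, not a documented requirement.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of Yggdrasil.

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Pull the first dotted version number out of a string such as
# "nextflow version 22.10.6.5844".
extract_version() {
    printf '%s\n' "$1" | grep -Eo '[0-9]+(\.[0-9]+)+' | head -n 1
}
```

Inside an sbatch script this could be used after loading the modules, for example `version_ge "$(extract_version "$(nextflow -v)")" 22.10.6.5844 || exit 1`, so that a run with an older binary fails early with a clear reason. The exact module names to load depend on the cluster and are not known from this thread.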

chaetognatha commented 1 year ago

I see now that it should be --outdir instead of --output, my bad!

Fattigman commented 1 year ago

Let's check it out with --outdir instead!

we definitely need an sbatch script for Yggdrasil that loads the right modules and ensures we have the right binaries

Agree! But this should maybe be done in the bash script that initializes Yggdrasil in cron?

chaetognatha commented 1 year ago

That would make the most sense, I could write one now and put it in Yggdrasil/bin

Fattigman commented 1 year ago

Nvm, I'm stupid, I thought you meant for my script...

chaetognatha commented 1 year ago

Nvm Im stupid, I thought you meant for my script...

--outdir? No, unfortunately that was just my bad when I was testing.

Fattigman commented 1 year ago

It crashed again with your binary with the same error. I guess we will postpone the deployment of Yggdrasil until we can get stable demultiplexing.