RepeatModeler/Masker are an i/o problem

KatharinaHoff commented 6 months ago

RepeatModeler/RepeatMasker are extremely heavy on file system i/o. We will likely get a complaint from the HPC admin if we run them as currently implemented.

The problem is that these tools have to be executed from inside the directory where they write a lot of (temporary) files.

Outside of snakemake, we usually go to /tmp/{USER}/rm, copy the genome file there, and then execute from there. On snowball and batch, there is also the option to do the same in /dev/shm/rm, which is way faster. Both options keep the traffic on the node and do not add to the cluster-wide i/o load.
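The manual staging step described above could be sketched in Python roughly as follows. This is only an illustration, not what the workflow currently does: the function names `pick_scratch_root` and `stage_genome` and the `rm_` prefix are made up for this example, and the candidate directories are passed in by the caller because which node-local paths exist depends on the cluster.

```python
import os
import shutil
import tempfile
from pathlib import Path


def pick_scratch_root(candidates):
    """Return the first candidate directory that exists and is writable.

    Hypothetical helper: list the fastest location first, e.g. /dev/shm
    before a directory below /tmp.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir() and os.access(path, os.W_OK):
            return path
    raise RuntimeError("no writable node-local scratch directory found")


def stage_genome(genome, scratch_root, prefix="rm_"):
    """Copy the genome into a fresh working directory below scratch_root.

    Returns the new working directory and the path of the staged copy.
    """
    # mkdtemp creates a uniquely named directory, so parallel jobs on the
    # same node do not write into each other's files.
    workdir = Path(tempfile.mkdtemp(prefix=prefix, dir=scratch_root))
    staged = workdir / Path(genome).name
    shutil.copy2(genome, staged)
    return workdir, staged
```

A call such as `stage_genome("genome.fa", pick_scratch_root(["/dev/shm", "/tmp"]))` would then return the directory to run the tools in. Cleaning that directory up afterwards is left to the caller in this sketch.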

My snakemake workflow died when I tried to cd in the snakemake shell. But there may be other ways to change to the directory, e.g. calling a bash script or a python script from the snakemake shell that performs the cd and launches the tools.
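For the python script variant, one possible shape is sketched below. It relies on the `cwd` argument of `subprocess.run`, so only the launched tool changes directory and the calling process stays where it is. The function name `run_in_dir` is hypothetical, and output is captured here only to make the example easy to check; a real wrapper would probably let the tool write to the rule's log instead.

```python
import subprocess
import sys
from pathlib import Path


def run_in_dir(cmd, workdir):
    """Run cmd (a list of arguments) with workdir as its working directory.

    Only the child process changes directory; the caller's cwd is untouched.
    check=True raises CalledProcessError if the tool exits with an error.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(
        cmd, cwd=workdir, check=True, capture_output=True, text=True
    )
```

For example, `run_in_dir([sys.executable, "-c", "import os; print(os.getcwd())"], "/tmp/demo")` runs a child interpreter that reports the given directory as its current one.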

@claraptzsl please test whether any of these options work on a minimal toy example. If we can get changing to the execution directory to work, we can fix it here in the repeat masking rule, and that would make things a lot better.

KatharinaHoff commented 6 months ago

Minimal example of doing a cd in snakemake

Snakefile:

# Snakefile

rule my_rule:
    input:
        # Input files or wildcards
    output:
        # Output files
    shell:
        """
        ./run_task.sh ..
        """

run_task.sh:

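#!/bin/bash

# Print the directory the script was started from
echo "$PWD"
# Change to the desired directory; stop if that fails
cd "$1" || exit 1
# Print the directory again to confirm the change
echo "$PWD"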

Run it with:

snakemake -s Snakefile my_rule --cores 1

This works for me. We can probably wrap the RepeatModeler/RepeatMasker commands in such a bash script. The same applies to VARUS, but we won't implement that now. That would rather be a task for @StepanSaenko if he decides to build on this codebase.
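A wrapper of this kind also has to bring results back, because the files snakemake expects as rule output must end up outside the scratch directory. A rough Python sketch under that assumption follows; `run_and_collect` and its parameters are hypothetical names, and `wanted` is assumed to hold plain file names that the tool writes directly into its working directory.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_collect(cmd, scratch_root, wanted, dest):
    """Run cmd in a throwaway directory, then copy the wanted files to dest.

    Everything else the tool wrote is discarded together with the
    throwaway directory, so only the listed files reach the shared disk.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # TemporaryDirectory deletes the directory on exit, also if cmd fails.
    with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
        subprocess.run(cmd, cwd=workdir, check=True)
        for name in wanted:
            shutil.copy2(Path(workdir) / name, dest / name)
    return [dest / name for name in wanted]
```

The trade-off in this sketch is that intermediate files are lost if the tool fails, which keeps the node clean but makes debugging harder.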

KatharinaHoff commented 4 months ago

@StepanSaenko The varus container is available here: https://hub.docker.com/repository/docker/katharinahoff/varus-notebook/general

You need to be careful because of the chdir problem, but this issue outlines how to get around it.

KatharinaHoff / braker-snake

RepeatModeler/Masker are an i/o problem #6