SystemsGenetics / GEMmaker

A workflow for construction of Gene Expression count Matrices (GEMs). Useful for Differential Gene Expression (DGE) analysis and Gene Co-Expression Network (GCN) construction
https://gemmaker.readthedocs.io/en/latest/
MIT License
33 stars 16 forks source link

fastq_merge is very inefficent #288

Open JohnHadish opened 3 months ago

JohnHadish commented 3 months ago

Description of the bug

fastq_merge reads in files and writes them out even if they do not need to be merged. This results in 12 hour runtimes of fastqmerge where the file is not changed at all except that its name is different.

Example:

This impacts most experiments, as it is increasingly rare with modern sequencers to have multiple runs per experiment.

Command used and terminal output

No response

Relevant files

No response

System information

No response

JohnHadish commented 3 months ago

Here is a not very good fix, but it is faster than the alternative:

New fastq_merge.nf file:

/**
 * This process merges the fastq files based on their sample_id number.
 */
process fastq_merge {
    tag { sample_id }
    container "systemsgenetics/gemmaker:2.1.0"

    input:
    tuple val(sample_id), path(fastq_files)

    output:
    tuple val(sample_id), path("${sample_id}_?.fastq"), emit: FASTQ_FILES
    tuple val(sample_id), val(params.DONE_SENTINEL), emit: DONE_SIGNAL

    script:
    """
    echo "#TRACE sample_id=${sample_id}"
    echo "#TRACE fastq_lines=`cat *.fastq | wc -l`"

    # Use find to locate files matching the pattern in the current directory
    # and count them for both potential paired end
    file_count_1=`find . -maxdepth 1 -type f -name "*_1.fastq" | wc -l`
    #file_count_2=`find . -maxdepth 1 -type f -name "*_2.fastq" | wc -l`

    # Check the number of files. If there is only 1 there is no need to do the merge
    if [ "\${file_count_1}" -gt 1 ] ; then
        echo "There are two or more fastq files. Proceeding to merge"
        merge_fastq.py --fastq_files ${fastq_files.join(" ")} --out_prefix ${sample_id}
    else
        echo "There is one of each file. No Need to merge, renaming instead"

        # Move fatsq _11
        cp *_1.fastq ${sample_id}_1.fastq

        # This command only moves fastq _2 if it exists
        if [ -f *_2.fastq ]; then
                cp *_2.fastq ${sample_id}_2.fastq
                echo "File _2 has been moved."
        else
                echo "File _2 does not exist. This means this sample is not paired"
        fi
    fi
    """
}

This checks if a sample has multiple files. If a sample only has 1 file, it copies that to the current directory. This can not see an edge case where there is only one _1.fastq and two _2.fastq files (but I have never seen this).

Issues with this new code and why I am still not happy: The cp command is more efficent, but not super efficent. I cannot use the mv command because it messes up cleanup step that we have, although it would be a lot more efficent.

A much better alternative would be to split the channel coming out of fastq_dump into files that need to be merged and those that do not. I do not have the time to do this though because it messes with the cleanup steps again and this takes awhile to move arround. @spficklin if you have time this would be a much needed improvement. Message me if you need more details.