fastq_merge is very inefficient

Description of the bug

fastq_merge reads files in and writes them back out even when they do not need to be merged. This results in 12-hour runtimes for fastq_merge in cases where the file is not changed at all, apart from its name.

Example:

SRR000000_1.fastq is the only run in an experiment and is 50G.
This file is passed to fastq_merge
There are no other files besides SRR000000_1.fastq
fastq_merge will merge all files in the directory without checking how many files there are. It does this by reading in all files, merging them, and writing a single new file out under the new name (i.e. SRX00000_1.fastq).
In the above case of 1 file, the process takes 12 hours because the Python code is inefficient and tries to hold everything in memory.
The only difference is that the file is now named SRX00000_1.fastq

This impacts most experiments, as it is increasingly rare with modern sequencers to have multiple runs per experiment.

Command used and terminal output

No response

Relevant files

No response

System information

No response

Here is a not very good fix, but it is faster than the alternative:

New fastq_merge.nf file:

/**
 * This process merges the fastq files based on their sample_id number.
 */
process fastq_merge {
    tag { sample_id }
    container "systemsgenetics/gemmaker:2.1.0"

    input:
    tuple val(sample_id), path(fastq_files)

    output:
    tuple val(sample_id), path("${sample_id}_?.fastq"), emit: FASTQ_FILES
    tuple val(sample_id), val(params.DONE_SENTINEL), emit: DONE_SIGNAL

    script:
    """
    echo "#TRACE sample_id=${sample_id}"
    echo "#TRACE fastq_lines=`cat *.fastq | wc -l`"

    # Use find to count the files in the current directory that match the
    # _1 pattern. The _2 count is currently unused, so it is commented out.
    file_count_1=`find . -maxdepth 1 -type f -name "*_1.fastq" | wc -l`
    #file_count_2=`find . -maxdepth 1 -type f -name "*_2.fastq" | wc -l`

    # Check the number of files. If there is only 1 there is no need to do the merge
    if [ "\${file_count_1}" -gt 1 ] ; then
        echo "There are two or more fastq files. Proceeding to merge"
        merge_fastq.py --fastq_files ${fastq_files.join(" ")} --out_prefix ${sample_id}
    else
        echo "There is one of each file. No need to merge, copying under the new name instead"

        # Copy the _1 fastq file to its new name
        cp *_1.fastq ${sample_id}_1.fastq

        # Only copy the _2 fastq file if it exists
        if [ -f *_2.fastq ]; then
                cp *_2.fastq ${sample_id}_2.fastq
                echo "File _2 has been copied."
        else
                echo "File _2 does not exist. This means this sample is not paired"
        fi
    fi
    """
}

This checks whether a sample has multiple files. If a sample has only 1 file, it copies that file to its new name in the current directory. It does not detect the edge case where there is only one _1.fastq file and two _2.fastq files (but I have never seen this).
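If that edge case ever matters, the check could count both mates before deciding. Below is a minimal sketch of that idea, not tested inside the pipeline. The function names `count_mate_files` and `merge_action` are hypothetical and not part of GEMmaker. It also assumes the inputs may be staged as symbolic links (depending on the stage-in mode), so it counts links as well as regular files.

```shell
# Hypothetical helpers (not part of GEMmaker): decide whether the fastq
# files in a directory need a real merge or only a copy under a new name.

# Count the fastq files for one mate.
#   $1 = directory to search
#   $2 = mate number (1 or 2)
count_mate_files() {
    # Assumption: inputs may be symlinks, so match links as well as files.
    find "$1" -maxdepth 1 \( -type f -o -type l \) -name "*_$2.fastq" \
        | wc -l | tr -d ' '
}

# Print "merge" if either mate has more than one file, otherwise "copy".
#   $1 = directory to search
merge_action() {
    n1=$(count_mate_files "$1" 1)
    n2=$(count_mate_files "$1" 2)
    if [ "$n1" -gt 1 ] || [ "$n2" -gt 1 ]; then
        echo "merge"
    else
        echo "copy"
    fi
}
```

Inside the process script, `merge_action .` would take the place of the `file_count_1` test: run merge_fastq.py when it prints "merge", and fall through to the cp branch otherwise. Shell variables would still need the `\$` escape in the Nextflow script block.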

Issues with this new code and why I am still not happy: The cp command is more efficient, but not super efficient. I cannot use the mv command because it messes up the cleanup step that we have, although it would be a lot more efficient.

A much better alternative would be to split the channel coming out of fastq_dump into files that need to be merged and those that do not. I do not have the time to do this though, because it interferes with the cleanup steps again and takes a while to move around. @spficklin if you have time this would be a much needed improvement. Message me if you need more details.
