Open JohnHadish opened 3 months ago
Here is a not very good fix, but it is faster than the alternative:
New fastq_merge.nf file:
/**
* This process merges the fastq files based on their sample_id number.
*/
process fastq_merge {
tag { sample_id }
container "systemsgenetics/gemmaker:2.1.0"
input:
tuple val(sample_id), path(fastq_files)
output:
tuple val(sample_id), path("${sample_id}_?.fastq"), emit: FASTQ_FILES
tuple val(sample_id), val(params.DONE_SENTINEL), emit: DONE_SIGNAL
script:
"""
echo "#TRACE sample_id=${sample_id}"
echo "#TRACE fastq_lines=`cat *.fastq | wc -l`"
# Use find to locate files matching the pattern in the current directory
# and count them for both potential paired end
file_count_1=`find . -maxdepth 1 -type f -name "*_1.fastq" | wc -l`
#file_count_2=`find . -maxdepth 1 -type f -name "*_2.fastq" | wc -l`
# Check the number of files. If there is only 1 there is no need to do the merge
if [ "\${file_count_1}" -gt 1 ] ; then
echo "There are two or more fastq files. Proceeding to merge"
merge_fastq.py --fastq_files ${fastq_files.join(" ")} --out_prefix ${sample_id}
else
echo "There is one of each file. No Need to merge, renaming instead"
# Move fatsq _11
cp *_1.fastq ${sample_id}_1.fastq
# This command only moves fastq _2 if it exists
if [ -f *_2.fastq ]; then
cp *_2.fastq ${sample_id}_2.fastq
echo "File _2 has been moved."
else
echo "File _2 does not exist. This means this sample is not paired"
fi
fi
"""
}
This checks if a sample has multiple files. If a sample only has 1 file, it copies that to the current directory. This can not see an edge case where there is only one _1.fastq and two _2.fastq files (but I have never seen this).
Issues with this new code and why I am still not happy:
The cp command is more efficent, but not super efficent. I cannot use the mv
command because it messes up cleanup step that we have, although it would be a lot more efficent.
A much better alternative would be to split the channel coming out of fastq_dump into files that need to be merged and those that do not. I do not have the time to do this though because it messes with the cleanup steps again and this takes awhile to move arround. @spficklin if you have time this would be a much needed improvement. Message me if you need more details.
Description of the bug
fastq_merge reads in files and writes them out even if they do not need to be merged. This results in 12 hour runtimes of fastqmerge where the file is not changed at all except that its name is different.
Example:
This impacts most experiments, as it is increasingly rare with modern sequencers to have multiple runs per experiment.
Command used and terminal output
No response
Relevant files
No response
System information
No response