NBISweden / Earth-Biogenome-Project-pilot

Assembly and Annotation workflows for analysing data in the Earth Biogenome Project pilot project.
https://www.earthbiogenome.org/
GNU General Public License v3.0

optimize PURGEDUPS_SPLITFA process #98

Open MartinPippel opened 4 months ago

MartinPippel commented 4 months ago

Is your feature request related to a problem? Please describe.

Not really, but this step could potentially run faster if it did not copy the compressed file. It also seems that the following line:

def useGzip = !( assembly instanceof List ? assembly.every{ it.name.endsWith(".gz") } : assembly.name.endsWith(".gz") )

is not doing what it's supposed to do (probably due to the negation symbol at the beginning?). In my case it's copying the compressed file:

cat hifiasm-raw-default.asm.bp.p_ctg.fasta.gz > MYSPECIESNAME_hifiasm-purged-default_hap0.merged.fasta.gz

Describe the solution you'd like

I think Dengfeng's split_fa script can read compressed files directly from stdin (see here), so the PURGEDUPS_SPLITFA process could potentially be reduced to:

script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    cat ${prefix}.merged.fasta.gz | split_fa $args - | gzip -c > ${prefix}.split.fasta.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        purgedups: \$( purge_dups -h |& sed '3!d; s/.*: //' )
    END_VERSIONS
    """

One potential pitfall might be a user who also specifies the - argument through the external arguments (task.ext.args), since it would then appear twice on the split_fa command line.
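A guard against that pitfall could be sketched roughly as follows. This assumes the external arguments arrive as one whitespace-separated string, as $args does in the script block; the function name args_contain_stdin_dash and the example flag --some-flag are hypothetical and not part of purge_dups or the pipeline.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds (exit status 0) if the argument string
# contains a standalone "-" token, i.e. an explicit "read from stdin" operand.
args_contain_stdin_dash() {
    local -a tokens=()
    local token
    # Split the string on whitespace without triggering glob expansion.
    read -r -a tokens <<< "$1"
    for token in "${tokens[@]}"; do
        if [ "$token" = "-" ]; then
            return 0
        fi
    done
    return 1
}

# Example: decide whether the process should append its own "-".
user_args="--some-flag"   # hypothetical value of task.ext.args
if args_contain_stdin_dash "$user_args"; then
    stdin_operand=""
else
    stdin_operand="-"
fi
```

With a check like this, the process would add - only when the user has not already supplied it.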

mahesh-panchal commented 4 months ago

It also seems that the following line:

def useGzip = !( assembly instanceof List ? assembly.every{ it.name.endsWith(".gz") } : assembly.name.endsWith(".gz") )

is not doing what it's supposed to do (probably due to the negation symbol at the beginning?). In my case it's copying the compressed file:

cat hifiasm-raw-default.asm.bp.p_ctg.fasta.gz > MYSPECIESNAME_hifiasm-purged-default_hap0.merged.fasta.gz

This part is correct. Concatenating gzip files results in a valid gzipped file.
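This can be checked quickly on the command line. The following is a minimal sketch with made-up file names and sequences: two files are compressed separately, concatenated byte-wise without recompression, and the result is then tested and decompressed with standard gzip.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Two tiny, hypothetical FASTA files, each compressed on its own.
printf '>ctg1\nACGT\n' | gzip -c > "$workdir/a.fasta.gz"
printf '>ctg2\nTTGA\n' | gzip -c > "$workdir/b.fasta.gz"

# Plain concatenation of the compressed files, as the process does with cat.
cat "$workdir/a.fasta.gz" "$workdir/b.fasta.gz" > "$workdir/merged.fasta.gz"

# gzip -t verifies integrity; gzip -dc decompresses every member in order.
gzip -t "$workdir/merged.fasta.gz"
merged=$(gzip -dc "$workdir/merged.fasta.gz")
expected=$(printf '>ctg1\nACGT\n>ctg2\nTTGA\n')

if [ "$merged" = "$expected" ]; then
    echo "concatenated gzip decompresses to the concatenated input"
fi

rm -rf "$workdir"
```

The gzip format allows a file to consist of several compressed members, which is why the cat output is valid. Read this way, the negation in useGzip looks intentional: compression is only needed when the inputs are not already all gzipped.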