Need the ability to skip md5 check for some files in snapshot

stevekm commented 6 months ago

I am trying to avoid taking the md5sum for some files in my snapshot.

I found this example of one potential method that was shared online here, which I have modified as follows:


nextflow_pipeline {
    name "Test main Pipeline"
    script "main.nf"

    test("Should run without failures") {

        when {
            params {
                // NOTE: make sure 'outdir' is defined inside the JSON!
                load("$baseDir/examples/params.small.json")
            }
        }

        then {
            assert workflow.success

            // Suffixes of files whose hashes are inconsistent between runs
            def exclude_suffix = [".html", "_complete", "_invocation",
            "_outs", "_vdrkill", "_args",
            "_jobinfo", "_log", "_stderr", "_stdout",
            "_chunk_defs", "_stage_defs", "_disabled",
            "_cmdline", "_filelist", "_finalstate", "_jobmode", "_mrosource", "_perf", "_sitecheck",
            "_tags", "_timestamp", "_uuid", "_versions"]

            assert snapshot(
                workflow,
                path("${params.outdir}")
                        .list()
                        .collect { getRecursiveFileNames(it, "${params.outdir}") }
                        .flatten()
                        // Keep only names that end with none of the excluded suffixes.
                        // (A `return` inside `each` ends only that iteration, not the loop, so use `any`.)
                        .findAll { name ->
                            !exclude_suffix.any { suffix -> name.toString().endsWith(suffix) }
                        }
            ).match()

        }

    }
}

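// Returns file names relative to outputDir as plain strings
// (a nested list for directories, hence the flatten() above).
def getRecursiveFileNames(fileOrDir, outputDir) {
    if(file(fileOrDir.toString()).isDirectory()) {
        return fileOrDir.list().collect { getRecursiveFileNames(it, outputDir) }
    }
    return fileOrDir.toString().replace("${outputDir}/", "")
}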

It works to exclude the files with the listed suffixes, but the snapshot now contains only a list of file names, with no md5s for the remaining files. I also realized that what I really want is to exclude only the md5 for the files with inconsistent hashes, rather than removing those files from the snapshot entirely. I'm not sure how to implement that. Could this be built into nf-test directly?

stevekm commented 6 months ago

I think this is related to this issue https://github.com/askimed/nf-test/issues/116; however, the main difference is that I still want to check for the existence of the files, just not their md5.

GallVp commented 4 months ago

I was in a somewhat similar situation and resorted to the following logic for the orthofinder module:

import groovy.io.FileType

.
.
.

assert process.success

def all_files = []

file(process.out.orthofinder[0][1]).eachFileRecurse (FileType.FILES) { file ->
    all_files << file
}

def all_file_names = all_files.collect { it.name }.sort(false)

def stable_file_names = [
    'Statistics_PerSpecies.tsv',
    'SpeciesTree_Gene_Duplications_0.5_Support.txt',
    'SpeciesTree_rooted.txt'
]

def stable_files = all_files.findAll { it.name in stable_file_names }

assert snapshot(
    all_file_names,
    stable_files,
    process.out.versions[0]
).match()
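
The same split (names for every file, checksums only for files with stable content) can be tried outside nf-test in plain Groovy. This is only a rough sketch: `md5Of` and `splitForSnapshot` are hypothetical helper names, not part of nf-test, and the `.html` suffix and temp files are made-up test data.

```groovy
import groovy.io.FileType
import java.nio.file.Files
import java.nio.file.Path
import java.security.MessageDigest

// Hypothetical helper: hex-encoded MD5 of a file's bytes.
String md5Of(Path p) {
    MessageDigest.getInstance('MD5').digest(Files.readAllBytes(p)).encodeHex().toString()
}

// Hypothetical helper. Returns a map with two entries:
//   names  - sorted relative names of ALL files under root (existence check)
//   hashes - relative name -> md5, only for files NOT ending in an unstable suffix
Map splitForSnapshot(Path root, List<String> unstableSuffixes) {
    def files = []
    root.toFile().eachFileRecurse(FileType.FILES) { files << it.toPath() }

    def names = files.collect { root.relativize(it).toString() }.sort()

    def hashes = files
        .findAll { f -> !unstableSuffixes.any { f.fileName.toString().endsWith(it) } }
        .collectEntries { [(root.relativize(it).toString()): md5Of(it)] }
        .sort()   // sort by key so the result does not depend on traversal order

    [names: names, hashes: hashes]
}

// Usage with made-up data: one stable file and one that changes on every run.
def root = Files.createTempDirectory('snap')
Files.write(root.resolve('counts.tsv'), 'a\t1\n'.bytes)
Files.write(root.resolve('report.html'), "generated ${System.nanoTime()}".toString().bytes)

def snap = splitForSnapshot(root, ['.html'])
println snap.names           // both files are listed by name
println snap.hashes.keySet() // only counts.tsv carries a checksum
```

Sorting both collections keeps the result independent of the order in which the file system returns entries, which matters when the result is compared against a stored snapshot.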

edmundmiller commented 1 month ago

https://github.com/nf-core/nft-utils has this functionality

lukfor commented 1 month ago

I think nft-utils is the best way to implement this logic, so I am closing this issue.

askimed / nf-test

Need the ability to skip md5 check for some files in snapshot #211