MD5s for automated processing re-run do not match

erflynn commented 1 month ago

@syellamilli identified that while all the md5s match when re-running other steps, this is not the case for automated processing.

I looked into this:

the .tsvs/.csv do have the same .md5 and are not an issue
the .RDS are different because of the timestamps in the @command slot (if you delete these and write them out, the md5s are identical)
the .pdf are different because of the timestamps in the PDFs. If I write them out as pngs, the md5s are identical.

I don't know if we want to update the objects to not contain the date stamps, and switch to pngs? Something worth considering, and definitely important in regression testing.

dtm2451 commented 1 month ago

That's frustrating! I think for the RDS files it may be fine to remove the @command slot... I've certainly never knowingly used it at least. For .pdfs though... I'm betting there are at least a few plots that we'd rather keep as pdf than switch to png if not for this issue.

Maybe there's a way to remove pdf timestamps after the fact? If there is a way to do that, I wonder if we'd want to run it in the testing framework only versus if there would be any side-effects here too.

erflynn commented 1 month ago

yeah this is definitely something to consider. I did a quick google and this was all I could find for suppressing timestamps in R -- not sure if it's actually advisable.

for removing PDF timestamps -- these tools seem promising, but I'm not sure how much this is worth investigating outside of testing?

dtm2451 commented 1 month ago

For pdfs, yea I think this could just be used in the testing side. Not sure if there is a way to run a script only in the testing framework? (that finds pdfs in the test directory and uses one of those tools you linked to strip metadata, thus allowing comparisons of pdfs with the same content to pass) They are small enough that near-duplicate metis data blocks for pdfs doesn't seem like a huge deal.

For RDS objects, I do wonder if it might be okay and worth it to maintain a wrapper on top of saveRDS that performs timestamp stripping in known cases. Both of these pros seem only a minor gain, but we do need SOME solution to allow the testing system work.

pro: metis space saving. Metis uses md5 when checking if a data block is new. If an md5 changes, the original data block will not be re-used.
pro: possibly noticeable time saving for tests to just always use. If test path RDS files become numerous or large, folding this into the testing framework step would take a noticeable amount of time for load -> fix -> re-save. Doing it before the original save would be negligible. On the other hard, this time save might be negligible compared to pipeline run times!
con?: I don't know if we need them for anything, so maybe we should ask in a modeling meeting before merge to main

dtm2451 commented 1 month ago

Something like this could be the saveRDS wrapper for removing timestamps only from any Seurat object.

saveRDS_ <- function(x, file, ...) {
    if (is(x, "Seurat") {
        for (i in seq_along(x@commands)) {
            x@commands[[i]]@time.stamp <- as.POSIXct(NA)
        }
    }
    saveRDS(x, file, ...)
}

erflynn commented 1 month ago

Oo thanks @dtm2451 -- I like this. I agree we should add and then confirm at a DMM before merging.

erflynn commented 1 month ago

other files where md5s do not match (so far, still waiting on automated_processing step to complete re-run with both R versions):

freemuxlet .clust.vcf.gz -- this contains a fileDate string, e.g. ##fileDate=20940717. Perhaps we can change this or remove it post-hoc
freemuxlet .samples.gzip -- I re-run gzip on this after extracting the samples .txt, and gzip produces a timestamp. This is easily removed by switching to gzip -n

dtm2451 commented 1 month ago

Hmmmm... EVERY file being exactly the same is a pretty high bar that leaves a lot of room for external package updates to cause test failures. Just logging that it MIGHT be a better idea to make a targeted list of key elements we want to ensure are the same.

erflynn commented 1 month ago

that's a good pt -- I'm not sure what we want here, but doing this side-by-side to make sure there are no unexplained differences across pipeline versions so that we can migrate to the newer version (for aipi) and to triple check / make sure we didnt accidentally introduce a bug in the pipeline. I dont think we want all these files in the tests, but we will want to check some of them?

erflynn commented 1 month ago

version-ing note: using different RSingleCell containers but the same pipeline yields slightly results from doubletFinder output scores and .rds, as well as downstream affects from this. The affects are limited though. Not sure how to assess this?

(prior steps are the sa, but have different md5s because of different R versions -- but objects are identical)

erflynn commented 1 month ago

fyi -- confirming that "new pipeline" with v3 R container yields identical results to "old pipeline" @syellamilli @dtm2451

UCSF-DSCOLAB / data_processing_pipelines

MD5s for automated processing re-run do not match #83