UCSF-DSCOLAB / data_processing_pipelines

A repository to store the existing pipelines to process the various CoLabs datasets
0 stars 1 forks source link

MD5s for automated processing re-run do not match #83

Open erflynn opened 1 month ago

erflynn commented 1 month ago

@syellamilli identified that while all the md5s match when re-running other steps, this is not the case for automated processing.

I looked into this:

I don't know if we want to update the objects to not contain the date stamps, and switch to pngs? Something worth considering, and definitely important in regression testing.

dtm2451 commented 1 month ago

That's frustrating! I think for the RDS files it may be fine to remove the @command slot... I've certainly never knowingly used it at least. For .pdfs though... I'm betting there are at least a few plots that we'd rather keep as pdf than switch to png if not for this issue.

Maybe there's a way to remove pdf timestamps after the fact? If there is a way to do that, I wonder if we'd want to run it in the testing framework only versus if there would be any side-effects here too.

erflynn commented 1 month ago

yeah this is definitely something to consider. I did a quick google and this was all I could find for suppressing timestamps in R -- not sure if it's actually advisable.

for removing PDF timestamps -- these tools seem promising, but I'm not sure how much this is worth investigating outside of testing?

dtm2451 commented 1 month ago

For pdfs, yea I think this could just be used in the testing side. Not sure if there is a way to run a script only in the testing framework? (that finds pdfs in the test directory and uses one of those tools you linked to strip metadata, thus allowing comparisons of pdfs with the same content to pass) They are small enough that near-duplicate metis data blocks for pdfs doesn't seem like a huge deal.

For RDS objects, I do wonder if it might be okay and worth it to maintain a wrapper on top of saveRDS that performs timestamp stripping in known cases. Both of these pros seem only a minor gain, but we do need SOME solution to allow the testing system work.

dtm2451 commented 1 month ago

Something like this could be the saveRDS wrapper for removing timestamps only from any Seurat object.

saveRDS_ <- function(x, file, ...) {
    if (is(x, "Seurat") {
        for (i in seq_along(x@commands)) {
            x@commands[[i]]@time.stamp <- as.POSIXct(NA)
        }
    }
    saveRDS(x, file, ...)
}
erflynn commented 1 month ago

Oo thanks @dtm2451 -- I like this. I agree we should add and then confirm at a DMM before merging.

erflynn commented 1 month ago

other files where md5s do not match (so far, still waiting on automated_processing step to complete re-run with both R versions):

dtm2451 commented 1 month ago

Hmmmm... EVERY file being exactly the same is a pretty high bar that leaves a lot of room for external package updates to cause test failures. Just logging that it MIGHT be a better idea to make a targeted list of key elements we want to ensure are the same.

erflynn commented 1 month ago

that's a good pt -- I'm not sure what we want here, but doing this side-by-side to make sure there are no unexplained differences across pipeline versions so that we can migrate to the newer version (for aipi) and to triple check / make sure we didnt accidentally introduce a bug in the pipeline. I dont think we want all these files in the tests, but we will want to check some of them?

erflynn commented 1 month ago

version-ing note: using different RSingleCell containers but the same pipeline yields slightly results from doubletFinder output scores and .rds, as well as downstream affects from this. The affects are limited though. Not sure how to assess this?

(prior steps are the sa, but have different md5s because of different R versions -- but objects are identical)

erflynn commented 1 month ago

fyi -- confirming that "new pipeline" with v3 R container yields identical results to "old pipeline" @syellamilli @dtm2451