Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 15 forks source link

Patch standardise sumstats column headers crossplatform (+QOL improvements) #174

Closed cfbeuchel closed 9 months ago

cfbeuchel commented 9 months ago

Hi! This is my second PR for the function standardise_sumstats_column_headers_crossplatform(). This PR includes the bugfix from the other PR plus some QOL improvements, at least for me. Feel free to merge either one (or none of course), whatever you think best.

Just to explain: I am using a custom mapping file for format_sumstats() and ran into continuous errors due to being a scatterbrain and maybe some ambiguities in the code that could be avoided. These are:

  1. The mapping table must be a data.frame
  2. The entries in mapping_file$Uncorrected must be in upper case

Point 1 is simply a problem when you, as I do, prefer to work in data.table, but a function explicitly requires a data.frame. Passing a data.table to the function will cause it to fail weirdly downstream, so I thought it useful to introduce a check. Something else might also work (maybe all(class(mapping_file) == "data.frame")), but at least this way, my common error of submitting a DT is caught early.

Point 2 is I think a more generally useful check because it is not mentioned anywhere explicitly. This is that the formatting check turns all column names to UPPERCASE. Therefore, entering the original column names (myBadCol or whatever) will not be corrected, because the function will only check for MYBADCOL in mapping_file$Uncorrected. Therefore, I thought it useful to turn everything in mapping_file to upper case when a custom table is supplied.

Let me know what you think!

Again, thanks for the work on the package, you've saved me a lot of time and headache.

Cheers, Carl

Al-Murphy commented 9 months ago

Hey!

I like the idea and happy to add in but two changes:

I wouldn't add a stop error if it isn't a dataframe like you have here:

stopifnot(
            "Mapping file must be a data.frame!" =
            !data.table::is.data.table(mapping_file)
        )

So we already have checks to make sure the mapping_file is a data.frame (at least super class, DT's are a subset of DFs) here and we also check there are two columns so it is fair to assume that we can just convert to a DF. This won't affect anything that is already a DF. So just change that line to:

mapping_file <- as.data.frame(mapping_file)

The second change is checking if it is a custom mapping file, you have a check for this just based on the number of rows:

        if(nrow(mapping_file) != nrow(sumstatsColHeaders)) {
      message(
          "Non-standard mapping file detected.",
          "Making sure all entries in `Uncorrected` are in upper case.")
      mapping_file$Uncorrected <- toupper(mapping_file$Uncorrected)
      }

I'd rather check if they are the exact same:

 if(!all.equal(mapping_file,sumstatsColHeaders)) {

Can you make these changes and then just have a look at the other PR you made and if you agree with my comment, revert that change in this PR too!

Cheers for your work on this and delighted you find the package useful! Alan.

cfbeuchel commented 9 months ago

Hi! I appreciate you suggestions. They are less ambiguous and I'll gladly implement them! Instead of reverting the changes here I'll just open a new PR from a new branch!