Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats
75 stars 16 forks source link

Improve chromosome standardization #157

Closed mykmal closed 1 year ago

mykmal commented 1 year ago

This update makes the following three changes, as also listed in the NEWS.md file:

Initially my plan was to only fix the bug in the third bullet point, but I ended up rewriting the entire check_chr() function to improve how the CHR column is standardized. I think that the new function is more efficient and handles some additional edge cases that could trip up the current version of MSS.

I tested the updated package on my GWAS data and it appears to work as expected, but I'm happy to explain my code in more detail and/or make any additional changes if needed.

Al-Murphy commented 1 year ago

Thanks for your input! Your changes make a lot of sense and do help clean up that function. Really not sure why I never used the rmv_chrPrefix parameter. I have run checks locally and it all seems fine to me.- Just two small changes before I merge:

  • Checking individual file sizes...
    • WARNING: Package files exceed the 5MB size limit.

It appears one file you added to the git (and then likely removed after but is still in history? Is too large. Could you have a look at removing the file causing the issue from git history? I have info on doing that here You may find it cleaner to just clone the repo and put all your changes and resubmit a pull request? Up to you!

Alan.

mykmal commented 1 year ago

Thank you for carefully reviewing my (very first) pull request! Below are responses to each of your points.

CHR standards

In your opinion, is that standard to have X rather than chrX - can you point to somewhere this is the standard?

Good catch! Here are the commonly used standards:

> GenomeInfoDb::genomeStyles("Homo sapiens")
   circular  auto   sex NCBI  UCSC dbSNP Ensembl
1     FALSE  TRUE FALSE    1  chr1   ch1       1
2     FALSE  TRUE FALSE    2  chr2   ch2       2
3     FALSE  TRUE FALSE    3  chr3   ch3       3
4     FALSE  TRUE FALSE    4  chr4   ch4       4
5     FALSE  TRUE FALSE    5  chr5   ch5       5
6     FALSE  TRUE FALSE    6  chr6   ch6       6
7     FALSE  TRUE FALSE    7  chr7   ch7       7
8     FALSE  TRUE FALSE    8  chr8   ch8       8
9     FALSE  TRUE FALSE    9  chr9   ch9       9
10    FALSE  TRUE FALSE   10 chr10  ch10      10
11    FALSE  TRUE FALSE   11 chr11  ch11      11
12    FALSE  TRUE FALSE   12 chr12  ch12      12
13    FALSE  TRUE FALSE   13 chr13  ch13      13
14    FALSE  TRUE FALSE   14 chr14  ch14      14
15    FALSE  TRUE FALSE   15 chr15  ch15      15
16    FALSE  TRUE FALSE   16 chr16  ch16      16
17    FALSE  TRUE FALSE   17 chr17  ch17      17
18    FALSE  TRUE FALSE   18 chr18  ch18      18
19    FALSE  TRUE FALSE   19 chr19  ch19      19
20    FALSE  TRUE FALSE   20 chr20  ch20      20
21    FALSE  TRUE FALSE   21 chr21  ch21      21
22    FALSE  TRUE FALSE   22 chr22  ch22      22
23    FALSE FALSE  TRUE    X  chrX   chX       X
24    FALSE FALSE  TRUE    Y  chrY   chY       Y
25     TRUE FALSE FALSE   MT  chrM  chMT      MT

Notice that the NCBI and Ensembl styles are identical, and match the format that MSS enforces when rmv_chrPrefix is TRUE. On the other hand, the UCSC style includes a "chr" prefix and codes mitochondrial DNA as "chrM" (without the T). In light of this, do you think it would make more sense to replace the rmv_chrPrefix parameter with a chr_style parameter that lets users choose between the NCBI/Ensembl and UCSC styles? (I've never encountered the dbSNP style in GWAS, but I could also provide that option if you want.)

File size checks

It appears one file you added to the git (and then likely removed after but is still in history? Is too large.

Weird, I'm not sure what is causing that. Once we decide on how to handle chromosome styles, I'll go ahead and re-commit my changes in a clean branch, as you suggested.

Contributor list

Secondly, this is optional but I think you should add your name to the list of contributors in the README with a link to your github page.

Thanks for inviting me to join the list of contributors! I started actively using MSS this summer and find it very convenient, so this will likely not be my last contribution.

Al-Murphy commented 1 year ago

Thank you for carefully reviewing my (very first) pull request!

Thank you very much for contributing! People doing so really takes (at least a little of) the burden of upkeep of the package off me and ensures it remains useful going forward.

In light of this, do you think it would make more sense to replace the rmv_chrPrefix parameter with a chr_style parameter that lets users choose between the NCBI/Ensembl and UCSC styles? (I've never encountered the dbSNP style in GWAS, but I could also provide that option if you want.)

I think adding a chr_style parameter instead is a great idea - can you set the default to Ensembl/NCBI and add an explanation and example of each style in the documentation? Probably best to also mention that this used to be rmv_chrPrefix in that parameter documentation too. Then just enforce that the user picks one of these 3 in the validate_parameters function? rmv_chrPrefix was actually never passed to this function but it should have been too.

Weird, I'm not sure what is causing that. Once we decide on how to handle chromosome styles, I'll go ahead and re-commit my changes in a clean branch, as you suggested.

Great, thanks!!

Thanks for inviting me to join the list of contributors! I started actively using MSS this summer and find it very convenient, so this will likely not be my last contribution.

Good to hear you find it useful, always happy to add in new features!

Cheers, Alan.

mykmal commented 1 year ago

Closing this in order to open a clean PR.