explodecomputer opened this issue 5 years ago
The basic workflow: 1) extract the relevant columns from harmonised.bcf back to a tabular file, read it into R, and join it with the reference allele frequency column from the reference data's tabular file; 2) calculate the metrics, read the metadata metrics file and the mrbase metadata, and render the plots and report.
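The join in step 1 can be sketched as follows. This is an illustration in Python (the pipeline reads into R), and the column names (`chrom`, `pos`, `af`) are assumptions, not the real schema of the harmonised bcf or the reference file:

```python
# Join extracted GWAS rows with the reference allele frequency column.
# Column names (chrom, pos, af) are illustrative assumptions, not the
# actual schema of the harmonised bcf or the reference tabular file.

def join_reference_af(gwas_rows, ref_rows):
    """Attach the reference allele frequency to each GWAS row by chrom:pos."""
    ref_af = {(r["chrom"], r["pos"]): r["af"] for r in ref_rows}
    joined = []
    for row in gwas_rows:
        row = dict(row)  # avoid mutating the caller's rows
        row["ref_af"] = ref_af.get((row["chrom"], row["pos"]))
        joined.append(row)
    return joined
```

Rows with no matching reference position get a missing (`None`) frequency rather than being dropped, so downstream QC can count them.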
The things I think need refactoring:
- The input gwas can optionally be a tabular file.
Don't work too hard on making it very general - I think generally what will happen is
- The metadata metrics file will be deprecated and the metadata metrics will instead be read from the bcf header; if the input gwas is a tabular file, the metadata section in the report will be empty. Until we have uniform rules on which metrics to include in the bcf, all header information will be included in the report.
Sounds good
- Regarding the reference data, what I read from your earlier email is that this is uniquely associated with each gwas? In that sense, what are the reference datasets associated with the uk biobank gwas?
The reference data is essentially a list of positions in the genome with the canonical reference allele based on that version of the human genome. It is there to harmonise the uk bb gwas so that the effect allele is always the non-reference allele. See here https://ieugit-scmv-d0.epi.bris.ac.uk/ml18692/gwas_harmonisation
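The harmonisation idea described above can be sketched like this. The field names (`ea`, `nea`, `beta`, `eaf`) are illustrative assumptions, not the gwas_harmonisation implementation:

```python
# Sketch of orienting a GWAS row to the non-reference allele: if the
# effect allele matches the genome's reference allele, swap the alleles,
# flip the sign of the effect, and complement the allele frequency.
# Field names (ea, nea, beta, eaf) are illustrative assumptions.

def orient_to_nonref(row, ref_allele):
    row = dict(row)
    if row["ea"] == ref_allele:
        row["ea"], row["nea"] = row["nea"], row["ea"]
        row["beta"] = -row["beta"]
        row["eaf"] = 1.0 - row["eaf"]
    return row
```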
- I think the systematic QC metrics can be added to the report module as a sub-component, and produced as a "qc-metrics.json" which can then be read into the rmarkdown report, or used as-is. To systematically QC the 20,000 uk biobank gwas we can produce a qc-metrics.json for each of them and then aggregate these metrics together.
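The per-study qc-metrics.json aggregation could look something like this sketch. It assumes each file holds a flat dict of metrics and that the study id can be taken from the directory name, which is a guess at the layout rather than the real pipeline structure:

```python
import json
from pathlib import Path

# Sketch of aggregating per-study "qc-metrics.json" files into one table
# (a list of dicts). Assumes each file is a flat JSON object of metrics
# and that the parent directory name identifies the study -- both are
# illustrative assumptions about the layout.

def aggregate_qc_metrics(paths):
    rows = []
    for p in paths:
        metrics = json.loads(Path(p).read_text())
        metrics["study"] = Path(p).parent.name
        rows.append(metrics)
    return rows
```

The aggregated rows can then be written out as a single table for the metaQC step.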
Sounds sensible
- Regarding Phil's point on allowing user access to the reports, is there any potentially sensitive information that should only be included in a separate internal version? In the current dev branch I use Hmisc::describe for the summary stats, and I am thinking of adding a preview of the head/tail of the input data.
I think there's generally not going to be anything sensitive
Chris, in response to your points:
1. There is a sample size column now for the legacy data. None of the previously formatted datasets have ncase and ncontrol, so I can't do this unless somebody reformats those datasets. We could try to do it for datasets going forwards.
2. There were 2.5m SNPs in the GIANT BMI dataset because it was a hapmap2 study. The metadata in the bcf files has the info about how many variants are read, how many are removed for different reasons, how many are left, etc.
3. What the bcf conversion does now is: 1. align snps to a reference (so it checks chr, pos, alleles), and 2. for all numeric columns, set non-finite values to missing. My worry about the qc.py script is that it might remove a lot of things that we want to keep, e.g. indels, multi-allelic variants, variants that don't have an rs id, etc. If you could identify the specific things related to the numeric columns that need to be done, that would be very useful.
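The "set non-finite values to missing" step described above amounts to something like this minimal sketch (not the actual conversion code):

```python
import math

# Sketch of cleaning a numeric column: any value that is not a finite
# number (nan, +/-inf, or a non-numeric entry) becomes missing (None).
# Purely illustrative of the step described above.

def clean_numeric(values):
    return [v if isinstance(v, (int, float)) and math.isfinite(v) else None
            for v in values]
```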
Finally, there is no harm in doing this iteratively. If we version the reports then, as long as it is producing the main things to begin with (basically what Yi has implemented already – most important are lambda and the qq plot) and it is functioning reliably, we can incrementally add extra qc metrics over time. My strong preference would be to have the basic programmatic interface completed first and then add extra functionality afterwards. This means we can get on with the rest of the pipeline.
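For reference, the genomic inflation factor (lambda) mentioned above can be computed from the test statistics; a minimal sketch from z-scores (beta / se), using the fact that the median of a chi-square with 1 df is ≈ 0.4549 under the null:

```python
import statistics

# Minimal sketch of the genomic inflation factor (lambda GC) from
# z-scores (beta / se): lambda = median(z^2) / 0.4549, where 0.4549
# is the median of the chi-square distribution with 1 degree of freedom.

def lambda_gc(z_scores):
    return statistics.median(z * z for z in z_scores) / 0.4549
```

A value near 1 suggests no inflation; values well above 1 suggest confounding or population stratification.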
Hi @explodecomputer thanks for the new data and opening the discussion. It would be good to have a centralised place (maybe https://github.com/MRCIEU/mr-base-structure or a public repo) to document 1) the overall architecture of the gwas pipeline, and 2) each component of the pipeline and their inputs / outputs.
I have finished refactoring the report module, and regarding @elswob's ukbiobank gwas, once we can hopefully settle on the bcf format for that dataset it will be easy to compute systematic qc metrics for the whole dataset.
ps. the beta column is called "EFFECT" in data.bcf and I will use this format till we have a new format in place.
Thanks Gib.
Another thing we have to do is produce a metaQC report at the very end that includes and compares the QC metrics for all studies in one file, e.g. multiple scatter plots comparing the QC metrics against sample size across all the studies. IMO this will be a very effective way of spotting outliers and problematic studies.
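One simple way to operationalise the outlier-spotting idea above is a z-score rule across studies; a sketch with an illustrative threshold (the metric, threshold, and rule are assumptions, not a decided design):

```python
import statistics

# Sketch of the metaQC idea: flag studies whose value of a given QC
# metric is an outlier relative to the other studies, using a simple
# z-score rule. The threshold is an illustrative choice.

def flag_outliers(metric_by_study, threshold=3.0):
    values = list(metric_by_study.values())
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [study for study, v in metric_by_study.items()
            if sd > 0 and abs(v - mean) / sd > threshold]
```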
Thanks @YiLiu6240. Another change that was made over the weekend after discussion with @mcgml was that it might be easier to not have the `gwas.*` fields, because they are liable to change, and just have the `gwas.id` so that you can retrieve the info from the API (e.g. for study 2: http://apitest.mrbase.org/gwasinfo/2)
@mightyphil2000 agree, can you create a separate issue so that it is registered as a thing to do? I think that might be the way to organise what needs to be done actually - if we just make a separate issue for every new feature
I'll try to put things into a more structured overview today
@explodecomputer ok will do sorry!
thanks @mightyphil2000 i'm excited
so am I @explodecomputer
How do I add people to an issue?
@mightyphil2000 i think you can assign people by clicking the cog symbol next to assignees, which may be intended to mean that those people will implement it. but other than that i think everyone who has access to this repository can see the issues
ok did you see the new issue created? is it beautiful?
love it
At which point in the QC pipeline does the number of monomorphic SNPs, or SNPs with missing or nonsense values, get calculated? Will it be possible to include this information in the QC report? @explodecomputer I think Chris should be part of this discussion but I can't add him to this issue?
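For concreteness, the counts asked about above could be computed along these lines. The field names (`eaf`, `beta`, `se`, `pval`) are illustrative assumptions about the row schema:

```python
import math

# Sketch of counting monomorphic variants (allele frequency exactly 0
# or 1) and rows with missing or non-finite ("nonsense") values.
# Row field names (eaf, beta, se, pval) are illustrative assumptions.

def qc_counts(rows):
    counts = {"monomorphic": 0, "bad_values": 0}
    for row in rows:
        if row.get("eaf") in (0.0, 1.0):
            counts["monomorphic"] += 1
        if any(row.get(k) is None or not math.isfinite(row[k])
               for k in ("eaf", "beta", "se", "pval")):
            counts["bad_values"] += 1
    return counts
```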
Here is an updated example of a BCF file on bluecrystal4:
Ultimately we would like to have a report for all `gwas-files/*/data.bcf`. We would also like the report to be run on new datasets as they are uploaded, which will be in a separate location.

Changes in the bcf file format:
- Metadata fields beginning with `##gwas.` are the info about that study extracted from the study table.
- Fields beginning with `##counts.` are the metrics regarding the harmonisation procedure.

It is my sense that for the report the only thing that might need to be 'extracted' and used by the script is the `gwas.trait` field - the rest can just be displayed in a way that is easy to read.
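Pulling these sections out of the header text (e.g. the output of `bcftools view -h data.bcf`) could look like this sketch; the key/value format and the example keys are assumptions about how the metadata lines are written:

```python
# Sketch of parsing ##gwas. and ##counts. metadata lines from a bcf
# header dumped as text. Assumes simple "##section.key=value" lines;
# the exact format of the real headers may differ.

def parse_gwas_header(header_lines):
    meta = {"gwas": {}, "counts": {}}
    for line in header_lines:
        for section in ("gwas", "counts"):
            prefix = "##" + section + "."
            if line.startswith(prefix) and "=" in line:
                key, _, value = line[len(prefix):].partition("=")
                meta[section][key] = value.strip()
    return meta
```

With this, the report script would only need `meta["gwas"]["trait"]` programmatically and could render the rest of the dict as-is.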