MRCIEU / opengwas-reports

Report module for IEU GWAS pipeline

Discussion #4

Open explodecomputer opened 5 years ago

explodecomputer commented 5 years ago

Here is an updated example of a BCF file on bluecrystal4:

/mnt/storage/private/mrcieu/research/mr-eve/gwas-files/2/data.bcf

Ultimately we would like to have a report for all gwas-files/*/data.bcf. We would also like the report to be run on new datasets as they are uploaded, which will be in a separate location.

Changes in the bcf file format:

  1. There will be a sample size column (N)
  2. The B column is now called BETA
  3. The PVAL column has been renamed L10PVAL and now stores -log10 p-values rather than raw p-values. This is because bcf stores values as floating-point numbers, which can lead to rounding errors (very small p-values, e.g. 1e-50, underflow to 0 in 32-bit floats)
  4. The header now has values stored like this:
##gwas.access=public
##gwas.author=Neale
##gwas.category=NA
##gwas.consortium=Neale Lab
##gwas.filename=Allele.reorder.20111_8.assoc.tsv.gz.tab
##gwas.id=UKB-a:222
##gwas.mr=1
##gwas.ncase=54270
##gwas.ncontrol=208091
##gwas.note=Data from http://www.nealelab.is/blog/2017/9/11/details-and-considerations-of-the-uk-biobank-gwas
##gwas.nsnp=10894596
##gwas.path=/projects/MRC-IEU/research/data/ukbiobank/summary/gwas/dev/cleaned_597_files/
##gwas.pmid=0
##gwas.population=European
##gwas.priority=1
##gwas.sample_size=259921
##gwas.sd=0
##gwas.sex=Males and Females
##gwas.subcategory=NA
##gwas.trait=Illnesses of siblings: High blood pressure
##gwas.unit=NA
##gwas.year=2017
##counts.total_variants=10879181
##counts.variants_not_read=0
##counts.variants_with_missing_stats=1
##counts.variants_with_missing_pvals=0
##counts.id.exposure=ooMO7A
##counts.id.outcome=6KI48H
##counts.alleles=2-2
##counts.switched_alleles=3
##counts.flipped_alleles_basic=33884
##counts.flipped_alleles_palindrome=0
##counts.incompatible_alleles=46792
##counts.candidate_variants=10545186
##counts.variants_absent_from_reference=0
##counts.total_variants_for_mr=10498394
##counts.proxy_variants=0
##counts.harmonised_variants=10498394
##counts.variants_not_harmonised=46792
##counts.total_remaining_variants=10545186

Metadata fields beginning with ##gwas. are the study information extracted from the study table; fields beginning with ##counts. are the metrics from the harmonisation procedure.

It is my sense that for the report the only thing that might need to be 'extracted' and used by the script is the gwas.trait field - the rest can just be displayed in a way that is easy to read.
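For illustration, a minimal Python sketch of pulling those header fields out (assuming bcftools is on the path); everything else can then just be rendered as a table:

# Sketch: pull the ##gwas. / ##counts. header fields out of a BCF file.
# Assumes bcftools is available; the path is the example from this issue.
import subprocess

def read_gwas_header(bcf_path):
    """Return {'gwas': {...}, 'counts': {...}} parsed from ##gwas./##counts. lines."""
    header = subprocess.run(
        ["bcftools", "view", "-h", bcf_path],
        check=True, capture_output=True, text=True
    ).stdout
    meta = {"gwas": {}, "counts": {}}
    for line in header.splitlines():
        if line.startswith("##gwas.") or line.startswith("##counts."):
            key, _, value = line[2:].partition("=")     # e.g. "gwas.trait", "Illnesses of ..."
            group, _, field = key.partition(".")
            meta[group][field] = value
    return meta

meta = read_gwas_header("/mnt/storage/private/mrcieu/research/mr-eve/gwas-files/2/data.bcf")
trait = meta["gwas"]["trait"]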

explodecomputer commented 5 years ago

The basic workflow: 1) extract the relevant columns from harmonised.bcf back to a tabular file, read it into R, and join it with the reference allele frequency column from the reference data's tabular file; 2) calculate metrics, read the metadata metrics file and the mrbase metadata, and render the plots and report.
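Purely as a sketch of step 1 (in Python here; the R version would be analogous), assuming BETA, SE, L10PVAL and N are stored as INFO fields and that the reference frequency file is a TSV with ID and AF columns - these are assumptions, adjust to the actual layout:

# Sketch: dump summary stats from the harmonised BCF and join on reference allele frequency.
import io
import subprocess
import pandas as pd

FMT = "%ID\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/BETA\t%INFO/SE\t%INFO/L10PVAL\t%INFO/N\n"
cols = ["ID", "CHROM", "POS", "REF", "ALT", "BETA", "SE", "L10PVAL", "N"]

raw = subprocess.run(
    ["bcftools", "query", "-f", FMT, "harmonised.bcf"],
    check=True, capture_output=True, text=True
).stdout
gwas = pd.read_csv(io.StringIO(raw), sep="\t", names=cols, na_values=".")

# Reference allele frequencies (hypothetical file and column names).
ref_af = pd.read_csv("reference_af.tsv", sep="\t", usecols=["ID", "AF"])
gwas = gwas.merge(ref_af, on="ID", how="left")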

The things I think need refactoring:

  • The input gwas can optionally be a tabular file.

Don't work too hard on making it very general - I think generally what will happen is

  1. Data will be harmonised against the genome reference and turned into bcf. This involves a form of QC, checking that the alleles and SNP identifiers align to the genome
  2. The report generated from the bcf predominantly looks at the summary statistics
  • The metadata metrics file will be deprecated and the metadata metrics read from the bcf header; if the input gwas is a tabular file, the metadata section in the report will be empty. Until we have uniform rules on which metrics should be included in the bcf, all header information will be included in the report.

Sounds good

  • Regarding the reference data, what I read from your earlier email is that each gwas is uniquely associated with a reference dataset? In that sense, what are the reference datasets associated with the uk biobank gwas?

The reference data is essentially a list of positions in the genome with the canonical reference allele based on that version of the human genome. It is there to harmonise the uk bb gwas so that the effect allele is always related to the non-reference allele. See here https://ieugit-scmv-d0.epi.bris.ac.uk/ml18692/gwas_harmonisation
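Roughly, the basic non-palindromic harmonisation step amounts to something like the sketch below (hypothetical column names, not the actual gwas_harmonisation code):

# Sketch: make the effect allele the non-reference allele at each position.
# Columns are hypothetical: effect_allele, other_allele, beta, eaf, ref_allele
# (the canonical reference allele at that position from the reference data).
import pandas as pd

def harmonise_to_reference(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    flip = df["effect_allele"] == df["ref_allele"]   # effect allele is the reference allele
    df.loc[flip, ["effect_allele", "other_allele"]] = (
        df.loc[flip, ["other_allele", "effect_allele"]].values
    )
    df.loc[flip, "beta"] = -df.loc[flip, "beta"]     # flip effect direction
    df.loc[flip, "eaf"] = 1 - df.loc[flip, "eaf"]    # and the allele frequency
    return df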

  • I think the systematic QC metrics can be added to the report module as a sub-component, and produced as a "qc-metrics.json" which can then be read into the rmarkdown report, or used as-is. To systematically QC the 20,000 uk biobank gwas we can produce a qc-metrics.json for each of them and then aggregate these metrics together.

Sounds sensible
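Aggregation could then be as simple as the sketch below (it assumes each qc-metrics.json is a flat key/value object sitting next to the data - a hypothetical layout):

# Sketch: combine per-study qc-metrics.json files into one table for meta-QC.
import glob
import json
import pandas as pd

records = []
for path in glob.glob("gwas-files/*/qc-metrics.json"):   # hypothetical layout
    with open(path) as fh:
        records.append(json.load(fh))                    # assumes flat {"metric": value, ...}

metrics = pd.DataFrame.from_records(records)
metrics.to_csv("meta-qc-metrics.csv", index=False)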

  • Regarding Phil's point on allowing user access to the reports, is there any potentially sensitive information that should only be included in a separate internal version? In the current dev branch I use Hmisc::describe for the summary stats, and I am thinking of adding head/tail previews of the input data.

I think generally there is not going to be anything sensitive

explodecomputer commented 5 years ago

Chris, in response to your points:

  1. There is now a sample size column for the legacy data. None of the previously formatted datasets have ncase and ncontrol, so I can't do this unless somebody reformats those datasets. We could try to do it for datasets going forwards.
  2. There were 2.5m SNPs in the GIANT BMI dataset because it was a hapmap2 study. The metadata in the bcf files has the info about how many variants are read, how many are removed for different reasons, how many are left, etc.
  3. What the bcf conversion does now is: align SNPs to a reference (checking chr, pos, alleles), and set non-finite values to missing for all numeric columns (sketch below). My worry about the qc.py script is that it might remove a lot of things that we want to keep, e.g. indels, multi-allelic variants, variants that don't have an rs id, etc. If you could identify the specific things related to the numeric columns that need to be done, that would be very useful.
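The non-finite handling in point 3 is conceptually something like this sketch (assuming the stats are loaded into a pandas DataFrame):

# Sketch: set non-finite values (inf, -inf, nan) to missing in the numeric columns.
import numpy as np
import pandas as pd

def clean_numeric(df: pd.DataFrame, cols=("BETA", "SE", "L10PVAL", "N")) -> pd.DataFrame:
    df = df.copy()
    for col in cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")      # nonsense strings -> NaN
        df[col] = df[col].replace([np.inf, -np.inf], np.nan)   # non-finite -> missing
    return df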

Finally, there is no harm in doing this iteratively. If we version the reports, then as long as it produces the main things to begin with (basically what Yi has implemented already; the most important are lambda and the QQ plot) and it functions reliably, we can incrementally add extra QC metrics over time. My strong preference would be to complete the basic programmatic interface first and then add extra functionality afterwards. This means we can get on with the rest of the pipeline.
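For reference, lambda can be computed straight from the L10PVAL column; a minimal sketch, assuming the column stores -log10 p-values as described above:

# Sketch: genomic inflation factor (lambda GC) from -log10 p-values.
import numpy as np
from scipy.stats import chi2

def lambda_gc(l10pval):
    pvals = np.power(10.0, -np.asarray(l10pval, dtype=float))
    chisq = chi2.isf(pvals, df=1)                         # back-transform p to chi-square(1)
    return float(np.median(chisq) / chi2.ppf(0.5, df=1))  # median ratio, denominator ~0.455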

YiLiu6240 commented 5 years ago

Hi @explodecomputer thanks for the new data and opening the discussion. It would be good to have a centralised place (maybe https://github.com/MRCIEU/mr-base-structure or a public repo) to document 1) the overall architecture of the gwas pipeline, and 2) each component of the pipeline and their inputs / outputs.

I have finished refactoring the report module, and regarding @elswob's ukbiobank gwas, once we can hopefully settle on the bcf format for this dataset it will be easy to compute systematic qc metrics for the whole of it.

ps. the beta column is called "EFFECT" in data.bcf and I will use this format till we have a new format in place.

mightyphil2000 commented 5 years ago

Thanks Gib.

Another thing we have to do is produce a metaQC report at the very end that includes and compares all QC metrics for all studies in one file, e.g. multiple scatter plots comparing the QC metrics against sample size for all the studies. IMO this will be a very effective way of spotting outlying and problematic studies.
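Something along these lines, per metric (a sketch; the aggregated metrics file and its column names are hypothetical):

# Sketch: scatter each QC metric against sample size across all studies.
import matplotlib.pyplot as plt
import pandas as pd

metrics = pd.read_csv("meta-qc-metrics.csv")   # one row per study (hypothetical file)
for metric in ["lambda_gc", "n_missing_stats", "prop_flipped"]:   # hypothetical columns
    fig, ax = plt.subplots()
    ax.scatter(metrics["sample_size"], metrics[metric], s=8)
    ax.set_xlabel("sample size")
    ax.set_ylabel(metric)
    ax.set_title(f"{metric} vs sample size across studies")
    fig.savefig(f"metaqc_{metric}.png", dpi=150)
    plt.close(fig)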

explodecomputer commented 5 years ago

Thanks @YiLiu6240. Another change made over the weekend after discussion with @mcgml: it might be easier not to have the gwas.* fields, because they are liable to change, and just have the gwas.id so that you can retrieve the info from the API (e.g. for study 2: http://apitest.mrbase.org/gwasinfo/2).
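For the report that would mean looking the study up by id, e.g. the sketch below (assuming the gwasinfo endpoint returns JSON - an assumption on my part):

# Sketch: look up study metadata from the MR-Base API by gwas.id.
import requests

def get_gwasinfo(gwas_id, base_url="http://apitest.mrbase.org"):
    resp = requests.get(f"{base_url}/gwasinfo/{gwas_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()   # assumed to contain trait, sample size, etc.

info = get_gwasinfo(2)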

@mightyphil2000 agree, can you create a separate issue so that it is registered as a thing to do? I think that might be the way to organise what needs to be done actually - if we just make a separate issue for every new feature

I'll try to put things into a more structured overview today

mightyphil2000 commented 5 years ago

@explodecomputer ok will do sorry!

explodecomputer commented 5 years ago

thanks @mightyphil2000 i'm excited

mightyphil2000 commented 5 years ago

so am I @explodecomputer

How do I add people to an issue?

explodecomputer commented 5 years ago

@mightyphil2000 i think you can assign people by clicking the cog symbol next to assignees, which may be intended to mean that those people will implement it. but other than that i think everyone who has access to this repository can see the issues

mightyphil2000 commented 5 years ago

ok did you see the new issue created? is it beautiful?

explodecomputer commented 5 years ago

love it

mightyphil2000 commented 5 years ago

At which point in the QC pipeline does the number of monomorphic SNPs, or SNPs with missing or nonsense values, get calculated? Will it be possible to include this information in the QC report? @explodecomputer I think Chris should be part of this discussion but I can't add him to this issue?
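For context, once the summary stats are in tabular form these counts are cheap to compute; a sketch with hypothetical column names (EAF for effect allele frequency):

# Sketch: count monomorphic SNPs and rows with missing or non-finite statistics.
import numpy as np
import pandas as pd

def qc_counts(df: pd.DataFrame) -> dict:
    stats = df[["BETA", "SE", "L10PVAL"]].apply(pd.to_numeric, errors="coerce")
    monomorphic = df["EAF"].isin([0.0, 1.0])       # only one allele observed
    nonsense = ~np.isfinite(stats).all(axis=1)     # nan / inf in any stat column
    return {
        "n_variants": len(df),
        "n_monomorphic": int(monomorphic.sum()),
        "n_missing_or_nonsense": int(nonsense.sum()),
    }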