MRCIEU / opengwas-reports

Report module for IEU GWAS pipeline
1 stars 0 forks source link

Update to new vcf format #18

Open explodecomputer opened 5 years ago

explodecomputer commented 5 years ago

After some discussion the bcf format is moving to a slightly modified vcf format (see https://github.com/MRCIEU/gwas_harmonisation/issues/15 for info)

I have made a new branch in this repository called 'vcf' which updates some of the functions to operate with the new format.

I'm struggling to get the report to generate though, e.g. in this directory

gwasdir="/mnt/storage/private/mrcieu/research/mr-eve/gwas-files/"
id="IEU-a:23"
refdata="/mnt/storage/private/mrcieu/research/mr-eve/vcf-reference-datasets/1000g/1kg_v3_nomult.bcf"

Rscript render_gwas_report.R \
        ${gwasdir}/${id}/${id}.vcf.gz \
        --output_dir ${gwasdir}/${id} \
        --refdata ${refdata} \
        --n_cores 1

I'm getting this error:

Quitting from lines 23-37 (template.Rmd)
Error in open.connection(con, "rb") : cannot open the connection
Calls: do.call ... parse_and_simplify -> parseJSON -> parse_con -> open -> open.connection
In addition: There were 11 warnings (use warnings() to see them)

Execution halted

As far as I can tell it is failing to find the metadata file, when generating the report, but not when running everything pre-report.

YiLiu6240 commented 5 years ago

I will have a go at this example file and see what comes out of it.

The error is most likely due to failure to fetch from the API but I will try to solve it along other issues.

mcgml commented 5 years ago

I'm annotating new VCF files with fixed dbsnp build and allele frequencies so no need to use an external ref_data anymore. @YiLiu6240 can you remove that option and take from within the file instead?

The tags are:

Cheers

YiLiu6240 commented 5 years ago

@explodecomputer @mcgml

The codebase has been almost updated and merged (branch master-update) to work with Gib's vcf format. I tried to minimise breaking changes with the existing codebase as much as possible. For example column names are still legacy column names and not the new abbreviations in the dataset.

There is a sample at bc4:/mnt/storage/private/mrcieu/research/scratch/IGD/data/public/report-module-sample. There are some sbatch files to show how to activate the conda environment and produce the report. Gib can you check you could use the new codebase to produce the reports for IEU-a datasets. Note that there is a "--id" arg to overwrite/specify gwas ids. Once you confirm this branch can work with your datasets I will push the changes to master.

As for further changes, at the moment the harmonised datasets from the two places (/mnt/storage/private/mrcieu/research/scratch/IGD/data/public/ukbiobank/vcf_09_19; /mnt/storage/private/mrcieu/research/scratch/IGD/data/public/processed) are in different formats. What is the current status of the agreed format of a harmonised dataset? It is better to avoid adding toggles and options at the report stage.

explodecomputer commented 5 years ago

Thanks @YiLiu6240 I have tried it on a few files and it seems to be working great.

Is the --id flag there because the api expects e.g. "2" but we are calling the files "IEU-a-2" ?

YiLiu6240 commented 5 years ago

Thanks @YiLiu6240 I have tried it on a few files and it seems to be working great.

Is the --id flag there because the api expects e.g. "2" but we are calling the files "IEU-a-2" ?

Yes, by default the ID will be parsed from "IEU-a-2.vcf.gz" as "IEU-a-2", but it is saner to specify it manually right now (produced report is always named as "IEU-a-2_report.html").

mcgml commented 5 years ago

@YiLiu6240 There is a VCF spec available here: https://github.com/MRCIEU/gwas_vcf_spec. Let me know if it needs more detail. Tom is requesting some data go through the pipeline. Are the changes you made ready for merging to master? Cheers Matt

YiLiu6240 commented 5 years ago

@mcgml I had a quick look at the spec, and the report code should run through and not fail but the format is not yet compliant with this and content of the report needs revision. As for codebase I the current HEAD is tied with master.

However I will need to update the codebase to be fully compliant with this spec -- is there a sample dataset I can test with?

As for the spec, is the AF field in INFO from the reference dataset associated with this VCF file? In other words, when there is the AF field present in the VCF then I don't need to annotate with a reference dataset?

I wonder whether we should document to what extent the formats from the IEU-a gwas and the UKB-b gwas agree/disagree with each other and with the common spec? I will also need to know these differences to account for them.

mcgml commented 5 years ago

Thanks @YiLiu6240 What do you use the AF field for? The pipeline will add the AF from 1kg so you shouldn't need to maintain a separate file. You can see an example file here: ieu-db-interface:/data/igd/IEU-b-1/IEU-b-1_data.vcf.gz. There shouldn't be any differences between UKB and MRB GWAS files but there are small differences between case/control & continuous outcomes (I'll make sure docs are clear on that: https://github.com/MRCIEU/gwas_vcf_spec/issues/4)

explodecomputer commented 5 years ago

I recreated the IEU-a and UKB-a using the same pipeline as UKB-b, so it should be consistent now

mcgml commented 5 years ago

@explodecomputer does gwasvcftools provide a VCF reading API?

explodecomputer commented 5 years ago

Yes I was just writing some documentation for it. Will link when it's done