FINNGEN / autoreporting

MIT License

Tried to run LD-based reporting, but by the stack trace it looks like it still tries to do cred-based #110

Closed Fedja closed 3 years ago

Fedja commented 4 years ago

Using the latest Docker image in the container registry: eu.gcr.io/finngen-refinery-dev/autorep:4d40b2b

main.py /cromwell_root/finngen_commons/est_ukb_meta/RHEUMA_SEROPOS_meta_out.gz \
--sign-treshold 5e-08 --alt-sign-treshold 0.01 \
--group --grouping-method ld --locus-width-kb 1500 \
--ld-panel-path /cromwell_root/finngen-imputation-panel/sisu3/wgs_all --ld-r2 0.4 --plink-memory 17000 \
--include-batch-freq --finngen-path /cromwell_root/r4_data_west1/annotations/R4_annotated_variants_v1.gz \
--functional-path /cromwell_root/r4_data_west1/gnomad_functional_variants/fin_enriched_genomes_select_columns.txt.gz \
--gnomad-genome-path /cromwell_root/fg-datateam-analysisteam-share/gnomad/2.1/genomes/gnomad.genomes.r2.1.sites.liftover.b38.finngen.r2pos.af.ac.an.tsv.gz \
--gnomad-exome-path /cromwell_root/fg-datateam-analysisteam-share/gnomad/2.1/exomes/gnomad.exomes.r2.1.sites.liftover.b38.finngen.r2pos.af.ac.an.tsv.gz \
--finngen-annotation-version r4 --use-gwascatalog --ld-treshold 0.7 --ldstore-threads 4 \
--gwascatalog-threads 8 --strict-group-r2 0.5 --gwascatalog-pval 5e-08 \
--gwascatalog-width-kb 25 --db local \
--column-labels #CHR POS REF ALT all_inv_var_meta_p all_inv_var_meta_beta FINNGEN_AF.Controls FINNGEN_AF.Cases FINNGEN_AF.Controls \
--local-gwascatalog /cromwell_root/r4_data_west1/autoreporting/gwas-catalog-associations_ontology-annotated-191007.tsv \
--efo-codes EFO_1001999 EFO_0002609 EFO_0000685 --ignore-region 6:23000000-38000000 \
--custom-dataresource /cromwell_root/r4_data_west1/autoreporting/custom_dataresource_r4_2020_03_25.tsv \
--fetch-out RHEUMA_SEROPOS.fetch.out --annotate-out RHEUMA_SEROPOS.annotate.out \
--report-out RHEUMA_SEROPOS.report.out --top-report-out RHEUMA_SEROPOS.top.out \
--ld-report-out RHEUMA_SEROPOS.ld.out

--- phenotype RHEUMA_SEROPOS RETURN CODE: 1 ---
--- phenotype RHEUMA_SEROPOS STDOUT ---
input file: /cromwell_root/finngen_commons/est_ukb_meta/RHEUMA_SEROPOS_meta_out.gz
filter & group SNPs
Traceback (most recent call last):
  File "/usr/local/bin/main.py", line 135, in <module>
    main(args)
  File "/usr/local/bin/main.py", line 46, in main
    ignore_region=args.ignore_region, cred_set_file=args.cred_set_file,ld_api=ld_api)
  File "/usr/local/bin/gws_fetch.py", line 231, in fetch_gws
    temp_df = merge_credset(temp_df,cs_df,gws_fpath,columns)
  File "/usr/local/bin/gws_fetch.py", line 191, in merge_credset
    df = pd.concat( [gws_df,cred_row_df], axis="index", ignore_index=True, sort=False).drop_duplicates(subset=list( join_cols ) )
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 258, in concat
    return op.get_result()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 473, in get_result
    mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 2059, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 143, in __init__
    self._verify_integrity()
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 350, in _verify_integrity
    "tot_items: {1}".format(len(self.items), tot_items)
AssertionError: Number of manager items must equal union of block items
# manager items: 8, # tot_items: 9

Lipastomies commented 4 years ago

It fails in the part where we join the credible variants to the summary stat dataframe, which happens before we group the data.

I tested locally, and the bug seems to be that concatenating an empty dataframe to a non-empty dataframe does not work if there are duplicated columns. (FINNGEN_AF.Controls is duplicated because there's no generic FG AF in the data.)
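
The duplicated label described above can be caught before any dataframe is built. The following is a minimal sketch, not code from autoreporting: `find_duplicate_labels` is a hypothetical helper, and the label list is copied from the failing `--column-labels` call. The call passes 9 labels of which 8 are distinct, which appears to line up with the 8 and 9 in the assertion message, though that link is an inference.

```python
from collections import Counter


def find_duplicate_labels(labels):
    """Return the labels that occur more than once, in first-seen order.

    Hypothetical helper for illustration; not part of autoreporting.
    """
    counts = Counter(labels)
    seen = set()
    duplicates = []
    for label in labels:
        if counts[label] > 1 and label not in seen:
            seen.add(label)
            duplicates.append(label)
    return duplicates


# Labels as passed to --column-labels in the failing run.
column_labels = [
    "#CHR", "POS", "REF", "ALT", "all_inv_var_meta_p",
    "all_inv_var_meta_beta", "FINNGEN_AF.Controls",
    "FINNGEN_AF.Cases", "FINNGEN_AF.Controls",
]

duplicates = find_duplicate_labels(column_labels)
n_total = len(column_labels)
n_unique = len(set(column_labels))

if duplicates:
    print(f"duplicated column labels: {duplicates} "
          f"({n_unique} unique out of {n_total})")
```

On an already-loaded dataframe, `df.columns.duplicated()` in pandas gives the same information as a boolean mask.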

What I think needs changing:

  1. Skip the CS merging when no CS was provided, so that we need less error handling.
  2. Currently the included columns are fixed (chr pos ref alt pval beta af af_cases af_controls), but not all data we want to use has these. I think it would make sense to separate these into those that are necessary for the script to work (chr pos ref alt pval), and those that we want to include in the result files (beta, af, af_cases, af_controls, rsid?). This could reduce these types of errors, and would add more flexibility for the data.
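
The two changes proposed above could look roughly like this. It is a sketch under assumptions, not the actual autoreporting code: `build_parser`, `plan_columns`, `should_merge_credsets` and the internal names in `REQUIRED` are hypothetical, and the `--cred-set-file` flag is inferred from the `args.cred_set_file` attribute in the traceback.

```python
import argparse

# Hypothetical internal names for the columns the script cannot work without.
REQUIRED = ("chrom", "pos", "ref", "alt", "pval")


def build_parser():
    parser = argparse.ArgumentParser()
    # Exactly five labels: the columns needed for the script to work.
    parser.add_argument("--column-labels", nargs=5, required=True,
                        metavar=("CHROM", "POS", "REF", "ALT", "PVAL"))
    # Any number of pass-through columns for the result files.
    parser.add_argument("--extra-cols", nargs="*", default=[])
    # Assumed flag name, inferred from args.cred_set_file in the traceback.
    parser.add_argument("--cred-set-file", default="")
    return parser


def plan_columns(args):
    """Map required names to labels and drop repeated extra labels."""
    required = dict(zip(REQUIRED, args.column_labels))
    seen = set(required.values())
    extras = []
    for col in args.extra_cols:
        if col not in seen:  # skip labels already selected
            seen.add(col)
            extras.append(col)
    return required, extras


def should_merge_credsets(args):
    """Point 1: only merge credible sets when a file was given."""
    return bool(args.cred_set_file)


args = build_parser().parse_args([
    "--column-labels", "#CHR", "POS", "REF", "ALT", "all_inv_var_meta_p",
    "--extra-cols", "all_inv_var_meta_beta", "FINNGEN_AF.Controls",
    "FINNGEN_AF.Cases", "FINNGEN_AF.Controls",
])
required, extras = plan_columns(args)
merge_needed = should_merge_credsets(args)
```

With this split, a repeated extra label such as FINNGEN_AF.Controls is selected once, and the merge step is skipped entirely when no credible set file is passed.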

Fedja commented 4 years ago

Yep, do 1 for sure. As for 2: in meta I put these kinds of extra columns in --extra-cols, which takes a comma-separated list of columns to output.

Lipastomies commented 4 years ago

Relevant PRs: #111 #112

Lipastomies commented 4 years ago

As of 9ad5bc5, I did successfully run this after removing the duplicate columns in the call. Can you try on eu.gcr.io/finngen-refinery-dev/autorep:9ad5bc5 and check if it works? The wdl and json files for docker should be up to date. Specifically, I ran:


main.py RHEUMA_SEROPOS_meta_out.gz \
--sign-treshold 5e-08 --alt-sign-treshold 0.01 \
--group --grouping-method ld --locus-width-kb 1500 \
--ld-panel-path wgs_all --ld-r2 0.4 --plink-memory 21000 \
--ld-api online --include-batch-freq --finngen-path R4_annotated_variants_v1.gz \
--functional-path fin_enriched_genomes_select_columns.txt.gz \
--gnomad-genome-path gnomad.genomes.r2.1.sites.liftover.b38.finngen.r2pos.af.ac.an.tsv.gz \
--gnomad-exome-path gnomad.exomes.r2.1.sites.liftover.b38.finngen.r2pos.af.ac.an.tsv.gz \
--finngen-annotation-version r4 --use-gwascatalog --ld-treshold 0.7 --ldstore-threads 4 \
--gwascatalog-threads 8 --strict-group-r2 0.5 --gwascatalog-pval 5e-08 \
--gwascatalog-width-kb 25 --db gwas --column-labels "#CHR" POS REF ALT all_inv_var_meta_p \
--extra-cols all_inv_var_meta_beta FINNGEN_AF.Controls FINNGEN_AF.Cases \
--efo-codes EFO_1001999 EFO_0002609 EFO_0000685 --ignore-region 6:23000000-38000000 \
--fetch-out RHEUMA_SEROPOS.fetch.out --annotate-out RHEUMA_SEROPOS.annotate.out \
--report-out RHEUMA_SEROPOS.report.out --top-report-out RHEUMA_SEROPOS.top.out \
--ld-report-out RHEUMA_SEROPOS.ld.out