UKB sumstats wrangling & ingest

ljwh2 commented 8 months ago

Data associated with this project https://www.medrxiv.org/content/10.1101/2023.12.06.23299426v1 has been shared with Open Targets and needs ingesting into the GWAS Catalog. The data is split into separate files per chromosome, with roughly 35M variants per GWAS.

[x] Liaise with David & Daniel at Open Targets to get access to the data (currently in their cloud storage)
[x] Combine chromosome-specific files into a single file per GWAS (study)
- [x] Quant
- [x] Binary - restarted merge for corrupt studies
[x] Reformat to GWAS-SSF
- [x] Quant - find at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats_quant/
- [x] Binary - find at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats/
[x] Check file integrity of formatted files - find them at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/
[x] Clean up the rows with TEST_FAIL in the EXTRA column - SLURM jobs done but there are errors e.g. files with some chrs missing:
- [x] Investigate files with some chromosomes missing - did some investigation, only studies below are due to formatting error caused by our pipeline. Others come from the original files.
- [x] Reformat m06_AFR, m54_SAS, m77_ASJ
[x] Verify that the validation pipeline can cope with files of this size
- [x] Quant - sample file passed the validation
[x] Wrangle the metadata template (see email attachment) to match the files, i.e., copy files to private ftp and compare md5sum values of the ones in private ftp and aws/formatted_long/
- [x] Quant
- [x] Binary
[x] Create sandbox submission
- [x] Quant - all validated https://wwwdev.ebi.ac.uk/gwas/deposition/submission/66f296769a68730001ae3633
- [x] Binary - all validated https://wwwdev.ebi.ac.uk/gwas/deposition/submission/671f89519a6873000169987e
- [x] Binary & Quant together in sandbox (see Slack for details) - https://wwwdev.ebi.ac.uk/gwas/deposition/submission/67210e7f9a687300019236c1
[ ] Before queuing for harmonisation or making a submission in prod, harmonise one file and check variant dropout rate. Discuss results before proceeding.
- [ ] Quant
- [ ] Binary
[ ] Create submission on behalf of the author for immediate release (not under embargo)
- [ ] Binary & Quant together in prod - see Slack for details https://gwas-catalog.slack.com/archives/C02N9FSUDCL/p1729789741784489
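
For the "files with some chromosomes missing" check in the list above, a minimal sketch of how one formatted file could be inspected. It assumes a gzipped, tab-separated file with a `chromosome` column (as in the formatted examples in this thread); the function names are hypothetical, and the default expected set of autosomes 1-22 is an assumption that should be adjusted if X/Y were analysed.

```python
import gzip


def chromosomes_present(path, column="chromosome"):
    """Return the set of values found in the chromosome column of a gzipped TSV."""
    seen = set()
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        idx = header.index(column)  # raises ValueError if the column is absent
        for line in handle:
            seen.add(line.rstrip("\n").split("\t")[idx])
    return seen


def missing_chromosomes(path, expected=None):
    """List expected chromosomes that never appear in the file."""
    if expected is None:
        # Assumption: autosomes 1-22 only.
        expected = {str(i) for i in range(1, 23)}
    absent = set(expected) - chromosomes_present(path)
    # Numeric labels first in numeric order, then any non-numeric labels.
    return sorted(absent, key=lambda c: (not c.isdigit(), int(c) if c.isdigit() else 0, c))
```

Running `missing_chromosomes` on both the original per-chromosome inputs and the formatted output would help separate gaps introduced by the pipeline from gaps already present at the source.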

@earlEBI can provide support in interpreting the template and especially with the template wrangling and submission steps.

karatugo commented 7 months ago

Started copying files to /hps/nobackup

      65202370  standard goci1267  spotbot  R    1:13:10      1 hl-codon-111-03

karatugo commented 7 months ago

Copying complete. Now comparing MD5 checksums of the copied files to those listed in md5sums.txt.

karatugo commented 7 months ago

Submitted a SLURM job to calculate and compare md5sums of files in GCP and Codon.

      66126819  standard md5sum-c  spotbot  R      01:26      1 hl-codon-09-03

karatugo commented 7 months ago

I calculated and compared the MD5 sums of files in GCP and Codon. They matched.

karatugo commented 7 months ago

Submitted two SLURM jobs to combine chromosome-specific files into a single file per GWAS (study):

         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      66784780 datamover    merge  spotbot  R       2:20      1 codon-dm-05
        162292 datamover merge-re  spotbot  R       0:24      1 codon-dm-05

Note that all .txt.gz files are in gwas_summary_stats and all .regenie.gz files are in gwas_summary_stats_quant.
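
A minimal sketch of the merge step, for reference. This is not the script used by the jobs above; it assumes each per-chromosome file is a gzipped text table with an identical header line, and `merge_chromosome_files` is a hypothetical name. The caller decides the order of `parts` (for example, sorted by chromosome).

```python
import gzip


def merge_chromosome_files(parts, out_path):
    """Concatenate gzipped per-chromosome tables, writing the header once.

    Returns the number of data rows written.
    """
    rows = 0
    expected_header = None
    with gzip.open(out_path, "wt") as out:
        for part in parts:
            with gzip.open(part, "rt") as handle:
                header = handle.readline()
                if not header:
                    raise ValueError(f"empty file: {part}")
                if expected_header is None:
                    expected_header = header
                    out.write(header)
                elif header != expected_header:
                    raise ValueError(f"header mismatch in {part}")
                for line in handle:
                    # Guard against a final line without a trailing newline,
                    # which would otherwise fuse with the next file's first row.
                    if not line.endswith("\n"):
                        line += "\n"
                    out.write(line)
                    rows += 1
    return rows
```

Because every input is decompressed to the end, a corrupt or truncated part raises an exception here instead of silently producing a short combined file.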

karatugo commented 6 months ago

I've noticed that some file MD5 sums are missing, for example for ./gwas_summary_stats/j92/. Additionally, there are warnings indicating that these files are corrupt when I try to combine them.

I did not detect this issue earlier because the affected files lack entries in md5sums.txt. I have reached out to Annalisa about this.

karatugo commented 5 months ago

I have access to S3 now.

karatugo commented 5 months ago

Submitted two SLURM jobs for the data copy to codon, namely, cp_ukb_aws_gwas_summary_stats and cp_ukb_aws_gwas_summary_stats_quant using the sbatch scripts /homes/spotbot/goci-1267/cp_from_aws_gwas_summary_stats.sh and /homes/spotbot/goci-1267/cp_from_aws_gwas_summary_stats_quant.sh

karatugo commented 5 months ago

Every file in our directory matches exactly with the files listed in md5sums.txt, and vice versa. See the script at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/compare_md5sums.sh
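
For illustration, a sketch of a two-way comparison of this kind in Python. It is not the contents of compare_md5sums.sh; it assumes md5sums.txt uses the usual `md5sum` layout (`<hex digest>  <relative path>`), and the function names are hypothetical. Checking both directions matters because files with no entry in the listing would otherwise go unnoticed.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(listing):
    """Parse md5sum-style lines into {relative path: digest}."""
    expected = {}
    for line in Path(listing).read_text().splitlines():
        if not line.strip():
            continue
        checksum, name = line.strip().split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        if name.startswith("./"):
            name = name[2:]
        expected[name] = checksum.lower()
    return expected


def compare(root, listing):
    """Report files missing on either side and files whose digest differs."""
    root = Path(root)
    listing_path = Path(listing).resolve()
    expected = parse_md5sums(listing)
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.resolve() != listing_path
    }
    return {
        "missing_on_disk": sorted(set(expected) - on_disk),
        "missing_in_listing": sorted(on_disk - set(expected)),
        "mismatched": sorted(
            n for n in on_disk & set(expected) if md5_of(root / n) != expected[n]
        ),
    }
```

An all-clear result is three empty lists.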

karatugo commented 5 months ago

Submitted batch job 16272725 to compute and compare md5sums of the copied files.

karatugo commented 5 months ago

Computing and comparing the md5sums of the copied files is done. The values matched.

karatugo commented 5 months ago

Submitted batch job 16375825 to backup copied files.

karatugo commented 5 months ago

Backup of the copied files is complete.

karatugo commented 5 months ago

Submitted batch jobs 16648847 and 16648861 to combine chromosome-specific files into a single file per GWAS (study).

karatugo commented 5 months ago

16648861 - merge-regenie complete.

karatugo commented 5 months ago

For regenie studies:

No files found for blood_biochemistry_oest_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_oest_0 with ancestry EAS. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry AFR. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry SAS. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry EAS. Skipping...
No files found for blood_biochemistry_urma_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_urma_0 with ancestry EAS. Skipping...

karatugo commented 5 months ago

Submitted a gwas-ssf format SLURM job for formatting regenie files. Expect them here in 2 days: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted/gwas_summary_stats_quant/

karatugo commented 5 months ago

16648847: merge_gwas_summary_stats Ended

karatugo commented 5 months ago

Attached is the skipped list for gwas_summary_stats

gwas_sumstats_skipped_list.txt

karatugo commented 5 months ago

Unfortunately, the disk quota was exceeded for spot/gwas/scratch/ and some of the merge operations failed. I'll ask the Codon team how best to handle this.

karatugo commented 5 months ago

Talked to Codon team.

[ ] I'll move copied files and backup files to lts (and talk to Storage team if we don't already have space)
[x] Restart the merge job (ideally with parallel compression)

karatugo commented 5 months ago

Tested the validation with the file blood_biochemistry_ua_0_EAS_combined_formatted.regenie.gz, it worked okay

karatugo commented 4 months ago

merge job for gwas_summary_stats resumed.

Submitted batch job 22796315

karatugo commented 4 months ago

Done - a SLURM job for gathering data (e.g. md5sums of the files, calculating variant counts etc.) for the metadata template.
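
A sketch of how the two values named above could be gathered in one read of each file. This is not the job's actual script; `md5_and_row_count` is a hypothetical name, and it assumes a gzipped text file with exactly one header line and a newline at the end of every row, so that the variant count equals the number of newlines minus one.

```python
import hashlib
import zlib


def md5_and_row_count(path, chunk_size=1 << 20):
    """Return (md5 of the compressed bytes, number of data rows) in a single pass."""
    digest = hashlib.md5()
    decomp = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    newlines = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            data = chunk
            while data:
                newlines += decomp.decompress(data).count(b"\n")
                if decomp.eof:
                    # A gzip file may hold several concatenated members;
                    # start a fresh decompressor for whatever follows.
                    data = decomp.unused_data
                    decomp = zlib.decompressobj(wbits=31)
                else:
                    data = b""
    return digest.hexdigest(), max(newlines - 1, 0)
```

The MD5 is taken over the compressed file as stored, which is the value a checksum comparison against the private ftp copy would use.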

karatugo commented 4 months ago

Submitted a SLURM job for copying Quant files to the private ftp for test submission in sandbox.

      23899637 datamover cp-ukbb-quant-private-ftp                               spotbot  R  2:57:34    1      codon-dm-04

karatugo commented 4 months ago

Submission template is now ready. Lizzy updated it and fixed the errors.

karatugo commented 4 months ago

Possible errors found during the merge of Binary studies:

gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/n34/chr9_first_occurrence_n34_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/n34/chr9_first_occurrence_n34_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr18_first_occurrence_o35_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr18_first_occurrence_o35_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr21_first_occurrence_o35_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr21_first_occurrence_o35_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o91/chr20_first_occurrence_o91_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o91/chr20_first_occurrence_o91_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/p90_p96_other_disorders_originating_in_the_perinatal_period/chr7_first_occurrence_p90_p96_other_disorders_originating_in_the_perinatal_period_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/p90_p96_other_disorders_originating_in_the_perinatal_period/chr7_first_occurrence_p90_p96_other_disorders_originating_in_the_perinatal_period_NFE.txt.gz: invalid compressed data--length error

The md5sum values of the files matched the values in the list, so the files are possibly corrupted at the source.
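
A matching MD5 only shows that the copy is faithful to what was uploaded; it says nothing about whether the gzip stream itself is valid. A minimal sketch of a per-file integrity check in Python, similar in spirit to `gzip -t` (the function name `gzip_problem` is hypothetical):

```python
import gzip
import zlib


def gzip_problem(path, chunk_size=1 << 20):
    """Decompress the whole file; return an error description, or None if it reads cleanly."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        # gzip.BadGzipFile (bad CRC or length) is a subclass of OSError;
        # a truncated stream raises EOFError.
        return f"{type(exc).__name__}: {exc}"
    return None
```

Running this over every per-chromosome file before the merge would flag bad inputs up front instead of partway through a long job.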

karatugo commented 4 months ago

Submitted format jobs in SLURM using the gwas-ssf format command for Binary studies.

      23949330 short     temp_sbatch_script.sh                                   spotbot  R  2:30       1      hl-codon-130-01
      23949331 short     temp_sbatch_script.sh                                   spotbot  R  2:30       1      hl-codon-bm-10
      23949332 short     temp_sbatch_script.sh                                   spotbot  R  2:30       1      hl-codon-bm-10
      23949333 short     temp_sbatch_script.sh                                   spotbot  R  2:30       1      hl-codon-bm-10

karatugo commented 4 months ago

Job codon-slurm.23972158: compare-md5sum-quant-private-ftp Began

karatugo commented 4 months ago

Job codon-slurm.23972158: compare-md5sum-quant-private-ftp complete. No issues with md5sum values for Quant files in the private ftp and aws/formatted/gwas_summary_stats_quant.

karatugo commented 4 months ago

Submitted the test submission to Sandbox but an error popped up.

Error:

Jun-24 14:15:48.996 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'validate_study (39)'

Caused by:
  Process `validate_study (39)` terminated with an error exit status (1)

Command executed:

  validate-study -cid 6ny4ZfSZ -id 66797b6a2b4603000160cf9e -payload /hps/nobackup/parkinso/spot/gwas/data/sumstats/depo/dev/validate/6ny4ZfSZ/payload.json -storepath /hps/nobackup/parkinso/spot/gwas/data/sumstats/depo/dev/store -minrows None -zero_p True -forcevalid false -out "66797b6a2b4603000160cf9e.json" -validated_path /hps/nobackup/parkinso/spot/gwas/data/sumstats/depo/dev/validate

Command exit status:
  1

Command output:
  Validating extension
  --> Ok
  Validating column order
  --> Ok
  Validating minimum row count
  --> Ok
  Validating the first 100000 rows
  --> Ok
  Validating the rest of the file

Command error:
  (ERROR): Logging setup failed: [Errno 2] No such file or directory: '/var/log/gunicorn/sumstats-error.log'
  Traceback (most recent call last):
    File "/usr/local/bin/validate-study", line 33, in <module>
      sys.exit(load_entry_point('gwas-sumstats-service', 'console_scripts', 'validate-study')())
    File "/sumstats_service/sumstats_service/resources/validate_study.py", line 238, in main
      validate_study(
    File "/sumstats_service/sumstats_service/resources/validate_study.py", line 55, in validate_study
      study.validate_study(
    File "/sumstats_service/sumstats_service/resources/study_service.py", line 256, in validate_study
      ssf.validate_file() if forcevalid is False else True
    File "/sumstats_service/sumstats_service/resources/file_handler.py", line 181, in validate_file
      status, message = validator.validate()
    File "/usr/local/lib/python3.9/site-packages/gwas_sumstats_tools/validate.py", line 71, in validate
      for df in df_iter:
    File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
      return self.get_chunk()
    File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
      return self.read(nrows=size)
    File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1778, in read
      ) = self._engine.read(  # type: ignore[attr-defined]
    File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
      chunks = self._reader.read_low_memory(nrows)
    File "pandas/_libs/parsers.pyx", line 820, in pandas._libs.parsers.TextReader.read_low_memory
    File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
    File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
    File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
    File "/usr/local/lib/python3.9/_compression.py", line 68, in readinto
      data = self.read(len(byte_view))
    File "/usr/local/lib/python3.9/gzip.py", line 506, in read
      raise EOFError("Compressed file ended before the "
  EOFError: Compressed file ended before the end-of-stream marker was reached

Work dir:
  /hps/nobackup/parkinso/spot/gwas/data/sumstats/depo/dev/validate/6ny4ZfSZ/68/de98e4317c4a7fad4a461e98fb47bf

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
Jun-24 14:15:49.003 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Process `validate_study (39)` terminated with an error exit status (1)

for the file

{"id": "66797b6a2b4603000160cf9e", "filePath": "blood_biochemistry_ca_0_NFE_combined_formatted.regenie.gz", "md5": "03968288dc8eac3f20af4b03459b6bd1", "assembly": "GRCh38", "readme": "pre-print describing the study here: https://www.medrxiv.org/content/10.1101/2023.12.06.23299426v1", "entryUUID": "52cf68b0-b90b-4fd2-aa62-18ee8f01d197", "analysisSoftware": "regenie_v3.2.5"},

It seems like an error on our end. I'm running the formatter for the erroneous file again.

karatugo commented 4 months ago

Job codon-slurm.24489700: Began for Binary studies metadata template data wrangling.

karatugo commented 4 months ago

Job codon-slurm.24625625: cp-ukbb-binary-private-ftp Began

karatugo commented 4 months ago

Fixed the error here (https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/gh/ebispot/goci/1267#issuecomment-2186844923) by reformatting the file and submitted test submission to Sandbox again.

karatugo commented 4 months ago

Same error (https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/gh/ebispot/goci/1267#issuecomment-2188608204) with a different file. I suspect something went wrong during the formatting step (perhaps the wallclock limit?), so I asked @jiyue1214 for help.

karatugo commented 4 months ago

Test submission made for Binary studies in Sandbox.

Update: Fixed the samples error in the template and restarted the validation. https://wwwdev.ebi.ac.uk/gwas/deposition/submission/667a98142b4603000160eeef

karatugo commented 4 months ago

To discuss tomorrow:

line_number  chromosome  base_pair_location effect_allele other_allele     beta  standard_error  effect_allele_frequency  p_value                    ID  INFO     n TEST  CHISQ      EXTRA
1164991          11            12006354             C       CTCATT  3.31409        0.918649                 0.001612      NaN  11:12006354:CTCATT:C     1  9613  ADD    NaN  TEST_FAIL

Validation failed for Binary studies as there are lines that have NA p-values. How to handle them? I've seen at least 3 files fail for the same reason.

eks-ebi commented 4 months ago

suggest checking how many rows are affected, and whether all of the rows with EXTRA = TEST_FAIL are also the only rows with p_value = NaN

karatugo commented 4 months ago

@karatugo Yue suggested restarting the formatting of the Quant studies with an increased wallclock limit, which is set here: https://github.com/EBISPOT/gwas-sumstats-tools/blob/0cde0fbe08dcca352637d48ddfe29d8f40871886/gwas_sumstats_tools/format.py#L369C1-L369C76

karatugo commented 4 months ago

For b36_SAS_combined_formatted.txt.gz: yes, there's only one row with EXTRA=TEST_FAIL.
For a04_NFE_combined_formatted.tsv.gz: we have formatting issues for this file. Need to re-format.
For a65_a69_other_spirochetal_diseases_NFE_combined_formatted.tsv.gz: there is more than one row, for example:

          chromosome  base_pair_location effect_allele other_allele     beta  standard_error  effect_allele_frequency  p_value               ID INFO       n TEST  CHISQ      EXTRA
2464756           10            68424514             C            A  10.3522         1.95600                 0.000222      NaN  10:68424514:A:C    1  458440  ADD    NaN  TEST_FAIL
13963875          13            16962399             A            T  11.3097         3.52370                 0.000028      NaN  13:16962399:T:A    1  452061  ADD    NaN  TEST_FAIL
17656265          14            27827504             C            T  11.6465         3.57650                 0.000037      NaN  14:27827504:T:C    1  458440  ADD    NaN  TEST_FAIL
23377501          16              656644             C            G  12.3018         2.58835                 0.000131      NaN    16:656644:G:C    1  458440  ADD    NaN  TEST_FAIL
23385383          16              797674             T            C  11.9224         3.61427                 0.000034      NaN    16:797674:C:T    1  458440  ADD    NaN  TEST_FAIL

I'll run another script to find the exact number and to check whether all rows with an NA p-value have TEST_FAIL.

karatugo commented 4 months ago

There are 12 rows and all rows with NA p-value have TEST_FAIL.

          chromosome  base_pair_location effect_allele other_allele     beta  standard_error  effect_allele_frequency  p_value               ID INFO       n TEST  CHISQ      EXTRA
2464756           10            68424514             C            A  10.3522         1.95600                 0.000222      NaN  10:68424514:A:C    1  458440  ADD    NaN  TEST_FAIL
13963875          13            16962399             A            T  11.3097         3.52370                 0.000028      NaN  13:16962399:T:A    1  452061  ADD    NaN  TEST_FAIL
17656265          14            27827504             C            T  11.6465         3.57650                 0.000037      NaN  14:27827504:T:C    1  458440  ADD    NaN  TEST_FAIL
23377501          16              656644             C            G  12.3018         2.58835                 0.000131      NaN    16:656644:G:C    1  458440  ADD    NaN  TEST_FAIL
23385383          16              797674             T            C  11.9224         3.61427                 0.000034      NaN    16:797674:C:T    1  458440  ADD    NaN  TEST_FAIL
41033987           1           211709313             A            T  12.2723         3.66103                 0.000029      NaN  1:211709313:T:A    1  458440  ADD    NaN  TEST_FAIL
42454826          20             3817725             G            A  11.8014         3.59597                 0.000040      NaN   20:3817725:A:G    1  458440  ADD    NaN  TEST_FAIL
63075343           4            13205321             T            C  10.7092         3.44429                 0.000036      NaN   4:13205321:C:T    1  458440  ADD    NaN  TEST_FAIL
67069635           4           131313744             C            T  13.0730         2.66293                 0.000134      NaN  4:131313744:T:C    1  458440  ADD    NaN  TEST_FAIL
67069916           4           131322541             G            A  13.0730         2.66293                 0.000134      NaN  4:131322541:A:G    1  458440  ADD    NaN  TEST_FAIL
67069918           4           131322685             C            T  20.1949         3.26687                 0.000038      NaN  4:131322685:T:C    1  458440  ADD    NaN  TEST_FAIL
67077655           4           131542946             A            G  20.4077         3.28253                 0.000029      NaN  4:131542946:G:A    1  458440  ADD    NaN  TEST_FAIL
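
A sketch of the kind of check behind these numbers, for reuse on other files. It is not the script that was actually run; it assumes a gzipped, tab-separated formatted file with `p_value` and `EXTRA` columns, that a failed test is marked by `EXTRA` being exactly `TEST_FAIL`, and that missing p-values appear on disk as one of the tokens in `MISSING_TOKENS` (the NaN shown above may be how the value was rendered when printed). All names are hypothetical.

```python
import csv
import gzip

# Assumption: spellings of a missing p-value that may occur in the file.
MISSING_TOKENS = {"", "NA", "NaN", "nan", "#NA"}


def summarise_failed_tests(path):
    """Count rows with a missing p_value, rows flagged TEST_FAIL, and their overlap."""
    missing_p = test_fail = both = 0
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            is_missing = row["p_value"] in MISSING_TOKENS
            is_failed = row["EXTRA"] == "TEST_FAIL"
            missing_p += is_missing
            test_fail += is_failed
            both += is_missing and is_failed
    return {
        "missing_p": missing_p,
        "test_fail": test_fail,
        "both": both,
        # True only when the two sets of rows are identical.
        "same_rows": missing_p == test_fail == both,
    }
```

If `same_rows` is true, dropping the rows with `EXTRA = TEST_FAIL` also removes every row that would fail p-value validation.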

karatugo commented 4 months ago

Restarted formatting of Quant and Binary studies.

karatugo commented 4 months ago

I was able to sync the corrupt files, but during the merge step I encountered more.

karatugo commented 3 months ago

Job codon-slurm.33194820: cp_ukb_aws_gwas_summary_stats_corrupt Began

karatugo commented 3 months ago

Job codon-slurm.33194820: cp_ukb_aws_gwas_summary_stats_corrupt Ended

karatugo commented 3 months ago

Job codon-slurm.33198144: merge-corrupt Began

karatugo commented 3 months ago

Job codon-slurm.33198144: merge-corrupt Ended -- I see no errors this time.

karatugo commented 3 months ago

Restarted formatting of Binary studies. Expect them in /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats/ in 2 days. Also, quant studies in /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats_quant/

karatugo commented 3 months ago

Formatting failed again for some studies because of the time limit. Restarting, this time with a 48h limit. Please find the updated code at /hps/software/users/parkinso/spot/gwas/dev/gwas-sumstats-tools or use the format-long conda env.

karatugo commented 3 months ago

After formatting, for gwas_summary_stats files, there are 4 error files.

-rw-rw-r-- 1 spotbot spot 118828 Jul 19 14:07 slurm-36124917.err - will ask submitter to upload their chr13 file again
-rw-rw-r-- 1 spotbot spot     10 Jul 19 14:08 slurm-36124646.err - FIXED
-rw-rw-r-- 1 spotbot spot 116561 Jul 19 14:42 slurm-36124844.err - will ask submitter to upload their chr5 file again
-rw-rw-r-- 1 spotbot spot     10 Jul 19 14:55 slurm-36124522.err - FIXED

Working on formatting them manually after fixing the issues with the files.

EBISPOT / goci

UKB sumstats wrangling & ingest #1267