ljwh2 opened 8 months ago
Started copying files to /hps/nobackup
65202370 standard goci1267 spotbot R 1:13:10 1 hl-codon-111-03
Copying complete. Now comparing MD5 checksums of the copied files to those listed in md5sums.txt.
Submitted a SLURM job to calculate and compare md5sums of files in GCP and Codon.
66126819 standard md5sum-c spotbot R 01:26 1 hl-codon-09-03
I calculated and compared the MD5 sums of files in GCP and Codon. They matched.
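For reference, a minimal sketch of this kind of check; the working directory is an assumption, and md5sums.txt is assumed to use the standard md5sum output format.

```bash
#!/bin/bash
# Sketch: verify the copied files against the checksums listed in md5sums.txt.
# Assumed location; md5sums.txt entries are expected as "<md5>  <relative/path>".
cd /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267

# md5sum -c recomputes each file's checksum and compares it to the listed value.
md5sum -c md5sums.txt > md5sum_check.log 2>&1

# Print any file that did not verify cleanly.
grep -v ': OK$' md5sum_check.log || echo "All checksums matched."
```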
Submitted two SLURM jobs to combine chromosome-specific files into a single file per GWAS (study):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
66784780 datamover merge spotbot R 2:20 1 codon-dm-05
162292 datamover merge-re spotbot R 0:24 1 codon-dm-05
Note that all .txt.gz files are in gwas_summary_stats and all .regenie.gz files are in gwas_summary_stats_quant.
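For context, a minimal sketch of what the per-study merge looks like; the chr* filename pattern and the output naming are assumptions, not the exact contents of the merge scripts.

```bash
#!/bin/bash
# Sketch: combine per-chromosome summary statistics into one file per GWAS (study).
# Assumes per-chromosome files named chr<N>_..._<study>_<ancestry>.*.gz in one study
# directory, and no spaces in filenames.
study_dir="$1"                        # e.g. gwas_summary_stats/j92
out="${study_dir%/}_combined.txt.gz"  # assumed output name

{
  # Keep the header from the first chromosome file only.
  first=$(ls "$study_dir"/chr*.gz | sort -V | head -n 1)
  zcat "$first" | head -n 1
  # Append the data rows (header stripped) from every chromosome in natural order.
  for f in $(ls "$study_dir"/chr*.gz | sort -V); do
    zcat "$f" | tail -n +2
  done
} | gzip > "$out"
```

Usage would be along the lines of `bash merge_study.sh gwas_summary_stats/j92`, run once per study directory.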
I've noticed that some file MD5 sums are missing, for example for ./gwas_summary_stats/j92/. Additionally, there are warnings indicating that these files are corrupt when I try to combine them.
I did not detect this issue earlier because the affected files lack entries in md5sums.txt. I have reached out to Annalisa about this.
I have access to S3 now.
Submitted two SLURM jobs for the data copy to Codon, namely cp_ukb_aws_gwas_summary_stats and cp_ukb_aws_gwas_summary_stats_quant, using the sbatch scripts /homes/spotbot/goci-1267/cp_from_aws_gwas_summary_stats.sh and /homes/spotbot/goci-1267/cp_from_aws_gwas_summary_stats_quant.sh
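A rough sketch of what such an sbatch copy script might look like; the bucket name, partition, and resource requests are placeholders rather than the actual script contents.

```bash
#!/bin/bash
#SBATCH --job-name=cp_ukb_aws_gwas_summary_stats
#SBATCH --partition=datamover        # assumed partition for transfer jobs
#SBATCH --time=24:00:00
#SBATCH --mem=4G
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

# Sketch only: sync one S3 prefix to scratch on Codon.
# <bucket> and the prefix are placeholders for the real locations.
aws s3 sync "s3://<bucket>/gwas_summary_stats/" \
    /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/
```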
Every file in our directory matches exactly with the files listed in md5sums.txt, and vice versa. See the script at /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/compare_md5sums.sh
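Not the script itself, but a sketch of the kind of two-way check it performs; the directory layout and the md5sums.txt column format are assumptions.

```bash
#!/bin/bash
# Sketch: confirm the set of files on disk and the set listed in md5sums.txt
# match in both directions (nothing missing, nothing extra).
cd /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws   # assumed location

# Paths listed in md5sums.txt (second column), sorted.
awk '{print $2}' md5sums.txt | sort > listed.txt
# Paths actually present on disk, sorted (relative paths assumed to match the list).
find gwas_summary_stats gwas_summary_stats_quant -type f -name '*.gz' | sort > on_disk.txt

echo "Listed in md5sums.txt but missing on disk:"
comm -23 listed.txt on_disk.txt
echo "Present on disk but missing from md5sums.txt:"
comm -13 listed.txt on_disk.txt
```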
Submitted batch job 16272725 to compute and compare md5sums of the copied files.
The md5sum computation and comparison of the copied files is done; the values matched.
Submitted batch job 16375825 to backup copied files.
Backup of the copied files is complete.
Submitted batch jobs 16648847 and 16648861 to combine chromosome-specific files into a single file per GWAS (study).
16648861 - merge-regenie complete.
For regenie studies:
No files found for blood_biochemistry_oest_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_oest_0 with ancestry EAS. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry AFR. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry SAS. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_rhaf_0 with ancestry EAS. Skipping...
No files found for blood_biochemistry_urma_0 with ancestry ASJ. Skipping...
No files found for blood_biochemistry_urma_0 with ancestry EAS. Skipping...
Submitted a SLURM job that runs gwas-ssf format on the regenie files. Expect them here in 2 days: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted/gwas_summary_stats_quant/
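A rough sketch of how the per-file formatting jobs could be submitted; the exact gwas-ssf format options depend on the installed gwas-sumstats-tools version, so the flags, paths, and output naming below are assumptions.

```bash
#!/bin/bash
# Sketch: submit one formatting job per combined regenie file.
# The gwas-ssf options are assumptions; check `gwas-ssf format --help` for the
# exact interface of the installed gwas-sumstats-tools version.
in_dir=/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats_quant
out_dir=/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted/gwas_summary_stats_quant

for f in "$in_dir"/*_combined.regenie.gz; do
  name=$(basename "$f" .regenie.gz)
  sbatch --job-name="format_${name}" --time=24:00:00 --mem=8G \
    --wrap="gwas-ssf format '$f' -o '$out_dir/${name}_formatted.regenie.gz'"
done
```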
16648847: merge_gwas_summary_stats Ended
Attached is the list of skipped files for gwas_summary_stats.
Unfortunately, the disk quota was exceeded for spot/gwas/scratch/ and some of the merge operations failed. I'll talk to the Codon team about how best to navigate this issue.
Talked to the Codon team: use lts (and talk to the Storage team if we don't already have space). Tested the validation with the file blood_biochemistry_ua_0_EAS_combined_formatted.regenie.gz; it worked okay.
The merge job for gwas_summary_stats has resumed.
Submitted batch job 22796315
Done: the SLURM job for gathering data (e.g. md5sums of the files, variant counts, etc.) for the metadata template has finished.
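For reference, a minimal sketch of the per-file stats that job collects (md5sum and variant count); the exact output format of the real job is an assumption.

```bash
#!/bin/bash
# Sketch: collect md5sum and variant count per formatted file for the metadata template.
dir=/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted/gwas_summary_stats_quant

printf "file\tmd5\tvariant_count\n"
for f in "$dir"/*.gz; do
  md5=$(md5sum "$f" | awk '{print $1}')
  # Variant count = data rows, i.e. total decompressed lines minus the header line.
  n=$(( $(zcat "$f" | wc -l) - 1 ))
  printf "%s\t%s\t%s\n" "$(basename "$f")" "$md5" "$n"
done
```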
Submitted a SLURM job for copying Quant files to the private ftp for test submission in sandbox.
23899637 datamover cp-ukbb-quant-private-ftp spotbot R 2:57:34 1 codon-dm-04
Submission template is now ready. Lizzy updated it and fixed the errors.
Possible errors found during the merge of Binary studies:
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/n34/chr9_first_occurrence_n34_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/n34/chr9_first_occurrence_n34_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr18_first_occurrence_o35_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr18_first_occurrence_o35_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr21_first_occurrence_o35_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o35/chr21_first_occurrence_o35_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o91/chr20_first_occurrence_o91_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/o91/chr20_first_occurrence_o91_NFE.txt.gz: invalid compressed data--length error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/p90_p96_other_disorders_originating_in_the_perinatal_period/chr7_first_occurrence_p90_p96_other_disorders_originating_in_the_perinatal_period_NFE.txt.gz: invalid compressed data--crc error
gzip: /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats/p90_p96_other_disorders_originating_in_the_perinatal_period/chr7_first_occurrence_p90_p96_other_disorders_originating_in_the_perinatal_period_NFE.txt.gz: invalid compressed data--length error
The md5sum values of the files matched the values in the list, which means the files are likely already corrupted at the source.
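A quick way to enumerate every corrupt archive before re-requesting or re-merging, as a sketch:

```bash
#!/bin/bash
# Sketch: test the integrity of every .txt.gz file and list the corrupt ones.
base=/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats

find "$base" -type f -name '*.txt.gz' | while read -r f; do
  # gzip -t decompresses without writing output and reports CRC/length problems.
  if ! gzip -t "$f" 2>/dev/null; then
    echo "CORRUPT: $f"
  fi
done
```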
Submitted format jobs in SLURM using the gwas-ssf format command for the Binary studies.
23949330 short temp_sbatch_script.sh spotbot R 2:30 1 hl-codon-130-01
23949331 short temp_sbatch_script.sh spotbot R 2:30 1 hl-codon-bm-10
23949332 short temp_sbatch_script.sh spotbot R 2:30 1 hl-codon-bm-10
23949333 short temp_sbatch_script.sh spotbot R 2:30 1 hl-codon-bm-10
Job codon-slurm.23972158: compare-md5sum-quant-private-ftp Began
Job codon-slurm.23972158: compare-md5sum-quant-private-ftp complete. No issues with md5sum values for Quant files in the private ftp and aws/formatted/gwas_summary_stats_quant.
Submitted the test submission to Sandbox but an error popped up.
Error:
Jun-24 14:15:48.996 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'validate_study (39)'
Caused by:
Process `validate_study (39)` terminated with an error exit status (1)
Command executed:
validate-study -cid 6ny4ZfSZ -id 66797b6a2b4603000160cf9e -payload /hps/nobackup/parkinso/spot/gwas/data/sumstats/depo/dev/validate/6ny4ZfSZ/payload.json -storepath /hps/nobackup/parkinso/spot/gwas/data/sumstats/
depo/dev/store -minrows None -zero_p True -forcevalid false -out "66797b6a2b4603000160cf9e.json" -validated_path /hps/nobackup/parkinso/spot/gwas/data/sumstats/depo/dev/validate
Command exit status:
1
Command output:
Validating extension
--> Ok
Validating column order
--> Ok
Validating minimum row count
--> Ok
Validating the first 100000 rows
--> Ok
Validating the rest of the file
Command error:
(ERROR): Logging setup failed: [Errno 2] No such file or directory: '/var/log/gunicorn/sumstats-error.log'
Traceback (most recent call last):
File "/usr/local/bin/validate-study", line 33, in <module>
sys.exit(load_entry_point('gwas-sumstats-service', 'console_scripts', 'validate-study')())
File "/sumstats_service/sumstats_service/resources/validate_study.py", line 238, in main
validate_study(
File "/sumstats_service/sumstats_service/resources/validate_study.py", line 55, in validate_study
study.validate_study(
File "/sumstats_service/sumstats_service/resources/study_service.py", line 256, in validate_study
ssf.validate_file() if forcevalid is False else True
File "/sumstats_service/sumstats_service/resources/file_handler.py", line 181, in validate_file
status, message = validator.validate()
File "/usr/local/lib/python3.9/site-packages/gwas_sumstats_tools/validate.py", line 71, in validate
for df in df_iter:
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
return self.get_chunk()
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
return self.read(nrows=size)
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/usr/local/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 820, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1965, in pandas._libs.parsers.raise_parser_error
File "/usr/local/lib/python3.9/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/local/lib/python3.9/gzip.py", line 506, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
Work dir:
/hps/nobackup/parkinso/spot/gwas/data/sumstats/depo/dev/validate/6ny4ZfSZ/68/de98e4317c4a7fad4a461e98fb47bf
Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
Jun-24 14:15:49.003 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Process `validate_study (39)` terminated with an error exit status (1)
for the file
{"id": "66797b6a2b4603000160cf9e", "filePath": "blood_biochemistry_ca_0_NFE_combined_formatted.regenie.gz", "md5": "03968288dc8eac3f20af4b03459b6bd1", "assembly": "GRCh38", "readme": "pre-print describing the study here: https://www.medrxiv.org/content/10.1101/2023.12.06.23299426v1", "entryUUID": "52cf68b0-b90b-4fd2-aa62-18ee8f01d197", "analysisSoftware": "regenie_v3.2.5"},
It seems like an error on our end. I'm running the formatter for the erroneous file again.
Job codon-slurm.24489700: Began for Binary studies metadata template data wrangling.
Job codon-slurm.24625625: cp-ukbb-binary-private-ftp Began
Fixed the error here (https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/gh/ebispot/goci/1267#issuecomment-2186844923) by reformatting the file and submitted test submission to Sandbox again.
Same error (https://app.zenhub.com/workspaces/gwas-59df823c4a6feb3786810391/issues/gh/ebispot/goci/1267#issuecomment-2188608204) with a different file. I suspect something went wrong during the formatting step (perhaps the wallclock limit?), so I asked @jiyue1214 for help.
Test submission made for Binary studies in Sandbox.
Update: Fixed the samples error in the template and restarted the validation. https://wwwdev.ebi.ac.uk/gwas/deposition/submission/667a98142b4603000160eeef
To discuss tomorrow:
line_number chromosome base_pair_location effect_allele other_allele beta standard_error effect_allele_frequency p_value ID INFO n TEST CHISQ EXTRA
1164991 11 12006354 C CTCATT 3.31409 0.918649 0.001612 NaN 11:12006354:CTCATT:C 1 9613 ADD NaN TEST_FAIL
Validation failed for Binary studies because some lines have NA p-values. How should we handle them? I've seen at least 3 files fail for the same reason.
Suggest checking how many rows are affected, and whether the rows with EXTRA = TEST_FAIL are also the only rows with p_value = NaN.
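A quick sketch of that check for a single file, assuming tab-separated columns named as in the header above (adjust the field separator for space-delimited files):

```bash
#!/bin/bash
# Sketch: count rows with p_value = NaN and check whether they are exactly
# the rows flagged with EXTRA = TEST_FAIL.
f="$1"   # e.g. one *_combined_formatted file (.gz)

zcat "$f" | awk -F'\t' '
NR == 1 {
  # Locate the p_value and EXTRA columns by name instead of fixed positions.
  for (i = 1; i <= NF; i++) {
    if ($i == "p_value") p = i
    if ($i == "EXTRA")   e = i
  }
  next
}
{
  nan_row  = ($p == "NaN" || $p == "NA")
  fail_row = ($e == "TEST_FAIL")
  if (nan_row)              nan++
  if (fail_row)             fail++
  if (nan_row && !fail_row) nan_without_fail++
}
END {
  printf "rows with NaN/NA p_value: %d\n", nan
  printf "rows with EXTRA=TEST_FAIL: %d\n", fail
  printf "NaN p_value rows NOT flagged TEST_FAIL: %d\n", nan_without_fail
}'
```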
@karatugo Yue suggested restarting the formatting of quant studies by increasing the wallclock here: https://github.com/EBISPOT/gwas-sumstats-tools/blob/0cde0fbe08dcca352637d48ddfe29d8f40871886/gwas_sumstats_tools/format.py#L369C1-L369C76
b36_SAS_combined_formatted.txt.gz: yes, there's only one row with EXTRA=TEST_FAIL.
a04_NFE_combined_formatted.tsv.gz: we have formatting issues for this file; need to re-format.
a65_a69_other_spirochetal_diseases_NFE_combined_formatted.tsv.gz: there is more than one row, for example:
line_number chromosome base_pair_location effect_allele other_allele beta standard_error effect_allele_frequency p_value ID INFO n TEST CHISQ EXTRA
2464756 10 68424514 C A 10.3522 1.95600 0.000222 NaN 10:68424514:A:C 1 458440 ADD NaN TEST_FAIL
13963875 13 16962399 A T 11.3097 3.52370 0.000028 NaN 13:16962399:T:A 1 452061 ADD NaN TEST_FAIL
17656265 14 27827504 C T 11.6465 3.57650 0.000037 NaN 14:27827504:T:C 1 458440 ADD NaN TEST_FAIL
23377501 16 656644 C G 12.3018 2.58835 0.000131 NaN 16:656644:G:C 1 458440 ADD NaN TEST_FAIL
23385383 16 797674 T C 11.9224 3.61427 0.000034 NaN 16:797674:C:T 1 458440 ADD NaN TEST_FAIL
I'll run another script to find the exact number and to check whether all rows with an NA p-value have TEST_FAIL.
There are 12 such rows, and all rows with an NA p-value have TEST_FAIL.
line_number chromosome base_pair_location effect_allele other_allele beta standard_error effect_allele_frequency p_value ID INFO n TEST CHISQ EXTRA
2464756 10 68424514 C A 10.3522 1.95600 0.000222 NaN 10:68424514:A:C 1 458440 ADD NaN TEST_FAIL
13963875 13 16962399 A T 11.3097 3.52370 0.000028 NaN 13:16962399:T:A 1 452061 ADD NaN TEST_FAIL
17656265 14 27827504 C T 11.6465 3.57650 0.000037 NaN 14:27827504:T:C 1 458440 ADD NaN TEST_FAIL
23377501 16 656644 C G 12.3018 2.58835 0.000131 NaN 16:656644:G:C 1 458440 ADD NaN TEST_FAIL
23385383 16 797674 T C 11.9224 3.61427 0.000034 NaN 16:797674:C:T 1 458440 ADD NaN TEST_FAIL
41033987 1 211709313 A T 12.2723 3.66103 0.000029 NaN 1:211709313:T:A 1 458440 ADD NaN TEST_FAIL
42454826 20 3817725 G A 11.8014 3.59597 0.000040 NaN 20:3817725:A:G 1 458440 ADD NaN TEST_FAIL
63075343 4 13205321 T C 10.7092 3.44429 0.000036 NaN 4:13205321:C:T 1 458440 ADD NaN TEST_FAIL
67069635 4 131313744 C T 13.0730 2.66293 0.000134 NaN 4:131313744:T:C 1 458440 ADD NaN TEST_FAIL
67069916 4 131322541 G A 13.0730 2.66293 0.000134 NaN 4:131322541:A:G 1 458440 ADD NaN TEST_FAIL
67069918 4 131322685 C T 20.1949 3.26687 0.000038 NaN 4:131322685:T:C 1 458440 ADD NaN TEST_FAIL
67077655 4 131542946 A G 20.4077 3.28253 0.000029 NaN 4:131542946:G:A 1 458440 ADD NaN TEST_FAIL
Restarted formatting of Quant and Binary studies.
I was able to sync the corrupt files, but during the merge step I encountered more.
Job codon-slurm.33194820: cp_ukb_aws_gwas_summary_stats_corrupt Began
Job codon-slurm.33194820: cp_ukb_aws_gwas_summary_stats_corrupt Ended
Job codon-slurm.33198144: merge-corrupt Began
Job codon-slurm.33198144: merge-corrupt Ended -- I see no errors this time.
Restarted formatting of Binary studies. Expect them in /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats/ in 2 days. Also, quant studies in /hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats_quant/
Some studies' formatting failed again due to the time limit. Restarting, this time with 48h. Please find the updated code at /hps/software/users/parkinso/spot/gwas/dev/gwas-sumstats-tools or use the format-long conda env.
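For reference, the resubmission mainly just needs a longer wallclock; a sketch below, where the gwas-ssf invocation and output naming are assumptions and the relevant change is --time=48:00:00.

```bash
#!/bin/bash
# Sketch: resubmit one formatting job with a 48 h wallclock instead of the default limit.
f="$1"   # path to one combined summary-statistics file

# Output name is illustrative only.
sbatch --job-name="format_long_$(basename "$f")" --time=48:00:00 --mem=8G \
  --wrap="gwas-ssf format '$f' -o '${f%%.*}_formatted.tsv.gz'"
```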
After formatting, there are 4 error files for gwas_summary_stats.
-rw-rw-r-- 1 spotbot spot 118828 Jul 19 14:07 slurm-36124917.err - will ask submitter to upload their chr13 file again
-rw-rw-r-- 1 spotbot spot 10 Jul 19 14:08 slurm-36124646.err - FIXED
-rw-rw-r-- 1 spotbot spot 116561 Jul 19 14:42 slurm-36124844.err - will ask submitter to upload their chr5 file again
-rw-rw-r-- 1 spotbot spot 10 Jul 19 14:55 slurm-36124522.err - FIXED
Working on formatting them manually after fixing the issues with the files.
Data associated with this project https://www.medrxiv.org/content/10.1101/2023.12.06.23299426v1 has been shared with Open Targets and needs ingesting into the GWAS Catalog. The data is presented in separate files for each chromosome; it looks like ~35M variants per GWAS.
/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats_quant/
/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/gwas_summary_stats/
/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/formatted_long/
TEST_FAIL in the EXTRA column - SLURM jobs done, but there are errors, e.g. files with some chrs missing: m06_AFR, m54_SAS, m77_ASJ
aws/formatted_long/
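A sketch for spotting the study/ancestry combinations with missing chromosome files before re-running the merge; the filename pattern and ancestry list are taken from the logs above, and both should be treated as assumptions.

```bash
#!/bin/bash
# Sketch: report study/ancestry combinations whose directories are missing chromosome files,
# to catch cases like m06_AFR, m54_SAS, m77_ASJ before merging.
base=/hps/nobackup/parkinso/spot/gwas/scratch/goci-1267/aws/gwas_summary_stats

for d in "$base"/*/; do
  study=$(basename "$d")
  for anc in AFR ASJ EAS NFE SAS; do
    # Count distinct chr<N> prefixes among this ancestry's files in the study directory.
    n=$(ls "$d" 2>/dev/null | grep "_${anc}\." | grep -oE '^chr[0-9XY]+' | sort -u | wc -l)
    # Expect at least the 22 autosomes; flag anything with fewer (but more than zero) files.
    if [ "$n" -gt 0 ] && [ "$n" -lt 22 ]; then
      echo "${study}_${anc}: only $n chromosomes present"
    fi
  done
done
```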
@earlEBI can provide support in interpreting the template and especially with the template wrangling and submission steps.