EBISPOT / gwas-sumstats-tools

gwas-sumstats-tools
https://ebispot.github.io/gwas-sumstats-tools-documentation/#/
Apache License 2.0

[Bug]: Metadata validator crashes when gwas_id is not inferred from filename #46

Open teague-23andme opened 1 week ago

teague-23andme commented 1 week ago

System information

Description of the Issue

The format command with --generate-metadata crashes when the filename doesn't contain a GCST accession, even if the metadata is otherwise valid, because it attempts to concatenate a string and None.

Creating a symlink to the file with a GCST name processes the metadata (more or less) as expected, except that it adds the GWAS Catalog IDs.

Calling the format command with a GCST filename that doesn't exist still processes and writes the metadata file.
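
One way to make this case fail early would be to check that the input path exists before any metadata is generated. The sketch below is illustrative only: `require_existing_file` is a hypothetical helper, not part of gwas-sumstats-tools.

```python
from pathlib import Path


def require_existing_file(filename) -> Path:
    """Hypothetical guard: return the path if it is a file, else raise."""
    path = Path(filename)
    # is_file() follows symlinks, so a symlink to a real file still passes.
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path
```

Calling a guard like this before the metadata step would stop the run at the missing input instead of writing generated.yaml for a file that isn't there.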

Ideally, inferring gwas_id and gwas_catalog_api shouldn't be mandatory for files that don't require them.
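
A minimal sketch of the behaviour requested above, assuming the accession is parsed from the file name with a regular expression. `extract_gcst_accession` and `infer_catalog_fields` are hypothetical names, not the library's API; the base URL is copied from the `gwas_catalog_api` values in the terminal output further down.

```python
import re
from pathlib import Path
from typing import Optional

# Base URL as it appears in the gwas_catalog_api values reported in this issue.
STUDIES_URL = "https://www.ebi.ac.uk/gwas/rest/api/studies/"


def extract_gcst_accession(filename) -> Optional[str]:
    """Hypothetical helper: first GCST accession in the file name, else None."""
    match = re.search(r"GCST\d+", Path(filename).name)
    return match.group(0) if match else None


def infer_catalog_fields(filename) -> dict:
    """Infer the catalog fields only when an accession is actually present."""
    accession = extract_gcst_accession(filename)
    if accession is None:
        # No accession: skip both fields rather than evaluating str + None.
        return {}
    return {
        "gwas_id": accession,
        "gwas_catalog_api": STUDIES_URL + accession,
    }
```

With this guard, `infer_catalog_fields("output.tsv.gz")` returns an empty dict, so fields already set in the input YAML are left alone and no TypeError is raised.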

Error Message

---------- METADATA ----------

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/jovyan/env/lib/python3.12/site-packages/gwas_sumstats_tools/cli.py:188 in ss_format        │
│                                                                                                  │
│   185 │   │   if custom_header_map else {}                                                       │
│   186 │   meta_dict = metadata_dict_from_args(args=extra_args.args) \                            │
│   187 │   │   if metadata_edit_mode else {}                                                      │
│ ❱ 188 │   format(filename=filename,                                                              │
│   189 │   │      data_outfile=data_outfile,                                                      │
│   190 │   │      minimal_to_standard=minimal_to_standard,                                        │
│   191 │   │      generate_metadata=generate_metadata,                                            │
│                                                                                                  │
│ ╭──────────────────────────────── locals ────────────────────────────────╮                       │
│ │      custom_header_map = False                                         │                       │
│ │           data_outfile = None                                          │                       │
│ │             extra_args = <click.core.Context object at 0x7f1629176b70> │                       │
│ │               filename = PosixPath('output.tsv.gz')                    │                       │
│ │      generate_metadata = True                                          │                       │
│ │             header_map = {}                                            │                       │
│ │              meta_dict = {}                                            │                       │
│ │     metadata_edit_mode = False                                         │                       │
│ │ metadata_from_gwas_cat = False                                         │                       │
│ │        metadata_infile = PosixPath('minimal.yaml')                     │                       │
│ │       metadata_outfile = PosixPath('generated.yaml')                   │                       │
│ │    minimal_to_standard = False                                         │                       │
│ ╰────────────────────────────────────────────────────────────────────────╯                       │
│                                                                                                  │
│ /home/jovyan/env/lib/python3.12/site-packages/gwas_sumstats_tools/format.py:144 in format        │
│                                                                                                  │
│   141 │   # Get metadata                                                                         │
│   142 │   if generate_metadata:                                                                  │
│   143 │   │   print("[bold]\n---------- METADATA ----------\n[/bold]")                           │
│ ❱ 144 │   │   metadata = formatter.set_metadata(                                                 │
│   145 │   │   │   from_gwas_cat=metadata_from_gwas_cat, custom_metadata=metadata_dict            │
│   146 │   │   )                                                                                  │
│   147 │   │   print(metadata)                                                                    │
│                                                                                                  │
│ ╭───────────────────────────────────────── locals ─────────────────────────────────────────╮     │
│ │           data_outfile = None                                                            │     │
│ │               filename = PosixPath('output.tsv.gz')                                      │     │
│ │              formatter = <gwas_sumstats_tools.format.Formatter object at 0x7f16289d3e60> │     │
│ │      generate_metadata = True                                                            │     │
│ │             header_map = {}                                                              │     │
│ │          metadata_dict = {}                                                              │     │
│ │ metadata_from_gwas_cat = False                                                           │     │
│ │        metadata_infile = PosixPath('minimal.yaml')                                       │     │
│ │       metadata_outfile = PosixPath('generated.yaml')                                     │     │
│ │    minimal_to_standard = False                                                           │     │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────╯     │
│                                                                                                  │
│ /home/jovyan/env/lib/python3.12/site-packages/gwas_sumstats_tools/format.py:88 in set_metadata   │
│                                                                                                  │
│    85 │   │   │   metadata object                                                                │
│    86 │   │   """                                                                                │
│    87 │   │   self.meta.from_file()                                                              │
│ ❱  88 │   │   meta_dict = get_file_metadata(                                                     │
│    89 │   │   │   in_file=self.data_infile,                                                      │
│    90 │   │   │   out_file=self.data_outfile,                                                    │
│    91 │   │   │   meta_dict=self.meta.as_dict(),                                                 │
│                                                                                                  │
│ ╭───────────────────────────────────── locals ──────────────────────────────────────╮            │
│ │ custom_metadata = {}                                                              │            │
│ │   from_gwas_cat = False                                                           │            │
│ │            self = <gwas_sumstats_tools.format.Formatter object at 0x7f16289d3e60> │            │
│ ╰───────────────────────────────────────────────────────────────────────────────────╯            │
│                                                                                                  │
│ /home/jovyan/env/lib/python3.12/site-packages/gwas_sumstats_tools/interfaces/metadata.py:186 in  │
│ get_file_metadata                                                                                │
│                                                                                                  │
│   183 │   inferred_meta_dict['genome_assembly'] = GENOME_ASSEMBLY_MAPPINGS.get(parse_genome_as   │
│   184 │   inferred_meta_dict['data_file_md5sum'] = get_md5sum(out_file) if Path(out_file).exis   │
│   185 │   inferred_meta_dict['date_last_modified'] = date.today()                                │
│ ❱ 186 │   inferred_meta_dict['gwas_catalog_api'] = GWAS_CAT_API_STUDIES_URL + parse_accession_   │
│   187 │   for field, value in inferred_meta_dict.items():                                        │
│   188 │   │   update_dict_if_not_set(meta_dict, field, value)                                    │
│   189 │   return meta_dict                                                                       │
│                                                                                                  │
│ ╭────────────────────────────────────── locals ───────────────────────────────────────╮          │
│ │            in_file = PosixPath('output.tsv.gz')                                     │          │
│ │ inferred_meta_dict = {                                                              │          │
│ │                      │   'gwas_id': None,                                           │          │
│ │                      │   'data_file_name': 'output.tsv.gz',                         │          │
│ │                      │   'file_type': 'GWAS-SFF v1.0',                              │          │
│ │                      │   'genome_assembly': 'unknown',                              │          │
│ │                      │   'data_file_md5sum': '7e29306421cfb296a5e1099f2e461390',    │          │
│ │                      │   'date_last_modified': datetime.date(2024, 11, 6)           │          │
│ │                      }                                                              │          │
│ │          meta_dict = {                                                              │          │
│ │                      │   'genotyping_technology': [                                 │          │
│ │                      │   │   'Genome-wide genotyping array'                         │          │
│ │                      │   ],                                                         │          │
│ │                      │   'gwas_id': None,                                           │          │
│ │                      │   'trait_description': None,                                 │          │
│ │                      │   'minor_allele_freq_lower_limit': None,                     │          │
│ │                      │   'data_file_name': 'output.tsv.gz',                         │          │
│ │                      │   'file_type': 'GWAS-SSF v1.0',                              │          │
│ │                      │   'data_file_md5sum': None,                                  │          │
│ │                      │   'is_harmonised': False,                                    │          │
│ │                      │   'is_sorted': False,                                        │          │
│ │                      │   'date_last_modified': datetime.date(2024, 11, 6),          │          │
│ │                      │   ... +12                                                    │          │
│ │                      }                                                              │          │
│ │           out_file = PosixPath('output.tsv.gz')                                     │          │
│ ╰─────────────────────────────────────────────────────────────────────────────────────╯          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: can only concatenate str (not "NoneType") to str

### Command used and terminal output

```console
$ gwas-ssf format empty.tsv.gz --meta-in minimal.yaml --meta-out generated.yaml --generate-metadata
...
# Crashes with the message above
TypeError: can only concatenate str (not "NoneType") to str

# However, simply calling the validator with a symlink to the same file works
$ ln -s empty.tsv.gz GCST1.tsv.gz
$ gwas-ssf format GCST1.tsv.gz --meta-in minimal.yaml --meta-out generated.yaml --generate-metadata

---------- METADATA ----------

adjusted_covariates:
- age
- sex
analysis_software: PLINK 1.9
author_notes: Example
coordinate_system: 1-based
data_file_md5sum: 05eea3e7b985d4f552fcec50c102bed8
data_file_name: output.tsv.gz
date_last_modified: 2024-11-06
file_type: GWAS-SSF v1.0
genome_assembly: GRCh37
genotyping_technology:
- Genome-wide genotyping array
gwas_catalog_api: https://www.ebi.ac.uk/gwas/rest/api/studies/GCST1
gwas_id: GCST1
harmonisation_reference: null
imputation_panel: 1000 Genomes Phase 3 (placeholder)
imputation_software: GENOTYPE
is_harmonised: false
is_sorted: false
minor_allele_freq_lower_limit: null
ontology_mapping: null
samples:
- ancestry_method:
  - self-reported
  - gentically determined
  case_control_study: false
  case_count: null
  control_count: null
  sample_ancestry: null
  sample_size: 1000
sex: combined
trait_description: null

Writing metadata --> generated.yaml

# Surprisingly, this still works even if the file doesn't actually exist
$ gwas-ssf format GCST999999999999999.tsv --meta-in minimal.yaml --meta-out generated.yaml --generate-metadata
[Errno 2] No such file or directory: 'GCST999999999999999.tsv'

---------- METADATA ----------

adjusted_covariates:
- age
- sex
analysis_software: PLINK 1.9
author_notes: Example
coordinate_system: 1-based
data_file_md5sum: null
data_file_name: output.tsv.gz
date_last_modified: 2024-11-06
file_type: GWAS-SSF v1.0
genome_assembly: GRCh37
genotyping_technology:
- Genome-wide genotyping array
gwas_catalog_api: https://www.ebi.ac.uk/gwas/rest/api/studies/GCST999999999999999
gwas_id: GCST999999999999999
harmonisation_reference: null
imputation_panel: 1000 Genomes Phase 3 (placeholder)
imputation_software: GENOTYPE
is_harmonised: false
is_sorted: false
minor_allele_freq_lower_limit: null
ontology_mapping: null
samples:
- ancestry_method:
  - self-reported
  - gentically determined
  case_control_study: false
  case_count: null
  control_count: null
  sample_ancestry: null
  sample_size: 1000
sex: combined
trait_description: null

Writing metadata --> generated.yaml
$
```

### First 10 Rows of the Input File

empty.tsv:

```text
chromosome      base_pair_location      effect_allele   other_allele    beta    standard_error  p_value variant_id      ref_allele
```

minimal.yaml:

```yaml
adjusted_covariates:
- age
- sex
analysis_software: PLINK 1.9
author_notes: Example
coordinate_system: 1-based
data_file_name: output.tsv.gz
date_last_modified: 2024-11-06
file_type: GWAS-SSF v1.0
genome_assembly: GRCh37
genotyping_technology:
- Genome-wide genotyping array
imputation_panel: 1000 Genomes Phase 3 (placeholder)
imputation_software: GENOTYPE
is_harmonised: false
is_sorted: false
samples:
- ancestry_method:
  - self-reported
  - gentically determined
  case_control_study: false
  sample_size: 1000
sex: combined
```

### Relevant files

_No response_

jiyue1214 commented 1 week ago

Hi @teague-23andme,

Thank you for using gwas-sumstats-tools and for reporting the issue you encountered.

gwas-sumstats-tools is designed to convert GWAS summary statistics that are not already in the GWAS-SSF format into that format, and to validate that an input file meets the GWAS-SSF standard before submission to the GWAS Catalog.

The metadata generation function is currently intended primarily for internal use. It retrieves metadata from our REST API or internal ingest API using the GCST ID and creates the YAML file, so a GCST number is essential for this purpose.
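
As a rough illustration of the lookup described here, the sketch below builds the studies URL for a GCST accession and parses the JSON response. The function names are hypothetical, the base URL is taken from the `gwas_catalog_api` values earlier in this thread, and nothing is assumed about the fields in the response or about how the tool itself performs the request.

```python
import json
import re
from typing import Callable
from urllib.request import urlopen

# Base URL as it appears in the gwas_catalog_api values reported in this issue.
STUDIES_URL = "https://www.ebi.ac.uk/gwas/rest/api/studies/"


def study_url(gcst_id: str) -> str:
    """Build the studies URL, rejecting anything that is not a GCST accession."""
    if not re.fullmatch(r"GCST\d+", gcst_id):
        raise ValueError(f"not a GCST accession: {gcst_id!r}")
    return STUDIES_URL + gcst_id


def _http_get(url: str) -> bytes:
    """Default transport: a plain HTTP GET using the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_study_record(gcst_id: str,
                       get: Callable[[str], bytes] = _http_get) -> dict:
    """Hypothetical lookup: fetch and parse the JSON record for one study."""
    return json.loads(get(study_url(gcst_id)))
```

Passing the transport in as `get` keeps the lookup testable without network access, and it makes explicit that a file name with no accession has nothing to look up.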

I noticed you're using version v1.0.5, one of our earlier releases, and I hope the combination of the gwas-ssf format and --generate-metadata options wasn’t confusing.

Could you share more about your specific use case? This will help us understand if the latest release of gwas-sumstats-tools could better support your needs.