CAST-genomics / haptools

Ancestry and haplotype aware simulation of genotypes and phenotypes for complex trait analysis
https://haptools.readthedocs.io
MIT License
18 stars 4 forks source link

feat: implement a new `validate` command #220

Open Ayimany opened 1 year ago

Ayimany commented 1 year ago

Resolves #47

Overview

This branch introduces the functionality of a new sub-command meant to validate the structure of .hap files.

Usage and Documentation

It requires no new dependencies.

It can be invoked through: haptools validate [--sorted/--not-sorted] [--genotypes <filename.pgen>] [--verbosity] <filename.hap>

Implementation Details

The code in this implementation is concentrated in the haptools/val_hapfile.py module.

The classes and functions that make up this module are the following:

HapFileIO (Class)

This class consists of a set of methods made to validate the existence and readability of the provided .hap file. It also cleans up the file prior to reading it's content.

It requires a filename and an optional logger. The filename should be a .hap file.

The main method that should be used to verify if the file exists is HapFileIO::validate_existence. It will check for the following:

click arguments already check for existence as well.

The method that should be used to filter and read the content within the .hap file is HapFileIO::lines. It takes in a filename as a Path and a boolean sorted to determine if the file should be sorted. It will:

Sorting actually occurs wether sorted is True or not. What sorted does is remove unnecessary lines based on their positions which would otherwise be necessary if sorted

Line (Class)

Stores information about a line, such as it's number and content for future use.

HapFileValidator (Class)

The validator will have to use a HapFileIO as its main source for reading the file's content. This is done through a method and not the constructor. (Which only accepts a logger)

To load the data into the validator, use HapFileValidator::extract_and_store_content which takes in the HapFileIO to be used and and an optional boolean to determine whether the file should be sorted or not.

All of the following methods in the class test for different aspects of the .hap file.

validate_version_declarations Determines if the version declaration is present, repeated or invalid

validate_column_additions Determines if the extra column declarations are correctly formatted and well-formed. If so, they are added to the list of registered extra columns

validate_columns_fulfill_minreqs Validates if all columns fulfill the minimum requirements

validate_haplotypes Validates the haplotype row format

validate_repeats Validates the repeat row format

validate_variants Validates the variant row format

store_ids Stores the IDs of each haplotype, repeat and variant for future use. Should be called before any ID validation methods

validate_variant_ids Validates the ID presence of each variant. Each needs to be unique per haplotype and not collide with chromosome IDs

validate_extra_fields Makes sure that the added extra fields conform to their addition signature

reorder_extra_fields Parses the order[H|R|V] lines and reorders the extra fields if they are valid

compare_haps_to_pvar Compares the variants in the .hap file to those in the .pvar file

is_hapfile_valid (Function)

Performs all of the possible checks available in the HapFileValidator class. Returns a boolean which is True when there are no errors or warnings

Tests

Only one case hasn't been fully tested due to OS limitations and the way they handle their file permissions. I am talking about validating whether the user has enough permissions to read the .hap file.


test_generated_haplotypes Tests the dummy .hap generated by the haptools test suite

test_with_empty_lines Tests a .hap with empty lines

test_with_out_of_header_metas_sorted Test a sorted .hap with meta lines out of the header

test_with_out_of_header_metas_unsorted Test an unsorted .hap with meta lines out of the header

test_with_10_extras_reordered Tests a .hap file with 10 extra columns

test_with_unexistent_reorders Tests a .hap with an order[H|R|V] which mentions a non-existent extra column

test_with_unexistent_fields Tests a .hap with a data line that is not an H, R or V

test_with_inadequate_version Tests a .hap with an incorrectly formatted version

test_with_no_version Tests a .hap with no present version

test_with_multiple_versions Tests a .hap with several versions present

test_with_inadequate_version_columns Tests a .hap with a version column of only 2 fields

test_with_invalid_column_addition_column_count Tests a .hap with an extra column declaration of invalid column count

test_with_invalid_column_addition_types Tests a .hap with a column addition for a type which is not H, R or V

test_with_invalid_column_addition_data_types Tests a .hap with a column addition of unrecognized data type (not s, d or .nf)

test_with_insufficient_columns Tests a .hap with insufficient mandatory columns

test_with_inconvertible_starts Tests a .hap with start positions that can't be converted to integers

test_with_inconvertible_ends Tests a .hap with end positions that can't be converted to integers

test_with_inconvertible_starts_var Tests a .hap with start positions that can't be converted to integers in variants

test_with_inconvertible_ends_var Tests a .hap with end positions that can't be converted to integers in variants

test_valhap_with_start_after_end Tests a .hap with the start position placed after the end position

test_is_directory Tests a validation command with a filename that points to a directory

test_with_variant_id_of_chromosome Tests a .hap with a variant whose ID is the same as a chromosome ID

test_with_hrid_of_chromosome Tests a .hap with a haplotype or repeat with the same ID as a chromosome

test_with_unexistent_col_in_order Tests a .hap with an order[H|R|V] field that references a non-existent extra column name

test_with_unassociated_haplotype Tests a .hap with a haplotype that does not have at least one matching repeat

test_with_unrecognizable_allele Tests a .hap with a variant whose allele is not G, C, T or A

test_with_duplicate_ids Tests a .hap with duplicate IDs for H and R fields

test_with_duplicate_vids_per_haplotype Tests a .hap with duplicate IDs for variants with the same haplotype association

test_with_excol_of_wrong_type Tests a .hap with a data line which contains an extra column of d data type but receives s

test_with_multiple_order_defs Tests a .hap with multiple order[H|R|V] of the same type

test_with_insufficient_excols_in_reorder Tests a .hap with an order[H|R|V] that does not reference all extra columns

test_with_variant_inexistent_haplotype_id Tests a .hap with with a variant that references a non-existent haplotype

test_with_missing_variant_in_pvar Tests a .hap along with a .pvar file which is missing an ID present in the .hap

test_unreadable_hapfile Passes a non-existent file to the validator


Future work

It would be wise to document the code further on.

Looking towards developing optimizations for this command would be of great help too although we should evaluate how frequently this command is to be used and how big the input files usually are in order to determine the severity of this issue.

Checklist

aryarm commented 1 year ago

mad props, @Ayimany ! This is a very well written PR. The code is super clean and easy to follow. I'm excited to try it out and will let you know once I finish reviewing

Thanks again for doing this! This will tremendously help many users of haptools