broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0

Better AnVIL VCF validation #2720

Closed hanars closed 1 year ago

hanars commented 2 years ago

Add genome build, sample type, and chromosome validation at loading request time, instead of having it fail in the pipeline itself.

May need to wait for the hail backend / a better ability to quickly run hail in seqr.

ShifaSZ commented 2 years ago

It seems to me that there is no genome build or sample type information in a VCF file. It does have a chromosome for each variant, but we don't check the chromosome (or contig) while loading.

hanars commented 2 years ago

There is code in the loading pipeline that does sample type and genome build validation. I was hoping we could abstract out that code and have it run directly in seqr, but we need to do some investigation to see if it is performant enough to run in under 2 minutes, which it may not be. Per the ticket description:

May need to wait for the hail backend / a better ability to quickly run hail in seqr.

ShifaSZ commented 2 years ago

I ran the VCF validation in a Jupyter notebook on my MacBook; it took almost 2 minutes. It could be faster on the seqr server, but it won't be especially quick once frontend/backend communication time is added. (screenshot attached)

The validation code is below.

import logging

import hail as hl

logger = logging.getLogger(__name__)

GRCh37_STANDARD_CONTIGS = {'1','10','11','12','13','14','15','16','17','18','19','2','20','21','22','3','4','5','6','7','8','9','X','Y', 'MT'}
GRCh38_STANDARD_CONTIGS = {'chr1','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chr2','chr20','chr21','chr22','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chrX','chrY', 'chrM'}
OPTIONAL_CHROMOSOMES = ['MT', 'chrM', 'Y', 'chrY']
VARIANT_THRESHOLD = 100
CONST_GRCh37 = '37'
CONST_GRCh38 = '38'
GLOBAL_CONFIG = {
    'validation_37_noncoding_ht': 'gs://seqr-reference-data/GRCh37/validate_ht/common_noncoding_variants.grch37.ht',
    'validation_37_coding_ht': 'gs://seqr-reference-data/GRCh37/validate_ht/common_coding_variants.grch37.ht',
    'validation_38_noncoding_ht': 'gs://seqr-reference-data/GRCh38/validate_ht/common_noncoding_variants.grch38.ht',
    'validation_38_coding_ht': 'gs://seqr-reference-data/GRCh38/validate_ht/common_coding_variants.grch38.ht',
}

class SeqrValidationError(Exception):
    pass

def import_vcf(genome_version, source_paths):
    # Import the VCFs from inputs. Set min partitions so that local pipeline execution takes advantage of all CPUs.
    recode = {}
    if genome_version == "38":
        recode = {f"{i}": f"chr{i}" for i in (list(range(1, 23)) + ['X', 'Y'])}
    elif genome_version == "37":
        recode = {f"chr{i}": f"{i}" for i in (list(range(1, 23)) + ['X', 'Y'])}

    return hl.import_vcf([vcf_file for vcf_file in source_paths],
                             reference_genome='GRCh' + genome_version,
                             skip_invalid_loci=True,
                             contig_recoding=recode,
                             force_bgz=True, min_partitions=500)

def get_sample_type_stats(mt, genome_version, threshold=0.3):
    """
    Calculate stats for sample type by checking against a list of common coding and non-coding variants.
    If the match for each respective type is over the threshold, we return a match.

    :param mt: Matrix Table to check
    :param genome_version: reference genome version
    :param threshold: if the matched percentage is over this threshold, we classify as match
    :return: a dict of coding/non-coding to dict with 'matched_count', 'total_count' and 'match' boolean.
    """
    stats = {}
    types_to_ht_path = {
        'noncoding': GLOBAL_CONFIG['validation_%s_noncoding_ht' % genome_version],
        'coding': GLOBAL_CONFIG['validation_%s_coding_ht' % genome_version]
    }
    for sample_type, ht_path in types_to_ht_path.items():
        ht = hl.read_table(ht_path)
        stats[sample_type] = ht_stats = {
            'matched_count': mt.semi_join_rows(ht).count_rows(),
            'total_count': ht.count(),
        }
        ht_stats['match'] = (ht_stats['matched_count'] / ht_stats['total_count']) >= threshold
    return stats

def contig_check(mt, standard_contigs, threshold):
    check_result_dict = {}

    # check chromosomes that are not in the VCF  
    row_dict = mt.aggregate_rows(hl.agg.counter(mt.locus.contig))
    contigs_set = set(row_dict.keys())

    all_missing_contigs = standard_contigs - contigs_set
    missing_contigs_without_optional = [contig for contig in all_missing_contigs if contig not in OPTIONAL_CHROMOSOMES]

    if missing_contigs_without_optional:
        check_result_dict['Missing contig(s)'] = missing_contigs_without_optional
        logger.warning('Missing the following chromosome(s): {}'.format(', '.join(missing_contigs_without_optional)))

    for k,v in row_dict.items():
        if k not in standard_contigs:
            check_result_dict.setdefault('Unexpected chromosome(s)',[]).append(k)
            logger.warning('Chromosome %s is unexpected.', k)
        elif (k not in OPTIONAL_CHROMOSOMES) and (v < threshold):
            check_result_dict.setdefault(f'Chromosome(s) whose variants count under threshold {threshold}',[]).append(k)
            logger.warning('Chromosome %s has %d rows, which is lower than threshold %d.', k, v, threshold)

    return check_result_dict

def validate_mt(mt, genome_version, sample_type):
    """
    Validate the mt by checking against a list of common coding and non-coding variants given its
    genome version. This validates genome_version, variants, and the reported sample type.

    :param mt: mt to validate
    :param genome_version: reference genome version
    :param sample_type: WGS or WES
    :return: True or Exception
    """
    if genome_version == CONST_GRCh37:
        contig_check_result = contig_check(mt, GRCh37_STANDARD_CONTIGS, VARIANT_THRESHOLD)
    elif genome_version == CONST_GRCh38:
        contig_check_result = contig_check(mt, GRCh38_STANDARD_CONTIGS, VARIANT_THRESHOLD)

    if bool(contig_check_result):
        err_msg = ''
        for k,v in contig_check_result.items():
            err_msg += '{k}: {v}. '.format(k=k, v=', '.join(v))
        # raise SeqrValidationError(err_msg)
        print(err_msg)

    sample_type_stats = get_sample_type_stats(mt, genome_version)

    for name, stat in sample_type_stats.items():
        logger.info('Table contains %i out of %i common %s variants.' %
                    (stat['matched_count'], stat['total_count'], name))

    has_coding = sample_type_stats['coding']['match']
    has_noncoding = sample_type_stats['noncoding']['match']

    if not has_coding and not has_noncoding:
        # No common variants detected.
        # raise SeqrValidationError(
        print(
            'Genome version validation error: dataset specified as GRCh{genome_version} but doesn\'t contain '
            'the expected number of common GRCh{genome_version} variants'.format(genome_version=genome_version)
        )
    elif has_noncoding and not has_coding:
        # Non coding only.
        # raise SeqrValidationError(
        print(
            'Sample type validation error: Dataset contains noncoding variants but is missing common coding '
            'variants for GRCh{}. Please verify that the dataset contains coding variants.' .format(genome_version)
        )
    elif has_coding and not has_noncoding:
        # Only coding should be WES.
        if sample_type != 'WES':
            # raise SeqrValidationError(
            print(
                'Sample type validation error: dataset sample-type is specified as {} but appears to be '
                'WES because it contains many common coding variants'.format(sample_type)
            )
    elif has_noncoding and has_coding:
        # Both should be WGS.
        if sample_type != 'WGS':
            # raise SeqrValidationError(
            print(
                'Sample type validation error: dataset sample-type is specified as {} but appears to be '
                'WGS because it contains many common non-coding variants'.format(sample_type)
            )
    return True
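
For reference, a minimal driver sketch for timing the functions above. The GCS path, genome version, and sample type here are hypothetical placeholders rather than the dataset actually tested, and the hl.init() / datetime timing wrapper is an assumption, not taken from the real run.

import datetime

import hail as hl

hl.init()

# Hypothetical input path; the real callset paths were not shared in this thread.
source_paths = ['gs://my-bucket/my-callset.vcf.bgz']

start_time = datetime.datetime.now()
mt = import_vcf(CONST_GRCh38, source_paths)
validate_mt(mt, CONST_GRCh38, sample_type='WGS')
print('validation time (including contig check): {}'.format(datetime.datetime.now() - start_time))
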
hanars commented 2 years ago

What's the runtime if we don't do the contig check and only run the sample_type_stats checks for build and sample type validation?

ShifaSZ commented 2 years ago

Runs on my local computer are much slower today (3 minutes without the contig check). I'll try it on Google Cloud.

hanars commented 2 years ago

Can you run it both with and without the contig check on Google Cloud, to make it easier to compare?

ShifaSZ commented 2 years ago

The results on a dataproc cluster are below.

validation time (including contig check): 0:00:41.325795
validation time (not including contig check): 0:00:14.798183

They look good.

hanars commented 2 years ago

Yeah, those are very reasonable times. Based on that, we will want to take this validation live in seqr, but first we will need to get hail properly set up in the seqr deployment. I'm going to mark this ticket as blocked on the hail ticket, and we should hold off on working on this until that is done. Thanks for all your hard work looking into this!

ShifaSZ commented 2 years ago

The complete output from the dataproc run is below. Are the parameter settings okay?

python hail_scripts/validate_vcf/run_dataproc_validate_vcf.py
Cost: $0.95/h + $0.10 preemptible/h = $1.0472000000000001 / hour
gcloud beta dataproc clusters create vcf-validation         --region=us-central1         --max-idle=30m         --master-machine-type=n1-highmem-8          --master-boot-disk-size=100GB         --num-workers=2         --num-secondary-workers=1         --secondary-worker-boot-disk-size=40GB         --worker-machine-type=n1-highmem-8         --worker-boot-disk-size=40GB         --image-version=2.0.29-debian10         --metadata=WHEEL=gs://hail-common/hailctl/dataproc/0.2.85/hail-0.2.85-py3-none-any.whl,PKGS=aiohttp==3.7.4\|aiohttp_session==2.7.0\|asyncinit==0.2.4\|avro==1.10.2\|bokeh==1.4.0\|boto3==1.21.28\|botocore==1.24.28\|decorator==4.4.2\|Deprecated==1.2.12\|dill==0.3.3\|gcsfs==2021.11.1\|google-auth==1.27.0\|google-cloud-storage==1.25.0\|humanize==1.0.0\|hurry.filesize==0.9\|janus==0.6.2\|nest_asyncio==1.5.4\|numpy==1.20.1\|orjson==3.6.4\|pandas==1.3.5\|parsimonious==0.8.1\|plotly==5.5.0\|PyJWT\|python-json-logger==0.1.11\|requests==2.25.1\|scipy==1.6.1\|sortedcontainers==2.1.0\|tabulate==0.8.3\|tqdm==4.42.1\|uvloop==0.16.0\|luigi\|google-api-python-client\|httplib2==0.19.1\|pyparsing==2.4.7         --properties=dataproc:dataproc.cluster-ttl.consider-yarn-activity=false,spark:spark.driver.memory=41g,spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1         --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.85/init_notebook.py

ERROR: (gcloud.beta.dataproc.clusters.create) ALREADY_EXISTS: Already exists: Failed to create cluster: Cluster projects/seqr-project/regions/us-central1/clusters/vcf-validation
updating: hail_scripts/ (stored 0%)
updating: hail_scripts/.DS_Store (deflated 96%)
updating: hail_scripts/__init__.py (stored 0%)
updating: hail_scripts/utils/ (stored 0%)
updating: hail_scripts/utils/clinvar.py (deflated 59%)
updating: hail_scripts/utils/hail_utils.py (deflated 67%)
updating: hail_scripts/utils/__init__.py (stored 0%)
updating: hail_scripts/utils/__pycache__/ (stored 0%)
updating: hail_scripts/utils/__pycache__/__init__.cpython-38.pyc (deflated 22%)
updating: hail_scripts/utils/__pycache__/clinvar.cpython-37.pyc (deflated 40%)
updating: hail_scripts/utils/__pycache__/hail_utils.cpython-38.pyc (deflated 48%)
updating: hail_scripts/utils/__pycache__/hail_utils.cpython-37.pyc (deflated 47%)
updating: hail_scripts/utils/__pycache__/__init__.cpython-37.pyc (deflated 20%)
updating: hail_scripts/utils/__pycache__/clinvar.cpython-38.pyc (deflated 41%)
updating: hail_scripts/shared/ (stored 0%)
updating: hail_scripts/shared/__pycache__/ (stored 0%)
updating: hail_scripts/shared/__pycache__/elasticsearch_client_v7.cpython-37.pyc (deflated 57%)
updating: hail_scripts/shared/__pycache__/elasticsearch_utils.cpython-37.pyc (deflated 39%)
updating: hail_scripts/shared/__pycache__/__init__.cpython-37.pyc (deflated 22%)
updating: hail_scripts/__pycache__/ (stored 0%)
updating: hail_scripts/__pycache__/__init__.cpython-38.pyc (deflated 24%)
updating: hail_scripts/__pycache__/__init__.cpython-37.pyc (deflated 22%)
updating: hail_scripts/validate_vcf/ (stored 0%)
updating: hail_scripts/validate_vcf/run_dataproc_validate_vcf.py (deflated 48%)
updating: hail_scripts/validate_vcf/__init__.py (stored 0%)
updating: hail_scripts/update_models/ (stored 0%)
updating: hail_scripts/update_models/update_mt_schema.py (deflated 71%)
updating: hail_scripts/update_models/__init__.py (stored 0%)
updating: hail_scripts/v02/ (stored 0%)
updating: hail_scripts/v02/utils/ (stored 0%)
updating: hail_scripts/v02/utils/__pycache__/ (stored 0%)
updating: hail_scripts/v02/utils/__pycache__/elasticsearch_client.cpython-37.pyc (deflated 57%)
updating: hail_scripts/v02/utils/__pycache__/elasticsearch_utils.cpython-37.pyc (deflated 51%)
updating: hail_scripts/v02/utils/__pycache__/__init__.cpython-37.pyc (deflated 20%)
updating: hail_scripts/v02/__pycache__/ (stored 0%)
updating: hail_scripts/v02/__pycache__/__init__.cpython-37.pyc (deflated 21%)
updating: hail_scripts/computed_fields/ (stored 0%)
updating: hail_scripts/computed_fields/test_variant_id.py (deflated 59%)
updating: hail_scripts/computed_fields/flags.py (deflated 82%)
updating: hail_scripts/computed_fields/__init__.py (deflated 38%)
updating: hail_scripts/computed_fields/vep.py (deflated 75%)
updating: hail_scripts/computed_fields/__pycache__/ (stored 0%)
updating: hail_scripts/computed_fields/__pycache__/__init__.cpython-38.pyc (deflated 21%)
updating: hail_scripts/computed_fields/__pycache__/flags.cpython-37.pyc (deflated 76%)
updating: hail_scripts/computed_fields/__pycache__/variant_id.cpython-38.pyc (deflated 52%)
updating: hail_scripts/computed_fields/__pycache__/vep.cpython-38.pyc (deflated 60%)
updating: hail_scripts/computed_fields/__pycache__/variant_id.cpython-37.pyc (deflated 53%)
updating: hail_scripts/computed_fields/__pycache__/vep.cpython-37.pyc (deflated 60%)
updating: hail_scripts/computed_fields/__pycache__/flags.cpython-38.pyc (deflated 75%)
updating: hail_scripts/computed_fields/__pycache__/__init__.cpython-37.pyc (deflated 19%)
updating: hail_scripts/computed_fields/test_flags.py (deflated 87%)
updating: hail_scripts/computed_fields/variant_id.py (deflated 69%)
updating: hail_scripts/elasticsearch/ (stored 0%)
updating: hail_scripts/elasticsearch/elasticsearch_client_v7.py (deflated 71%)
updating: hail_scripts/elasticsearch/__init__.py (stored 0%)
updating: hail_scripts/elasticsearch/__pycache__/ (stored 0%)
updating: hail_scripts/elasticsearch/__pycache__/__init__.cpython-38.pyc (deflated 27%)
updating: hail_scripts/elasticsearch/__pycache__/hail_elasticsearch_client.cpython-37.pyc (deflated 57%)
updating: hail_scripts/elasticsearch/__pycache__/elasticsearch_utils.cpython-38.pyc (deflated 49%)
updating: hail_scripts/elasticsearch/__pycache__/elasticsearch_client_v7.cpython-37.pyc (deflated 57%)
updating: hail_scripts/elasticsearch/__pycache__/elasticsearch_client_v7.cpython-38.pyc (deflated 56%)
updating: hail_scripts/elasticsearch/__pycache__/elasticsearch_utils.cpython-37.pyc (deflated 48%)
updating: hail_scripts/elasticsearch/__pycache__/__init__.cpython-37.pyc (deflated 26%)
updating: hail_scripts/elasticsearch/__pycache__/hail_elasticsearch_client.cpython-38.pyc (deflated 57%)
updating: hail_scripts/elasticsearch/hail_elasticsearch_client.py (deflated 72%)
updating: hail_scripts/elasticsearch/elasticsearch_utils.py (deflated 70%)
updating: hail_scripts/elasticsearch/elasticsearch_utils_tests.py (deflated 67%)
  adding: hail_scripts/validate_vcf/validate_vcf.py (deflated 71%)
gcloud dataproc jobs submit pyspark       --cluster=vcf-validation       --py-files=/var/folders/p8/c2yjwplx5n5c8z8s5c91ddqc0000gq/T/hail_scripts.zip              --region=us-central1       --id=vcf_validation_20220819-1011              "hail_scripts/validate_vcf/validate_vcf.py" -- "--use-dataproc"

/Users/shifa/dev/hail_elasticsearch_pipelines
Job [vcf_validation_20220819-1011] submitted.
Waiting for job output...
Initializing Hail with default parameters...
2022-08-19 14:11:58 INFO  SparkContext:57 - Running Spark version 3.1.2
2022-08-19 14:11:59 INFO  ResourceUtils:57 - ==============================================================
2022-08-19 14:11:59 INFO  ResourceUtils:57 - No custom resources configured for spark.driver.
2022-08-19 14:11:59 INFO  ResourceUtils:57 - ==============================================================
2022-08-19 14:11:59 INFO  SparkContext:57 - Submitted application: Hail
2022-08-19 14:11:59 INFO  SparkContext:57 - Spark configuration:
spark.app.name=Hail
spark.app.startTime=1660918318971
spark.driver.extraClassPath=/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar
spark.driver.extraJavaOptions=-Xss4M
spark.driver.maxResultSize=0
spark.driver.memory=41g
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=10000
spark.dynamicAllocation.minExecutors=1
spark.eventLog.dir=gs://dataproc-temp-us-central1-733952080251-pwl2itzn/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/spark-job-history
spark.eventLog.enabled=true
spark.executor.cores=4
spark.executor.extraClassPath=./hail-all-spark.jar
spark.executor.extraJavaOptions=-Xss4M
spark.executor.instances=2
spark.executor.memory=21840m
spark.executorEnv.OPENBLAS_NUM_THREADS=1
spark.executorEnv.PYTHONHASHSEED=0
spark.extraListeners=com.google.cloud.spark.performance.DataprocMetricsListener
spark.hadoop.hive.execution.engine=mr
spark.hadoop.io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,is.hail.io.compress.BGzipCodecTbi,org.apache.hadoop.io.compress.GzipCodec
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
spark.hadoop.mapreduce.input.fileinputformat.split.minsize=0
spark.history.fs.logDirectory=gs://dataproc-temp-us-central1-733952080251-pwl2itzn/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/spark-job-history
spark.jars=file:/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar
spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
spark.kryoserializer.buffer.max=1g
spark.logConf=true
spark.master=yarn
spark.metrics.namespace=app_name:${spark.app.name}.app_id:${spark.app.id}
spark.repl.local.jars=file:///opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar
spark.rpc.message.maxSize=512
spark.scheduler.minRegisteredResourcesRatio=0.0
spark.scheduler.mode=FAIR
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.sql.adaptive.enabled=true
spark.sql.autoBroadcastJoinThreshold=163m
spark.sql.catalogImplementation=hive
spark.sql.cbo.enabled=true
spark.sql.cbo.joinReorder.enabled=true
spark.submit.deployMode=client
spark.submit.pyFiles=/tmp/vcf_validation_20220819-1011/hail_scripts.zip
spark.task.maxFailures=20
spark.ui.port=0
spark.ui.showConsoleProgress=false
spark.yarn.am.memory=640m
spark.yarn.dist.jars=file:///opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar
spark.yarn.dist.pyFiles=file:///tmp/vcf_validation_20220819-1011/hail_scripts.zip
spark.yarn.historyServer.address=vcf-validation-m:18080
spark.yarn.isPython=true
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.yarn.tags=dataproc_hash_69b2d16a-5a69-335e-b202-5381c4fcb4d3,dataproc_job_vcf_validation_20220819-1011,dataproc_master_index_0,dataproc_uuid_c000492d-8218-3275-9eab-7c558c1482c2
spark.yarn.unmanagedAM.enabled=true
2022-08-19 14:11:59 INFO  ResourceProfile:57 - Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 21840, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2022-08-19 14:11:59 INFO  ResourceProfile:57 - Limiting resource is cpus at 4 tasks per executor
2022-08-19 14:11:59 INFO  ResourceProfileManager:57 - Added ResourceProfile id: 0
2022-08-19 14:11:59 INFO  SecurityManager:57 - Changing view acls to: root
2022-08-19 14:11:59 INFO  SecurityManager:57 - Changing modify acls to: root
2022-08-19 14:11:59 INFO  SecurityManager:57 - Changing view acls groups to: 
2022-08-19 14:11:59 INFO  SecurityManager:57 - Changing modify acls groups to: 
2022-08-19 14:11:59 INFO  SecurityManager:57 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2022-08-19 14:11:59 INFO  Utils:57 - Successfully started service 'sparkDriver' on port 34589.
2022-08-19 14:11:59 INFO  SparkEnv:57 - Registering MapOutputTracker
2022-08-19 14:11:59 INFO  SparkEnv:57 - Registering BlockManagerMaster
2022-08-19 14:11:59 INFO  BlockManagerMasterEndpoint:57 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2022-08-19 14:11:59 INFO  BlockManagerMasterEndpoint:57 - BlockManagerMasterEndpoint up
2022-08-19 14:11:59 INFO  SparkEnv:57 - Registering BlockManagerMasterHeartbeat
2022-08-19 14:11:59 INFO  DiskBlockManager:57 - Created local directory at /hadoop/spark/tmp/blockmgr-00cfb18f-30b3-46ea-b6e4-4b5e54d8f485
2022-08-19 14:11:59 INFO  MemoryStore:57 - MemoryStore started with capacity 21.7 GiB
2022-08-19 14:11:59 INFO  SparkEnv:57 - Registering OutputCommitCoordinator
2022-08-19 14:11:59 INFO  log:169 - Logging initialized @5342ms to org.sparkproject.jetty.util.log.Slf4jLog
2022-08-19 14:11:59 INFO  Server:375 - jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_312-b07
2022-08-19 14:11:59 INFO  Server:415 - Started @5448ms
2022-08-19 14:11:59 INFO  AbstractConnector:331 - Started ServerConnector@2e001c12{HTTP/1.1, (http/1.1)}{0.0.0.0:33355}
2022-08-19 14:11:59 INFO  Utils:57 - Successfully started service 'SparkUI' on port 33355.
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@5c89ad61{/jobs,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@b5ed6c5{/jobs/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@636b0d01{/jobs/job,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@5f9b6a3a{/jobs/job/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@67ae0a7d{/stages,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@353e6d85{/stages/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@5db6ff34{/stages/stage,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@6da1eb6a{/stages/stage/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@b819c95{/stages/pool,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@4962f1c9{/stages/pool/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@3cb9d0f0{/storage,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@1ca446bf{/storage/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@9650802{/storage/rdd,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@457200aa{/storage/rdd/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@54210dac{/environment,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@134ac3ff{/environment/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@13d39293{/executors,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@574cde82{/executors/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@ac06593{/executors/threadDump,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@4c721bef{/executors/threadDump/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@11e3f5b{/static,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@6ef7ea64{/,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@8bf39b4{/api,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@375d18c{/jobs/job/kill,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@5c41043{/stages/stage/kill,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO  SparkUI:57 - Bound SparkUI to 0.0.0.0, and started at http://vcf-validation-m.c.seqr-project.internal:33355
2022-08-19 14:12:00 INFO  SparkContext:57 - Added JAR file:/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar at spark://vcf-validation-m.c.seqr-project.internal:34589/jars/hail-all-spark.jar with timestamp 1660918318971
2022-08-19 14:12:00 INFO  FairSchedulableBuilder:57 - Creating Fair Scheduler pools from default file: fairscheduler.xml
2022-08-19 14:12:00 INFO  FairSchedulableBuilder:57 - Created pool: default, schedulingMode: FAIR, minShare: 0, weight: 1
2022-08-19 14:12:00 INFO  Utils:57 - Using initial executors = 2, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
2022-08-19 14:12:00 INFO  RMProxy:134 - Connecting to ResourceManager at vcf-validation-m/10.128.0.5:8032
2022-08-19 14:12:00 INFO  AHSProxy:42 - Connecting to Application History server at vcf-validation-m/10.128.0.5:10200
2022-08-19 14:12:00 INFO  Client:57 - Requesting a new application from cluster with 3 NodeManagers
2022-08-19 14:12:01 INFO  Configuration:2795 - resource-types.xml not found
2022-08-19 14:12:01 INFO  ResourceUtils:442 - Unable to find 'resource-types.xml'.
2022-08-19 14:12:01 INFO  Client:57 - Verifying our application has not requested more than the maximum memory capability of the cluster (48048 MB per container)
2022-08-19 14:12:01 INFO  Client:57 - Will allocate AM container, with 1024 MB memory including 384 MB overhead
2022-08-19 14:12:01 INFO  Client:57 - Setting up container launch context for our AM
2022-08-19 14:12:01 INFO  Client:57 - Setting up the launch environment for our AM container
2022-08-19 14:12:01 INFO  Client:57 - Preparing resources for our AM container
2022-08-19 14:12:01 INFO  Client:57 - Uploading resource file:/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/hail-all-spark.jar
2022-08-19 14:12:02 INFO  Client:57 - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/pyspark.zip
2022-08-19 14:12:03 INFO  Client:57 - Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.9-src.zip -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/py4j-0.10.9-src.zip
2022-08-19 14:12:03 INFO  Client:57 - Uploading resource file:/tmp/vcf_validation_20220819-1011/hail_scripts.zip -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/hail_scripts.zip
2022-08-19 14:12:04 INFO  Client:57 - Uploading resource file:/hadoop/spark/tmp/spark-f9445399-7ff9-4437-a95c-1fceaca753d9/__spark_conf__3459400324650576704.zip -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/__spark_conf__.zip
2022-08-19 14:12:04 INFO  SecurityManager:57 - Changing view acls to: root
2022-08-19 14:12:04 INFO  SecurityManager:57 - Changing modify acls to: root
2022-08-19 14:12:04 INFO  SecurityManager:57 - Changing view acls groups to: 
2022-08-19 14:12:04 INFO  SecurityManager:57 - Changing modify acls groups to: 
2022-08-19 14:12:04 INFO  SecurityManager:57 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2022-08-19 14:12:04 INFO  Client:57 - Submitting application application_1660917791379_0001 to ResourceManager
2022-08-19 14:12:04 INFO  YarnClientImpl:329 - Submitted application application_1660917791379_0001
2022-08-19 14:12:05 INFO  Client:57 - Application report for application_1660917791379_0001 (state: ACCEPTED)
2022-08-19 14:12:05 INFO  Client:57 - 
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1660918324665
     final status: UNDEFINED
     tracking URL: http://vcf-validation-m:8088/proxy/application_1660917791379_0001/
     user: root
2022-08-19 14:12:05 INFO  SecurityManager:57 - Changing view acls to: root
2022-08-19 14:12:05 INFO  SecurityManager:57 - Changing modify acls to: root
2022-08-19 14:12:05 INFO  SecurityManager:57 - Changing view acls groups to: 
2022-08-19 14:12:05 INFO  SecurityManager:57 - Changing modify acls groups to: 
2022-08-19 14:12:05 INFO  SecurityManager:57 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2022-08-19 14:12:05 INFO  RMProxy:134 - Connecting to ResourceManager at vcf-validation-m/10.128.0.5:8030
2022-08-19 14:12:06 INFO  YarnRMClient:57 - Registering the ApplicationMaster
2022-08-19 14:12:06 INFO  YarnClientSchedulerBackend:57 - Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> vcf-validation-m, PROXY_URI_BASES -> http://vcf-validation-m:8088/proxy/application_1660917791379_0001), /proxy/application_1660917791379_0001
2022-08-19 14:12:06 INFO  ApplicationMaster:57 - Preparing Local resources
2022-08-19 14:12:06 INFO  ApplicationMaster:57 - 
===============================================================================
Default YARN executor launch context:
  env:
    SPARK_WORKER_WEBUI_PORT -> 18081
    SPARK_ENV_LOADED -> 1
    CLASSPATH -> ./hail-all-spark.jar<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>/usr/lib/spark/jars/*<CPS>:/etc/hive/conf:/usr/local/share/google/dataproc/lib/*:/usr/share/java/mysql.jar<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
    SPARK_LOG_DIR -> /var/log/spark
    SPARK_LOCAL_DIRS -> /hadoop/spark/tmp
    SPARK_DIST_CLASSPATH -> :/etc/hive/conf:/usr/local/share/google/dataproc/lib/*:/usr/share/java/mysql.jar
    SPARK_USER -> root
    SPARK_SUBMIT_OPTS ->  -Dscala.usejavacp=true
    SPARK_CONF_DIR -> /usr/lib/spark/conf
    PYTHONHASHSEED -> 0
    SPARK_HOME -> /usr/lib/spark/
    PYTHONPATH -> /usr/lib/spark/python/lib/pyspark.zip:/usr/lib/spark/python/lib/py4j-0.10.9-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip<CPS>{{PWD}}/hail_scripts.zip
    SPARK_MASTER_PORT -> 7077
    OPENBLAS_NUM_THREADS -> 1
    SPARK_WORKER_DIR -> /hadoop/spark/work
    SPARK_WORKER_PORT -> 7078
    SPARK_DAEMON_MEMORY -> 4000m
    SPARK_MASTER_WEBUI_PORT -> 18080
    SPARK_LIBRARY_PATH -> :/usr/lib/hadoop/lib/native
    SPARK_SCALA_VERSION -> 2.12

  command:
    {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx21840m \ 
      '-Xss4M' \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.driver.port=34589' \ 
      '-Dspark.ui.port=0' \ 
      '-Dspark.rpc.message.maxSize=512' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      -XX:OnOutOfMemoryError='kill %p' \ 
      org.apache.spark.executor.YarnCoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@vcf-validation-m.c.seqr-project.internal:34589 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      4 \ 
      --app-id \ 
      application_1660917791379_0001 \ 
      --resourceProfileId \ 
      0 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      --user-class-path \ 
      file:$PWD/hail-all-spark.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    __spark_conf__ -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/__spark_conf__.zip" } size: 268110 timestamp: 1660918324542 type: ARCHIVE visibility: PRIVATE
    pyspark.zip -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/pyspark.zip" } size: 887063 timestamp: 1660918323141 type: FILE visibility: PRIVATE
    py4j-0.10.9-src.zip -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/py4j-0.10.9-src.zip" } size: 41587 timestamp: 1660918323565 type: FILE visibility: PRIVATE
    hail_scripts.zip -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/hail_scripts.zip" } size: 110212 timestamp: 1660918323989 type: FILE visibility: PRIVATE
    hail-all-spark.jar -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/hail-all-spark.jar" } size: 101403518 timestamp: 1660918322562 type: FILE visibility: PRIVATE

===============================================================================
2022-08-19 14:12:06 INFO  Utils:57 - Using initial executors = 2, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
2022-08-19 14:12:06 INFO  YarnAllocator:57 - Resource profile 0 doesn't exist, adding it
2022-08-19 14:12:06 INFO  YarnSchedulerBackend$YarnSchedulerEndpoint:57 - ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM@vcf-validation-m.c.seqr-project.internal:34589)
2022-08-19 14:12:06 INFO  YarnAllocator:57 - Will request 2 executor container(s) for  ResourceProfile Id: 0, each with 4 core(s) and 24024 MB memory. with custom resources: <memory:24024, vCores:4>
2022-08-19 14:12:06 INFO  YarnAllocator:57 - Submitted 2 unlocalized container requests.
2022-08-19 14:12:06 INFO  StatsdSink:57 - StatsdSink started with prefix: 'spark.applicationMaster'
2022-08-19 14:12:06 INFO  ApplicationMaster:57 - Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
2022-08-19 14:12:06 INFO  YarnAllocator:57 - Launching container container_1660917791379_0001_01_000001 on host vcf-validation-w-0.c.seqr-project.internal for executor with ID 1 for ResourceProfile Id 0
2022-08-19 14:12:06 INFO  YarnAllocator:57 - Received 1 containers from YARN, launching executors on 1 of them.
2022-08-19 14:12:06 INFO  Client:57 - Application report for application_1660917791379_0001 (state: RUNNING)
2022-08-19 14:12:06 INFO  Client:57 - 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.128.0.5
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1660918324665
     final status: UNDEFINED
     tracking URL: http://vcf-validation-m:8088/proxy/application_1660917791379_0001/
     user: root
2022-08-19 14:12:06 INFO  YarnClientSchedulerBackend:57 - Application application_1660917791379_0001 has started running.
2022-08-19 14:12:06 INFO  Utils:57 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39319.
2022-08-19 14:12:06 INFO  NettyBlockTransferService:81 - Server created on vcf-validation-m.c.seqr-project.internal:39319
2022-08-19 14:12:07 INFO  BlockManager:57 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2022-08-19 14:12:07 INFO  BlockManagerMaster:57 - Registering BlockManager BlockManagerId(driver, vcf-validation-m.c.seqr-project.internal, 39319, None)
2022-08-19 14:12:07 INFO  BlockManagerMasterEndpoint:57 - Registering block manager vcf-validation-m.c.seqr-project.internal:39319 with 21.7 GiB RAM, BlockManagerId(driver, vcf-validation-m.c.seqr-project.internal, 39319, None)
2022-08-19 14:12:07 INFO  BlockManagerMaster:57 - Registered BlockManager BlockManagerId(driver, vcf-validation-m.c.seqr-project.internal, 39319, None)
2022-08-19 14:12:07 INFO  BlockManager:57 - external shuffle service port = 7337
2022-08-19 14:12:07 INFO  BlockManager:57 - Initialized BlockManager: BlockManagerId(driver, vcf-validation-m.c.seqr-project.internal, 39319, None)
2022-08-19 14:12:07 INFO  StatsdSink:57 - StatsdSink started with prefix: 'spark.driver'
2022-08-19 14:12:07 INFO  ServerInfo:57 - Adding filter to /metrics/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2022-08-19 14:12:07 INFO  ContextHandler:916 - Started o.s.j.s.ServletContextHandler@21f6be7{/metrics/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:07 INFO  YarnAllocator:57 - Launching container container_1660917791379_0001_01_000002 on host vcf-validation-w-1.c.seqr-project.internal for executor with ID 2 for ResourceProfile Id 0
2022-08-19 14:12:07 INFO  YarnAllocator:57 - Received 1 containers from YARN, launching executors on 1 of them.
2022-08-19 14:12:07 INFO  GoogleCloudStorageImpl:101 - Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
2022-08-19 14:12:08 INFO  SingleEventLogFileWriter:57 - Logging events to gs://dataproc-temp-us-central1-733952080251-pwl2itzn/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/spark-job-history/application_1660917791379_0001.inprogress
2022-08-19 14:12:08 INFO  Utils:57 - Using initial executors = 2, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
2022-08-19 14:12:08 INFO  YarnAllocator:57 - Resource profile 0 doesn't exist, adding it
2022-08-19 14:12:08 INFO  SparkContext:57 - Registered listener com.google.cloud.spark.performance.DataprocMetricsListener
2022-08-19 14:12:08 INFO  YarnClientSchedulerBackend:57 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2022-08-19 14:12:08 INFO  Hail:28 - SparkUI: http://vcf-validation-m.c.seqr-project.internal:33355
Running on Apache Spark version 3.1.2
SparkUI available at http://vcf-validation-m.c.seqr-project.internal:33355
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.85-9b98676b6ad8
LOGGING: writing to /home/hail/hail-20220819-1411-0.2.85-9b98676b6ad8.log
loading time: 0:00:12.392576
2022-08-19 14:12:24 Hail: INFO: Coerced prefix-sorted dataset
2022-08-19 14:12:37 Hail: INFO: Coerced prefix-sorted dataset
2022-08-19 14:12:47 Hail: INFO: Coerced prefix-sorted dataset
Table contains 3 out of 1743 common noncoding variants.
Table contains 235 out of 314 common coding variants.
Sample type validation error: dataset sample-type is specified as WGS but appears to be WGS because it contains many common coding variants
validation time (including contig check): 0:00:41.325795
2022-08-19 14:12:55 Hail: INFO: Coerced prefix-sorted dataset
2022-08-19 14:13:02 Hail: INFO: Coerced prefix-sorted dataset
Table contains 3 out of 1743 common noncoding variants.
Table contains 235 out of 314 common coding variants.
Sample type validation error: dataset sample-type is specified as WGS but appears to be WGS because it contains many common coding variants
validation time (not including contig check): 0:00:14.798183
Job [vcf_validation_20220819-1011] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-6d4ee93a-f906-4bfe-934f-eb1c2c786273-us-central1/google-cloud-dataproc-metainfo/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/jobs/vcf_validation_20220819-1011/
driverOutputResourceUri: gs://dataproc-6d4ee93a-f906-4bfe-934f-eb1c2c786273-us-central1/google-cloud-dataproc-metainfo/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/jobs/vcf_validation_20220819-1011/driveroutput
jobUuid: c000492d-8218-3275-9eab-7c558c1482c2
placement:
  clusterName: vcf-validation
  clusterUuid: b9aa3cd8-07ab-4a82-a507-2de6db0de4f5
pysparkJob:
  args:
  - --use-dataproc
  mainPythonFileUri: gs://dataproc-6d4ee93a-f906-4bfe-934f-eb1c2c786273-us-central1/google-cloud-dataproc-metainfo/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/jobs/vcf_validation_20220819-1011/staging/validate_vcf.py
  pythonFileUris:
  - gs://dataproc-6d4ee93a-f906-4bfe-934f-eb1c2c786273-us-central1/google-cloud-dataproc-metainfo/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/jobs/vcf_validation_20220819-1011/staging/hail_scripts.zip
reference:
  jobId: vcf_validation_20220819-1011
  projectId: seqr-project
status:
  state: DONE
  stateStartTime: '2022-08-19T14:13:09.121548Z'
statusHistory:
- state: PENDING
  stateStartTime: '2022-08-19T14:11:53.091031Z'
- state: SETUP_DONE
  stateStartTime: '2022-08-19T14:11:53.153155Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2022-08-19T14:11:53.628725Z'
yarnApplications:
- name: Hail
  progress: 1.0
  state: FINISHED
  trackingUrl: http://vcf-validation-m:8088/proxy/application_1660917791379_0001/

Process finished with exit code 0
hanars commented 2 years ago

I think it's okay. I would be interested to see what the runtime would be with 1 worker and no secondary workers, just to see how it would perform on a single thread. I'd also be curious to know how big the VCF you were validating is (WES vs. WGS, and how many samples).

hanars commented 2 years ago

One thing we can do without hail is header validation, since we are already validating the sample IDs in the header. @mike-w-wilson will provide us with a list of the required INFO fields in the pipeline, and then we can add a check to the validate-VCF step that they are present in the file.

hanars commented 2 years ago

@ShifaSZ we should validate that the following fields are in the VCF header: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO (containing AC, AN, AF), and FORMAT (containing AD, DP, GQ, GT).
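
A rough sketch of the fixed-column part of that check; the function name and plain-file handling here are illustrative only, not the actual seqr implementation.

import gzip

REQUIRED_COLUMNS = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT']

def check_vcf_column_header(vcf_path):
    """Return any required columns missing from the #CHROM header line."""
    open_fn = gzip.open if vcf_path.endswith(('.gz', '.bgz')) else open
    with open_fn(vcf_path, 'rt') as f:
        for line in f:
            if line.startswith('#CHROM'):
                columns = line.lstrip('#').rstrip('\n').split('\t')
                return [col for col in REQUIRED_COLUMNS if col not in columns]
            if not line.startswith('#'):
                break  # reached data rows without seeing a #CHROM line
    return list(REQUIRED_COLUMNS)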

ShifaSZ commented 2 years ago

The VCF header includes meta information for the sub-fields of INFO and FORMAT (e.g., AC, AN, and AF for INFO), but there are no separate columns for these sub-fields. They appear in the variant rows rather than as columns in the VCF header. Here is an example of the VCF header line and the sub-fields of the INFO and FORMAT fields.

#CHROM POS      ID         REF   ALT    QUAL  FILTER   INFO                             FORMAT       NA00001         NA00002          NA00003
20     14370    rs6054257  G     A      29    PASS    NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51   1/1:43:5:.,.
20     17330    .          T     A      3     q10     NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3     0/0:41:3

We will validate the header metadata for these sub-fields but won't validate them in the variant rows. Agree?

hanars commented 2 years ago

Since the impetus for this change was that we got a file missing AF, I do think it is important to check that the required INFO and FORMAT sub-fields are in the VCF. While it's not perfect, I think we should read in the first non-header row of the VCF and parse the FORMAT and INFO fields of just that first row to check whether the sub-fields are there. But I agree that we should not try to validate every row.
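
A sketch of that first-data-row check; field positions follow the fixed VCF column order, and the required sub-field sets come from the comment above.

REQUIRED_INFO_FIELDS = {'AC', 'AN', 'AF'}
REQUIRED_FORMAT_FIELDS = {'AD', 'DP', 'GQ', 'GT'}

def check_first_data_row(vcf_lines):
    """Parse the first non-header row and report missing INFO/FORMAT sub-fields."""
    for line in vcf_lines:
        if line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        info_keys = {entry.split('=')[0] for entry in fields[7].split(';')}
        format_keys = set(fields[8].split(':')) if len(fields) > 8 else set()
        return {
            'missing INFO sub-fields': REQUIRED_INFO_FIELDS - info_keys,
            'missing FORMAT sub-fields': REQUIRED_FORMAT_FIELDS - format_keys,
        }
    return None  # no data rows at all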

ShifaSZ commented 2 years ago

Do we need to validate the meta information for those sub-fields? The meta info lines are in the header, start with ##, and describe the Type and Number of each field's values, etc. We once received SV VCF data with an incorrect gnomAD AF type.

hanars commented 2 years ago

Can you share an example of what that meta info header looks like? We should definitely not check the values in the VCF themselves to validate the type, but if the meta info is easy to parse from the header it might be nice to validate it.

ShifaSZ commented 2 years ago

Example of the meta info for the sub-fields of INFO and FORMAT:

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
hanars commented 2 years ago

I think validating that there is an entry in the meta info is better than checking the first row of the data. It's up to you if you want to validate the Type or not.
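
A possible shape for that meta-info check, assuming the ## header lines have already been read into a list; Type validation could be layered on by also capturing the Type= attribute.

import re

REQUIRED_META_FIELDS = {'INFO': {'AC', 'AN', 'AF'}, 'FORMAT': {'AD', 'DP', 'GQ', 'GT'}}

def check_meta_info(header_lines):
    """Return required INFO/FORMAT IDs that have no ##INFO/##FORMAT meta line."""
    found = {'INFO': set(), 'FORMAT': set()}
    for line in header_lines:
        match = re.match(r'##(INFO|FORMAT)=<ID=([^,>]+)', line)
        if match:
            found[match.group(1)].add(match.group(2))
    return {section: ids - found[section] for section, ids in REQUIRED_META_FIELDS.items()}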

ShifaSZ commented 2 years ago

FORMAT (containing AD, DP, GQ, GT)

Do we need to validate the FORMAT fields based on variant type? e.g., VARIANT or SV?

hanars commented 2 years ago

We only support regular VARIANT VCFs in AnVIL loading; we do not support SVs at all.

hanars commented 2 years ago

Header validation has been added. Moving this back to blocked to track future work that requires hail.

lynnpais commented 1 year ago

Closing out in favor of https://github.com/broadinstitute/seqr-private/issues/1290