Closed (hanars closed this 1 year ago)
It seems to me that a VCF file carries no genome build or sample type information. It does have a chromosome for each variant, but we don't check the chromosome (or contig) while loading.
There is code in the loading pipeline that does sample type and genome build validation. I was hoping we could abstract out that code and have it run directly in seqr, but we need to do some investigation to see if it is performant enough to run in under 2 minutes, which it may not be. Per the ticket description:
May need to wait for hail backend/ better ability to quickly run hail in seqr
I ran the VCF validation in a Jupyter Notebook on my MacBook. The time was almost 2 minutes. It could be faster on the seqr server but won't be too quick after adding frontend/backend communication time.
The validation code is as below.
import logging

import hail as hl

logger = logging.getLogger(__name__)

GRCh37_STANDARD_CONTIGS = {'1','10','11','12','13','14','15','16','17','18','19','2','20','21','22','3','4','5','6','7','8','9','X','Y', 'MT'}
GRCh38_STANDARD_CONTIGS = {'chr1','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chr2','chr20','chr21','chr22','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chrX','chrY', 'chrM'}
OPTIONAL_CHROMOSOMES = ['MT', 'chrM', 'Y', 'chrY']
VARIANT_THRESHOLD = 100
CONST_GRCh37 = '37'
CONST_GRCh38 = '38'
GLOBAL_CONFIG = {
    'validation_37_noncoding_ht': 'gs://seqr-reference-data/GRCh37/validate_ht/common_noncoding_variants.grch37.ht',
    'validation_37_coding_ht': 'gs://seqr-reference-data/GRCh37/validate_ht/common_coding_variants.grch37.ht',
    'validation_38_noncoding_ht': 'gs://seqr-reference-data/GRCh38/validate_ht/common_noncoding_variants.grch38.ht',
    'validation_38_coding_ht': 'gs://seqr-reference-data/GRCh38/validate_ht/common_coding_variants.grch38.ht',
}


class SeqrValidationError(Exception):
    pass


def import_vcf(genome_version, source_paths):
    # Import the VCFs from inputs. Set min partitions so that local pipeline execution takes advantage of all CPUs.
    # Recode contig names so that they match the naming convention of the target reference genome.
    recode = {}
    if genome_version == "38":
        recode = {f"{i}": f"chr{i}" for i in (list(range(1, 23)) + ['X', 'Y'])}
    elif genome_version == "37":
        recode = {f"chr{i}": f"{i}" for i in (list(range(1, 23)) + ['X', 'Y'])}

    return hl.import_vcf(list(source_paths),
                         reference_genome='GRCh' + genome_version,
                         skip_invalid_loci=True,
                         contig_recoding=recode,
                         force_bgz=True, min_partitions=500)


def get_sample_type_stats(mt, genome_version, threshold=0.3):
    """
    Calculate stats for sample type by checking against a list of common coding and non-coding variants.
    If the match for each respective type is over the threshold, we return a match.

    :param mt: Matrix Table to check
    :param genome_version: reference genome version
    :param threshold: if the matched percentage is over this threshold, we classify as match
    :return: a dict of coding/non-coding to dict with 'matched_count', 'total_count' and 'match' boolean.
    """
    stats = {}
    types_to_ht_path = {
        'noncoding': GLOBAL_CONFIG['validation_%s_noncoding_ht' % genome_version],
        'coding': GLOBAL_CONFIG['validation_%s_coding_ht' % genome_version]
    }
    for sample_type, ht_path in types_to_ht_path.items():
        ht = hl.read_table(ht_path)
        stats[sample_type] = ht_stats = {
            'matched_count': mt.semi_join_rows(ht).count_rows(),
            'total_count': ht.count(),
        }
        ht_stats['match'] = (ht_stats['matched_count'] / ht_stats['total_count']) >= threshold
    return stats


def contig_check(mt, standard_contigs, threshold):
    check_result_dict = {}

    # check chromosomes that are not in the VCF
    row_dict = mt.aggregate_rows(hl.agg.counter(mt.locus.contig))
    contigs_set = set(row_dict.keys())

    all_missing_contigs = standard_contigs - contigs_set
    missing_contigs_without_optional = [contig for contig in all_missing_contigs if contig not in OPTIONAL_CHROMOSOMES]

    if missing_contigs_without_optional:
        check_result_dict['Missing contig(s)'] = missing_contigs_without_optional
        logger.warning('Missing the following chromosome(s): {}'.format(', '.join(missing_contigs_without_optional)))

    # check for unexpected chromosomes and for standard chromosomes with too few variants
    for k, v in row_dict.items():
        if k not in standard_contigs:
            check_result_dict.setdefault('Unexpected chromosome(s)', []).append(k)
            logger.warning('Chromosome %s is unexpected.', k)
        elif (k not in OPTIONAL_CHROMOSOMES) and (v < threshold):
            check_result_dict.setdefault(f'Chromosome(s) whose variants count under threshold {threshold}', []).append(k)
            logger.warning('Chromosome %s has %d rows, which is lower than threshold %d.', k, v, threshold)

    return check_result_dict


def validate_mt(mt, genome_version, sample_type):
    """
    Validate the mt by checking against a list of common coding and non-coding variants given its
    genome version. This validates genome_version, variants, and the reported sample type.

    :param mt: mt to validate
    :param genome_version: reference genome version
    :param sample_type: WGS or WES
    :return: True or Exception
    """
    if genome_version == CONST_GRCh37:
        contig_check_result = contig_check(mt, GRCh37_STANDARD_CONTIGS, VARIANT_THRESHOLD)
    elif genome_version == CONST_GRCh38:
        contig_check_result = contig_check(mt, GRCh38_STANDARD_CONTIGS, VARIANT_THRESHOLD)
    else:
        raise SeqrValidationError('Unsupported genome version: {}'.format(genome_version))

    if contig_check_result:
        err_msg = ''
        for k, v in contig_check_result.items():
            err_msg += '{k}: {v}. '.format(k=k, v=', '.join(v))
        # raise SeqrValidationError(err_msg)
        print(err_msg)

    sample_type_stats = get_sample_type_stats(mt, genome_version)

    for name, stat in sample_type_stats.items():
        logger.info('Table contains %i out of %i common %s variants.' %
                    (stat['matched_count'], stat['total_count'], name))

    has_coding = sample_type_stats['coding']['match']
    has_noncoding = sample_type_stats['noncoding']['match']

    if not has_coding and not has_noncoding:
        # No common variants detected.
        # raise SeqrValidationError(
        print(
            'Genome version validation error: dataset specified as GRCh{genome_version} but doesn\'t contain '
            'the expected number of common GRCh{genome_version} variants'.format(genome_version=genome_version)
        )
    elif has_noncoding and not has_coding:
        # Non coding only.
        # raise SeqrValidationError(
        print(
            'Sample type validation error: Dataset contains noncoding variants but is missing common coding '
            'variants for GRCh{}. Please verify that the dataset contains coding variants.'.format(genome_version)
        )
    elif has_coding and not has_noncoding:
        # Only coding should be WES.
        if sample_type != 'WES':
            # raise SeqrValidationError(
            print(
                'Sample type validation error: dataset sample-type is specified as {} but appears to be '
                'WES because it contains many common coding variants but few common non-coding '
                'variants'.format(sample_type)
            )
    elif has_noncoding and has_coding:
        # Both should be WGS.
        if sample_type != 'WGS':
            # raise SeqrValidationError(
            print(
                'Sample type validation error: dataset sample-type is specified as {} but appears to be '
                'WGS because it contains many common non-coding variants'.format(sample_type)
            )
    return True
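
For anyone who wants to reason about the checks without a Hail session, here is a minimal pure-Python sketch of the same decision logic. It is an illustration under assumptions, not the pipeline code: the helper names `build_recode`, `check_contigs` and `infer_sample_type` are hypothetical, the per-contig counts are passed in as a plain dict instead of being aggregated from a MatrixTable, and the match counts are made-up numbers.

```python
from typing import Dict, List, Optional, Set

# Same optional contigs as OPTIONAL_CHROMOSOMES in the listing above.
OPTIONAL = {'MT', 'chrM', 'Y', 'chrY'}


def build_recode(genome_version: str) -> Dict[str, str]:
    """Contig renaming map, mirroring the dict built in import_vcf (hypothetical helper)."""
    names = [str(i) for i in range(1, 23)] + ['X', 'Y']
    if genome_version == '38':
        return {n: f'chr{n}' for n in names}
    if genome_version == '37':
        return {f'chr{n}': n for n in names}
    return {}


def check_contigs(counts: Dict[str, int], standard: Set[str], threshold: int = 100) -> Dict[str, List[str]]:
    """Report missing, unexpected and sparsely populated contigs from per-contig row counts."""
    missing = sorted(c for c in standard - set(counts) if c not in OPTIONAL)
    unexpected = sorted(c for c in counts if c not in standard)
    sparse = sorted(c for c, n in counts.items()
                    if c in standard and c not in OPTIONAL and n < threshold)
    return {'missing': missing, 'unexpected': unexpected, 'sparse': sparse}


def infer_sample_type(coding_matched: int, coding_total: int,
                      noncoding_matched: int, noncoding_total: int,
                      threshold: float = 0.3) -> Optional[str]:
    """Return 'WGS', 'WES', or None when the data fits neither pattern."""
    has_coding = coding_matched / coding_total >= threshold
    has_noncoding = noncoding_matched / noncoding_total >= threshold
    if has_coding and has_noncoding:
        return 'WGS'
    if has_coding:
        return 'WES'
    # Either no common variants at all (likely wrong genome build)
    # or non-coding only (coding variants missing).
    return None


# Illustrative input: chromosome 7 absent, chromosome 2 sparse, one stray contig.
grch37 = {str(i) for i in range(1, 23)} | {'X', 'Y', 'MT'}
counts = {c: 500 for c in grch37 - {'7'}}
counts['2'] = 10
counts['chrUn'] = 3

print(check_contigs(counts, grch37))
print(infer_sample_type(900, 1000, 50, 1000))
```

Note that the recode map, like the one in `import_vcf`, only covers 1 to 22, X and Y, so `MT`/`chrM` is not renamed by it.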
What's the runtime if we skip the contig check and run only the sample_type_stats checks for build and sample type validation?
The speed on my local computer is much slower (3 minutes without a contig check) today. I'll try it on a Google cloud engine.
Can you run both with and without the contig check on Google cloud engine to make it easier to compare?
The results on a Dataproc cluster are below.
validation time (including contig check): 0:00:41.325795
validation time (not including contig check): 0:00:14.798183
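
The thread does not show how these times were measured. The format looks like a printed `datetime.timedelta`, so the sketch below assumes that; `timed` and `parse_elapsed` are hypothetical helpers, not part of the validation script. It shows one way such a measurement could be taken and how much of the total the contig check accounts for.

```python
import time
from datetime import datetime, timedelta


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed) with elapsed as a timedelta."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, timedelta(seconds=time.perf_counter() - start)


def parse_elapsed(text):
    """Parse a duration printed as H:MM:SS.ffffff back into a timedelta."""
    t = datetime.strptime(text, '%H:%M:%S.%f')
    return timedelta(hours=t.hour, minutes=t.minute,
                     seconds=t.second, microseconds=t.microsecond)


# The two durations reported above.
with_contig = parse_elapsed('0:00:41.325795')
without_contig = parse_elapsed('0:00:14.798183')

# Time attributable to the contig check in this run.
contig_cost = with_contig - without_contig
print(contig_cost)  # 0:00:26.527612
```

With a helper like this, the two configurations could be compared by wrapping each call, for example `timed(validate_mt, mt, '38', 'WES')`, assuming `mt` has already been imported.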
They look good.
Yeah, those are very reasonable times. Based on that, we will want to take this validation live in seqr, but first we need to get hail properly set up in the seqr deployment. I'm going to mark this ticket as blocked on the hail ticket, and we should hold off on working on this until that is done. Thanks for all your hard work looking into this!
Below is the complete output from the Dataproc run. Are the parameter settings okay?
python hail_scripts/validate_vcf/run_dataproc_validate_vcf.py
Cost: $0.95/h + $0.10 preemptible/h = $1.0472000000000001 / hour
gcloud beta dataproc clusters create vcf-validation --region=us-central1 --max-idle=30m --master-machine-type=n1-highmem-8 --master-boot-disk-size=100GB --num-workers=2 --num-secondary-workers=1 --secondary-worker-boot-disk-size=40GB --worker-machine-type=n1-highmem-8 --worker-boot-disk-size=40GB --image-version=2.0.29-debian10 --metadata=WHEEL=gs://hail-common/hailctl/dataproc/0.2.85/hail-0.2.85-py3-none-any.whl,PKGS=aiohttp==3.7.4\|aiohttp_session==2.7.0\|asyncinit==0.2.4\|avro==1.10.2\|bokeh==1.4.0\|boto3==1.21.28\|botocore==1.24.28\|decorator==4.4.2\|Deprecated==1.2.12\|dill==0.3.3\|gcsfs==2021.11.1\|google-auth==1.27.0\|google-cloud-storage==1.25.0\|humanize==1.0.0\|hurry.filesize==0.9\|janus==0.6.2\|nest_asyncio==1.5.4\|numpy==1.20.1\|orjson==3.6.4\|pandas==1.3.5\|parsimonious==0.8.1\|plotly==5.5.0\|PyJWT\|python-json-logger==0.1.11\|requests==2.25.1\|scipy==1.6.1\|sortedcontainers==2.1.0\|tabulate==0.8.3\|tqdm==4.42.1\|uvloop==0.16.0\|luigi\|google-api-python-client\|httplib2==0.19.1\|pyparsing==2.4.7 --properties=dataproc:dataproc.cluster-ttl.consider-yarn-activity=false,spark:spark.driver.memory=41g,spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1 --initialization-actions=gs://hail-common/hailctl/dataproc/0.2.85/init_notebook.py
ERROR: (gcloud.beta.dataproc.clusters.create) ALREADY_EXISTS: Already exists: Failed to create cluster: Cluster projects/seqr-project/regions/us-central1/clusters/vcf-validation
updating: hail_scripts/ (stored 0%)
updating: hail_scripts/.DS_Store (deflated 96%)
updating: hail_scripts/__init__.py (stored 0%)
updating: hail_scripts/utils/ (stored 0%)
updating: hail_scripts/utils/clinvar.py (deflated 59%)
updating: hail_scripts/utils/hail_utils.py (deflated 67%)
updating: hail_scripts/utils/__init__.py (stored 0%)
updating: hail_scripts/utils/__pycache__/ (stored 0%)
updating: hail_scripts/utils/__pycache__/__init__.cpython-38.pyc (deflated 22%)
updating: hail_scripts/utils/__pycache__/clinvar.cpython-37.pyc (deflated 40%)
updating: hail_scripts/utils/__pycache__/hail_utils.cpython-38.pyc (deflated 48%)
updating: hail_scripts/utils/__pycache__/hail_utils.cpython-37.pyc (deflated 47%)
updating: hail_scripts/utils/__pycache__/__init__.cpython-37.pyc (deflated 20%)
updating: hail_scripts/utils/__pycache__/clinvar.cpython-38.pyc (deflated 41%)
updating: hail_scripts/shared/ (stored 0%)
updating: hail_scripts/shared/__pycache__/ (stored 0%)
updating: hail_scripts/shared/__pycache__/elasticsearch_client_v7.cpython-37.pyc (deflated 57%)
updating: hail_scripts/shared/__pycache__/elasticsearch_utils.cpython-37.pyc (deflated 39%)
updating: hail_scripts/shared/__pycache__/__init__.cpython-37.pyc (deflated 22%)
updating: hail_scripts/__pycache__/ (stored 0%)
updating: hail_scripts/__pycache__/__init__.cpython-38.pyc (deflated 24%)
updating: hail_scripts/__pycache__/__init__.cpython-37.pyc (deflated 22%)
updating: hail_scripts/validate_vcf/ (stored 0%)
updating: hail_scripts/validate_vcf/run_dataproc_validate_vcf.py (deflated 48%)
updating: hail_scripts/validate_vcf/__init__.py (stored 0%)
updating: hail_scripts/update_models/ (stored 0%)
updating: hail_scripts/update_models/update_mt_schema.py (deflated 71%)
updating: hail_scripts/update_models/__init__.py (stored 0%)
updating: hail_scripts/v02/ (stored 0%)
updating: hail_scripts/v02/utils/ (stored 0%)
updating: hail_scripts/v02/utils/__pycache__/ (stored 0%)
updating: hail_scripts/v02/utils/__pycache__/elasticsearch_client.cpython-37.pyc (deflated 57%)
updating: hail_scripts/v02/utils/__pycache__/elasticsearch_utils.cpython-37.pyc (deflated 51%)
updating: hail_scripts/v02/utils/__pycache__/__init__.cpython-37.pyc (deflated 20%)
updating: hail_scripts/v02/__pycache__/ (stored 0%)
updating: hail_scripts/v02/__pycache__/__init__.cpython-37.pyc (deflated 21%)
updating: hail_scripts/computed_fields/ (stored 0%)
updating: hail_scripts/computed_fields/test_variant_id.py (deflated 59%)
updating: hail_scripts/computed_fields/flags.py (deflated 82%)
updating: hail_scripts/computed_fields/__init__.py (deflated 38%)
updating: hail_scripts/computed_fields/vep.py (deflated 75%)
updating: hail_scripts/computed_fields/__pycache__/ (stored 0%)
updating: hail_scripts/computed_fields/__pycache__/__init__.cpython-38.pyc (deflated 21%)
updating: hail_scripts/computed_fields/__pycache__/flags.cpython-37.pyc (deflated 76%)
updating: hail_scripts/computed_fields/__pycache__/variant_id.cpython-38.pyc (deflated 52%)
updating: hail_scripts/computed_fields/__pycache__/vep.cpython-38.pyc (deflated 60%)
updating: hail_scripts/computed_fields/__pycache__/variant_id.cpython-37.pyc (deflated 53%)
updating: hail_scripts/computed_fields/__pycache__/vep.cpython-37.pyc (deflated 60%)
updating: hail_scripts/computed_fields/__pycache__/flags.cpython-38.pyc (deflated 75%)
updating: hail_scripts/computed_fields/__pycache__/__init__.cpython-37.pyc (deflated 19%)
updating: hail_scripts/computed_fields/test_flags.py (deflated 87%)
updating: hail_scripts/computed_fields/variant_id.py (deflated 69%)
updating: hail_scripts/elasticsearch/ (stored 0%)
updating: hail_scripts/elasticsearch/elasticsearch_client_v7.py (deflated 71%)
updating: hail_scripts/elasticsearch/__init__.py (stored 0%)
updating: hail_scripts/elasticsearch/__pycache__/ (stored 0%)
updating: hail_scripts/elasticsearch/__pycache__/__init__.cpython-38.pyc (deflated 27%)
updating: hail_scripts/elasticsearch/__pycache__/hail_elasticsearch_client.cpython-37.pyc (deflated 57%)
updating: hail_scripts/elasticsearch/__pycache__/elasticsearch_utils.cpython-38.pyc (deflated 49%)
updating: hail_scripts/elasticsearch/__pycache__/elasticsearch_client_v7.cpython-37.pyc (deflated 57%)
updating: hail_scripts/elasticsearch/__pycache__/elasticsearch_client_v7.cpython-38.pyc (deflated 56%)
updating: hail_scripts/elasticsearch/__pycache__/elasticsearch_utils.cpython-37.pyc (deflated 48%)
updating: hail_scripts/elasticsearch/__pycache__/__init__.cpython-37.pyc (deflated 26%)
updating: hail_scripts/elasticsearch/__pycache__/hail_elasticsearch_client.cpython-38.pyc (deflated 57%)
updating: hail_scripts/elasticsearch/hail_elasticsearch_client.py (deflated 72%)
updating: hail_scripts/elasticsearch/elasticsearch_utils.py (deflated 70%)
updating: hail_scripts/elasticsearch/elasticsearch_utils_tests.py (deflated 67%)
adding: hail_scripts/validate_vcf/validate_vcf.py (deflated 71%)
gcloud dataproc jobs submit pyspark --cluster=vcf-validation --py-files=/var/folders/p8/c2yjwplx5n5c8z8s5c91ddqc0000gq/T/hail_scripts.zip --region=us-central1 --id=vcf_validation_20220819-1011 "hail_scripts/validate_vcf/validate_vcf.py" -- "--use-dataproc"
/Users/shifa/dev/hail_elasticsearch_pipelines
Job [vcf_validation_20220819-1011] submitted.
Waiting for job output...
Initializing Hail with default parameters...
2022-08-19 14:11:58 INFO SparkContext:57 - Running Spark version 3.1.2
2022-08-19 14:11:59 INFO ResourceUtils:57 - ==============================================================
2022-08-19 14:11:59 INFO ResourceUtils:57 - No custom resources configured for spark.driver.
2022-08-19 14:11:59 INFO ResourceUtils:57 - ==============================================================
2022-08-19 14:11:59 INFO SparkContext:57 - Submitted application: Hail
2022-08-19 14:11:59 INFO SparkContext:57 - Spark configuration:
spark.app.name=Hail
spark.app.startTime=1660918318971
spark.driver.extraClassPath=/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar
spark.driver.extraJavaOptions=-Xss4M
spark.driver.maxResultSize=0
spark.driver.memory=41g
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=10000
spark.dynamicAllocation.minExecutors=1
spark.eventLog.dir=gs://dataproc-temp-us-central1-733952080251-pwl2itzn/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/spark-job-history
spark.eventLog.enabled=true
spark.executor.cores=4
spark.executor.extraClassPath=./hail-all-spark.jar
spark.executor.extraJavaOptions=-Xss4M
spark.executor.instances=2
spark.executor.memory=21840m
spark.executorEnv.OPENBLAS_NUM_THREADS=1
spark.executorEnv.PYTHONHASHSEED=0
spark.extraListeners=com.google.cloud.spark.performance.DataprocMetricsListener
spark.hadoop.hive.execution.engine=mr
spark.hadoop.io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,is.hail.io.compress.BGzipCodecTbi,org.apache.hadoop.io.compress.GzipCodec
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
spark.hadoop.mapreduce.input.fileinputformat.split.minsize=0
spark.history.fs.logDirectory=gs://dataproc-temp-us-central1-733952080251-pwl2itzn/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/spark-job-history
spark.jars=file:/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar
spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator
spark.kryoserializer.buffer.max=1g
spark.logConf=true
spark.master=yarn
spark.metrics.namespace=app_name:${spark.app.name}.app_id:${spark.app.id}
spark.repl.local.jars=file:///opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar
spark.rpc.message.maxSize=512
spark.scheduler.minRegisteredResourcesRatio=0.0
spark.scheduler.mode=FAIR
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.sql.adaptive.enabled=true
spark.sql.autoBroadcastJoinThreshold=163m
spark.sql.catalogImplementation=hive
spark.sql.cbo.enabled=true
spark.sql.cbo.joinReorder.enabled=true
spark.submit.deployMode=client
spark.submit.pyFiles=/tmp/vcf_validation_20220819-1011/hail_scripts.zip
spark.task.maxFailures=20
spark.ui.port=0
spark.ui.showConsoleProgress=false
spark.yarn.am.memory=640m
spark.yarn.dist.jars=file:///opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar
spark.yarn.dist.pyFiles=file:///tmp/vcf_validation_20220819-1011/hail_scripts.zip
spark.yarn.historyServer.address=vcf-validation-m:18080
spark.yarn.isPython=true
spark.yarn.jars=local:/usr/lib/spark/jars/*
spark.yarn.tags=dataproc_hash_69b2d16a-5a69-335e-b202-5381c4fcb4d3,dataproc_job_vcf_validation_20220819-1011,dataproc_master_index_0,dataproc_uuid_c000492d-8218-3275-9eab-7c558c1482c2
spark.yarn.unmanagedAM.enabled=true
2022-08-19 14:11:59 INFO ResourceProfile:57 - Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 4, script: , vendor: , memory -> name: memory, amount: 21840, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2022-08-19 14:11:59 INFO ResourceProfile:57 - Limiting resource is cpus at 4 tasks per executor
2022-08-19 14:11:59 INFO ResourceProfileManager:57 - Added ResourceProfile id: 0
2022-08-19 14:11:59 INFO SecurityManager:57 - Changing view acls to: root
2022-08-19 14:11:59 INFO SecurityManager:57 - Changing modify acls to: root
2022-08-19 14:11:59 INFO SecurityManager:57 - Changing view acls groups to:
2022-08-19 14:11:59 INFO SecurityManager:57 - Changing modify acls groups to:
2022-08-19 14:11:59 INFO SecurityManager:57 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2022-08-19 14:11:59 INFO Utils:57 - Successfully started service 'sparkDriver' on port 34589.
2022-08-19 14:11:59 INFO SparkEnv:57 - Registering MapOutputTracker
2022-08-19 14:11:59 INFO SparkEnv:57 - Registering BlockManagerMaster
2022-08-19 14:11:59 INFO BlockManagerMasterEndpoint:57 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2022-08-19 14:11:59 INFO BlockManagerMasterEndpoint:57 - BlockManagerMasterEndpoint up
2022-08-19 14:11:59 INFO SparkEnv:57 - Registering BlockManagerMasterHeartbeat
2022-08-19 14:11:59 INFO DiskBlockManager:57 - Created local directory at /hadoop/spark/tmp/blockmgr-00cfb18f-30b3-46ea-b6e4-4b5e54d8f485
2022-08-19 14:11:59 INFO MemoryStore:57 - MemoryStore started with capacity 21.7 GiB
2022-08-19 14:11:59 INFO SparkEnv:57 - Registering OutputCommitCoordinator
2022-08-19 14:11:59 INFO log:169 - Logging initialized @5342ms to org.sparkproject.jetty.util.log.Slf4jLog
2022-08-19 14:11:59 INFO Server:375 - jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_312-b07
2022-08-19 14:11:59 INFO Server:415 - Started @5448ms
2022-08-19 14:11:59 INFO AbstractConnector:331 - Started ServerConnector@2e001c12{HTTP/1.1, (http/1.1)}{0.0.0.0:33355}
2022-08-19 14:11:59 INFO Utils:57 - Successfully started service 'SparkUI' on port 33355.
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@5c89ad61{/jobs,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@b5ed6c5{/jobs/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@636b0d01{/jobs/job,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@5f9b6a3a{/jobs/job/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@67ae0a7d{/stages,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@353e6d85{/stages/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@5db6ff34{/stages/stage,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@6da1eb6a{/stages/stage/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@b819c95{/stages/pool,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@4962f1c9{/stages/pool/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@3cb9d0f0{/storage,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@1ca446bf{/storage/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@9650802{/storage/rdd,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@457200aa{/storage/rdd/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@54210dac{/environment,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@134ac3ff{/environment/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@13d39293{/executors,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@574cde82{/executors/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@ac06593{/executors/threadDump,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@4c721bef{/executors/threadDump/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@11e3f5b{/static,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@6ef7ea64{/,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@8bf39b4{/api,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@375d18c{/jobs/job/kill,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@5c41043{/stages/stage/kill,null,AVAILABLE,@Spark}
2022-08-19 14:12:00 INFO SparkUI:57 - Bound SparkUI to 0.0.0.0, and started at http://vcf-validation-m.c.seqr-project.internal:33355
2022-08-19 14:12:00 INFO SparkContext:57 - Added JAR file:/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar at spark://vcf-validation-m.c.seqr-project.internal:34589/jars/hail-all-spark.jar with timestamp 1660918318971
2022-08-19 14:12:00 INFO FairSchedulableBuilder:57 - Creating Fair Scheduler pools from default file: fairscheduler.xml
2022-08-19 14:12:00 INFO FairSchedulableBuilder:57 - Created pool: default, schedulingMode: FAIR, minShare: 0, weight: 1
2022-08-19 14:12:00 INFO Utils:57 - Using initial executors = 2, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
2022-08-19 14:12:00 INFO RMProxy:134 - Connecting to ResourceManager at vcf-validation-m/10.128.0.5:8032
2022-08-19 14:12:00 INFO AHSProxy:42 - Connecting to Application History server at vcf-validation-m/10.128.0.5:10200
2022-08-19 14:12:00 INFO Client:57 - Requesting a new application from cluster with 3 NodeManagers
2022-08-19 14:12:01 INFO Configuration:2795 - resource-types.xml not found
2022-08-19 14:12:01 INFO ResourceUtils:442 - Unable to find 'resource-types.xml'.
2022-08-19 14:12:01 INFO Client:57 - Verifying our application has not requested more than the maximum memory capability of the cluster (48048 MB per container)
2022-08-19 14:12:01 INFO Client:57 - Will allocate AM container, with 1024 MB memory including 384 MB overhead
2022-08-19 14:12:01 INFO Client:57 - Setting up container launch context for our AM
2022-08-19 14:12:01 INFO Client:57 - Setting up the launch environment for our AM container
2022-08-19 14:12:01 INFO Client:57 - Preparing resources for our AM container
2022-08-19 14:12:01 INFO Client:57 - Uploading resource file:/opt/conda/miniconda3/lib/python3.8/site-packages/hail/backend/hail-all-spark.jar -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/hail-all-spark.jar
2022-08-19 14:12:02 INFO Client:57 - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/pyspark.zip
2022-08-19 14:12:03 INFO Client:57 - Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.9-src.zip -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/py4j-0.10.9-src.zip
2022-08-19 14:12:03 INFO Client:57 - Uploading resource file:/tmp/vcf_validation_20220819-1011/hail_scripts.zip -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/hail_scripts.zip
2022-08-19 14:12:04 INFO Client:57 - Uploading resource file:/hadoop/spark/tmp/spark-f9445399-7ff9-4437-a95c-1fceaca753d9/__spark_conf__3459400324650576704.zip -> hdfs://vcf-validation-m/user/root/.sparkStaging/application_1660917791379_0001/__spark_conf__.zip
2022-08-19 14:12:04 INFO SecurityManager:57 - Changing view acls to: root
2022-08-19 14:12:04 INFO SecurityManager:57 - Changing modify acls to: root
2022-08-19 14:12:04 INFO SecurityManager:57 - Changing view acls groups to:
2022-08-19 14:12:04 INFO SecurityManager:57 - Changing modify acls groups to:
2022-08-19 14:12:04 INFO SecurityManager:57 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2022-08-19 14:12:04 INFO Client:57 - Submitting application application_1660917791379_0001 to ResourceManager
2022-08-19 14:12:04 INFO YarnClientImpl:329 - Submitted application application_1660917791379_0001
2022-08-19 14:12:05 INFO Client:57 - Application report for application_1660917791379_0001 (state: ACCEPTED)
2022-08-19 14:12:05 INFO Client:57 -
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1660918324665
final status: UNDEFINED
tracking URL: http://vcf-validation-m:8088/proxy/application_1660917791379_0001/
user: root
2022-08-19 14:12:05 INFO SecurityManager:57 - Changing view acls to: root
2022-08-19 14:12:05 INFO SecurityManager:57 - Changing modify acls to: root
2022-08-19 14:12:05 INFO SecurityManager:57 - Changing view acls groups to:
2022-08-19 14:12:05 INFO SecurityManager:57 - Changing modify acls groups to:
2022-08-19 14:12:05 INFO SecurityManager:57 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2022-08-19 14:12:05 INFO RMProxy:134 - Connecting to ResourceManager at vcf-validation-m/10.128.0.5:8030
2022-08-19 14:12:06 INFO YarnRMClient:57 - Registering the ApplicationMaster
2022-08-19 14:12:06 INFO YarnClientSchedulerBackend:57 - Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> vcf-validation-m, PROXY_URI_BASES -> http://vcf-validation-m:8088/proxy/application_1660917791379_0001), /proxy/application_1660917791379_0001
2022-08-19 14:12:06 INFO ApplicationMaster:57 - Preparing Local resources
2022-08-19 14:12:06 INFO ApplicationMaster:57 -
===============================================================================
Default YARN executor launch context:
env:
SPARK_WORKER_WEBUI_PORT -> 18081
SPARK_ENV_LOADED -> 1
CLASSPATH -> ./hail-all-spark.jar<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>/usr/lib/spark/jars/*<CPS>:/etc/hive/conf:/usr/local/share/google/dataproc/lib/*:/usr/share/java/mysql.jar<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
SPARK_LOG_DIR -> /var/log/spark
SPARK_LOCAL_DIRS -> /hadoop/spark/tmp
SPARK_DIST_CLASSPATH -> :/etc/hive/conf:/usr/local/share/google/dataproc/lib/*:/usr/share/java/mysql.jar
SPARK_USER -> root
SPARK_SUBMIT_OPTS -> -Dscala.usejavacp=true
SPARK_CONF_DIR -> /usr/lib/spark/conf
PYTHONHASHSEED -> 0
SPARK_HOME -> /usr/lib/spark/
PYTHONPATH -> /usr/lib/spark/python/lib/pyspark.zip:/usr/lib/spark/python/lib/py4j-0.10.9-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip<CPS>{{PWD}}/hail_scripts.zip
SPARK_MASTER_PORT -> 7077
OPENBLAS_NUM_THREADS -> 1
SPARK_WORKER_DIR -> /hadoop/spark/work
SPARK_WORKER_PORT -> 7078
SPARK_DAEMON_MEMORY -> 4000m
SPARK_MASTER_WEBUI_PORT -> 18080
SPARK_LIBRARY_PATH -> :/usr/lib/hadoop/lib/native
SPARK_SCALA_VERSION -> 2.12
command:
{{JAVA_HOME}}/bin/java \
-server \
-Xmx21840m \
'-Xss4M' \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.driver.port=34589' \
'-Dspark.ui.port=0' \
'-Dspark.rpc.message.maxSize=512' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
-XX:OnOutOfMemoryError='kill %p' \
org.apache.spark.executor.YarnCoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler@vcf-validation-m.c.seqr-project.internal:34589 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
4 \
--app-id \
application_1660917791379_0001 \
--resourceProfileId \
0 \
--user-class-path \
file:$PWD/__app__.jar \
--user-class-path \
file:$PWD/hail-all-spark.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
__spark_conf__ -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/__spark_conf__.zip" } size: 268110 timestamp: 1660918324542 type: ARCHIVE visibility: PRIVATE
pyspark.zip -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/pyspark.zip" } size: 887063 timestamp: 1660918323141 type: FILE visibility: PRIVATE
py4j-0.10.9-src.zip -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/py4j-0.10.9-src.zip" } size: 41587 timestamp: 1660918323565 type: FILE visibility: PRIVATE
hail_scripts.zip -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/hail_scripts.zip" } size: 110212 timestamp: 1660918323989 type: FILE visibility: PRIVATE
hail-all-spark.jar -> resource { scheme: "hdfs" host: "vcf-validation-m" port: -1 file: "/user/root/.sparkStaging/application_1660917791379_0001/hail-all-spark.jar" } size: 101403518 timestamp: 1660918322562 type: FILE visibility: PRIVATE
===============================================================================
2022-08-19 14:12:06 INFO Utils:57 - Using initial executors = 2, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
2022-08-19 14:12:06 INFO YarnAllocator:57 - Resource profile 0 doesn't exist, adding it
2022-08-19 14:12:06 INFO YarnSchedulerBackend$YarnSchedulerEndpoint:57 - ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM@vcf-validation-m.c.seqr-project.internal:34589)
2022-08-19 14:12:06 INFO YarnAllocator:57 - Will request 2 executor container(s) for ResourceProfile Id: 0, each with 4 core(s) and 24024 MB memory. with custom resources: <memory:24024, vCores:4>
2022-08-19 14:12:06 INFO YarnAllocator:57 - Submitted 2 unlocalized container requests.
2022-08-19 14:12:06 INFO StatsdSink:57 - StatsdSink started with prefix: 'spark.applicationMaster'
2022-08-19 14:12:06 INFO ApplicationMaster:57 - Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
2022-08-19 14:12:06 INFO YarnAllocator:57 - Launching container container_1660917791379_0001_01_000001 on host vcf-validation-w-0.c.seqr-project.internal for executor with ID 1 for ResourceProfile Id 0
2022-08-19 14:12:06 INFO YarnAllocator:57 - Received 1 containers from YARN, launching executors on 1 of them.
2022-08-19 14:12:06 INFO Client:57 - Application report for application_1660917791379_0001 (state: RUNNING)
2022-08-19 14:12:06 INFO Client:57 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.128.0.5
ApplicationMaster RPC port: -1
queue: default
start time: 1660918324665
final status: UNDEFINED
tracking URL: http://vcf-validation-m:8088/proxy/application_1660917791379_0001/
user: root
2022-08-19 14:12:06 INFO YarnClientSchedulerBackend:57 - Application application_1660917791379_0001 has started running.
2022-08-19 14:12:06 INFO Utils:57 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39319.
2022-08-19 14:12:06 INFO NettyBlockTransferService:81 - Server created on vcf-validation-m.c.seqr-project.internal:39319
2022-08-19 14:12:07 INFO BlockManager:57 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2022-08-19 14:12:07 INFO BlockManagerMaster:57 - Registering BlockManager BlockManagerId(driver, vcf-validation-m.c.seqr-project.internal, 39319, None)
2022-08-19 14:12:07 INFO BlockManagerMasterEndpoint:57 - Registering block manager vcf-validation-m.c.seqr-project.internal:39319 with 21.7 GiB RAM, BlockManagerId(driver, vcf-validation-m.c.seqr-project.internal, 39319, None)
2022-08-19 14:12:07 INFO BlockManagerMaster:57 - Registered BlockManager BlockManagerId(driver, vcf-validation-m.c.seqr-project.internal, 39319, None)
2022-08-19 14:12:07 INFO BlockManager:57 - external shuffle service port = 7337
2022-08-19 14:12:07 INFO BlockManager:57 - Initialized BlockManager: BlockManagerId(driver, vcf-validation-m.c.seqr-project.internal, 39319, None)
2022-08-19 14:12:07 INFO StatsdSink:57 - StatsdSink started with prefix: 'spark.driver'
2022-08-19 14:12:07 INFO ServerInfo:57 - Adding filter to /metrics/json: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
2022-08-19 14:12:07 INFO ContextHandler:916 - Started o.s.j.s.ServletContextHandler@21f6be7{/metrics/json,null,AVAILABLE,@Spark}
2022-08-19 14:12:07 INFO YarnAllocator:57 - Launching container container_1660917791379_0001_01_000002 on host vcf-validation-w-1.c.seqr-project.internal for executor with ID 2 for ResourceProfile Id 0
2022-08-19 14:12:07 INFO YarnAllocator:57 - Received 1 containers from YARN, launching executors on 1 of them.
2022-08-19 14:12:07 INFO GoogleCloudStorageImpl:101 - Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
2022-08-19 14:12:08 INFO SingleEventLogFileWriter:57 - Logging events to gs://dataproc-temp-us-central1-733952080251-pwl2itzn/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/spark-job-history/application_1660917791379_0001.inprogress
2022-08-19 14:12:08 INFO Utils:57 - Using initial executors = 2, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
2022-08-19 14:12:08 INFO YarnAllocator:57 - Resource profile 0 doesn't exist, adding it
2022-08-19 14:12:08 INFO SparkContext:57 - Registered listener com.google.cloud.spark.performance.DataprocMetricsListener
2022-08-19 14:12:08 INFO YarnClientSchedulerBackend:57 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2022-08-19 14:12:08 INFO Hail:28 - SparkUI: http://vcf-validation-m.c.seqr-project.internal:33355
Running on Apache Spark version 3.1.2
SparkUI available at http://vcf-validation-m.c.seqr-project.internal:33355
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.85-9b98676b6ad8
LOGGING: writing to /home/hail/hail-20220819-1411-0.2.85-9b98676b6ad8.log
loading time: 0:00:12.392576
2022-08-19 14:12:24 Hail: INFO: Coerced prefix-sorted dataset
2022-08-19 14:12:37 Hail: INFO: Coerced prefix-sorted dataset
2022-08-19 14:12:47 Hail: INFO: Coerced prefix-sorted dataset
Table contains 3 out of 1743 common noncoding variants.
Table contains 235 out of 314 common coding variants.
Sample type validation error: dataset sample-type is specified as WGS but appears to be WGS because it contains many common coding variants
validation time (including contig check): 0:00:41.325795
2022-08-19 14:12:55 Hail: INFO: Coerced prefix-sorted dataset
2022-08-19 14:13:02 Hail: INFO: Coerced prefix-sorted dataset
Table contains 3 out of 1743 common noncoding variants.
Table contains 235 out of 314 common coding variants.
Sample type validation error: dataset sample-type is specified as WGS but appears to be WGS because it contains many common coding variants
validation time (not including contig check): 0:00:14.798183
Job [vcf_validation_20220819-1011] finished successfully.
done: true
driverControlFilesUri: gs://dataproc-6d4ee93a-f906-4bfe-934f-eb1c2c786273-us-central1/google-cloud-dataproc-metainfo/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/jobs/vcf_validation_20220819-1011/
driverOutputResourceUri: gs://dataproc-6d4ee93a-f906-4bfe-934f-eb1c2c786273-us-central1/google-cloud-dataproc-metainfo/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/jobs/vcf_validation_20220819-1011/driveroutput
jobUuid: c000492d-8218-3275-9eab-7c558c1482c2
placement:
clusterName: vcf-validation
clusterUuid: b9aa3cd8-07ab-4a82-a507-2de6db0de4f5
pysparkJob:
args:
- --use-dataproc
mainPythonFileUri: gs://dataproc-6d4ee93a-f906-4bfe-934f-eb1c2c786273-us-central1/google-cloud-dataproc-metainfo/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/jobs/vcf_validation_20220819-1011/staging/validate_vcf.py
pythonFileUris:
- gs://dataproc-6d4ee93a-f906-4bfe-934f-eb1c2c786273-us-central1/google-cloud-dataproc-metainfo/b9aa3cd8-07ab-4a82-a507-2de6db0de4f5/jobs/vcf_validation_20220819-1011/staging/hail_scripts.zip
reference:
jobId: vcf_validation_20220819-1011
projectId: seqr-project
status:
state: DONE
stateStartTime: '2022-08-19T14:13:09.121548Z'
statusHistory:
- state: PENDING
stateStartTime: '2022-08-19T14:11:53.091031Z'
- state: SETUP_DONE
stateStartTime: '2022-08-19T14:11:53.153155Z'
- details: Agent reported job success
state: RUNNING
stateStartTime: '2022-08-19T14:11:53.628725Z'
yarnApplications:
- name: Hail
progress: 1.0
state: FINISHED
trackingUrl: http://vcf-validation-m:8088/proxy/application_1660917791379_0001/
Process finished with exit code 0
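For reference, the counts in the log above (3 of 1743 common noncoding variants, 235 of 314 common coding variants) can be turned into a sample-type guess with a simple threshold rule. This is only a sketch under assumptions: `VARIANT_THRESHOLD = 100` comes from the validation code earlier in this thread, but the function name `infer_sample_type` and the exact decision rule are hypothetical and may not match what the pipeline does.

```python
VARIANT_THRESHOLD = 100  # value from the validation code earlier in this thread


def infer_sample_type(noncoding_matches, coding_matches, threshold=VARIANT_THRESHOLD):
    """Guess the sample type from counts of matched common variants.

    Assumed rule (not confirmed against the pipeline):
      many coding and many noncoding matches -> 'WGS'
      many coding matches only               -> 'WES'
    Returns None when there are too few coding matches to decide.
    """
    has_coding = coding_matches > threshold
    has_noncoding = noncoding_matches > threshold
    if has_coding and has_noncoding:
        return 'WGS'
    if has_coding:
        return 'WES'
    return None


# Counts taken from the job log above.
inferred = infer_sample_type(noncoding_matches=3, coding_matches=235)
```

Under this assumed rule the counts from the log would be classified as WES, since only the coding set clears the threshold.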
I think it's okay. I would be interested to see what the runtime would be with 1 worker and no secondary workers, just to see how it would perform on a single thread. I'd also be curious to know how big the VCF you were validating is (WES vs. WGS, and how many samples).
One thing we can do without hail is header validation, since we are already validating the sample IDs in the header. @mike-w-wilson will provide us with a list of the required info fields in the pipeline and then we can add a check that they are in the file to the validate VCF step.
@ShifaSZ we should validate that the following fields are in the VCF header: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO (containing AC,AN,AF), and FORMAT (containing AD, DP, GQ, GT)
The VCF header includes the meta information for the sub-fields of INFO and FORMAT (e.g., AC, AN, AF of INFO), but there are no separate columns for these sub-fields. Their values appear in the variant rows rather than in the VCF header. Here is an example of the VCF column header and the sub-fields of the INFO and FORMAT fields.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
We will validate the metadata in the header for these subfields but won't validate them in the variant rows. Agree?
Since the impetus for this change was that we got a file missing AF, I do think it is important to check that the required INFO and FORMAT sub-fields are in the VCF. While it's not perfect, I think we should read in the first non-header row of the VCF and parse the FORMAT and INFO fields of just that first row to check if the sub-fields are there. But I agree that we should not try to validate every row.
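A rough sketch of that first-row check, assuming a tab-delimited, uncompressed VCF read as an iterable of lines. The function name `first_row_subfields` is hypothetical; the required sub-field names are the ones listed earlier in this thread.

```python
REQUIRED_INFO = {'AC', 'AN', 'AF'}
REQUIRED_FORMAT = {'AD', 'DP', 'GQ', 'GT'}


def first_row_subfields(vcf_lines):
    """Return (info_keys, format_keys) parsed from the first non-header row."""
    for line in vcf_lines:
        if line.startswith('#') or not line.strip():
            continue  # skip meta-info lines, the #CHROM line, and blank lines
        columns = line.rstrip('\n').split('\t')
        info, fmt = columns[7], columns[8]  # INFO is column 8, FORMAT is column 9
        # INFO entries look like KEY=VALUE, or a bare KEY for flags such as DB
        info_keys = {entry.split('=', 1)[0] for entry in info.split(';')}
        format_keys = set(fmt.split(':'))
        return info_keys, format_keys
    return set(), set()


# Example using the first data row of the sample VCF shown above.
row = '20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\n'
info_keys, format_keys = first_row_subfields(['##fileformat=VCFv4.2\n', row])
missing_info = sorted(REQUIRED_INFO - info_keys)        # ['AC', 'AN']
missing_format = sorted(REQUIRED_FORMAT - format_keys)  # ['AD']
```

This only inspects one row, so a sub-field that happens to be absent from the first variant would be reported as missing even if later rows carry it.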
Do we need to validate the meta information for those sub-fields? The meta info is in the header, on lines starting with ##, and contains the type and number of values for each sub-field, etc. We once received SV VCF data with an incorrect gnomAD AF type.
Can you share an example of what that meta info header looks like? We should definitely not check the values in the VCF themselves to validate the type, but if the meta info is easy to parse from the header it might be nice to validate it
Example of the meta info for the sub-fields of INFO and FORMAT:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
I think validating that there is an entry in the meta info is better than checking the first row of the data. It's up to you whether you want to validate the Type or not.
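A minimal sketch of what checking for meta-info entries could look like, assuming the header is available as a list of text lines and that ID is the first key inside the angle brackets (as in the example above). The names `REQUIRED_META` and `missing_meta_fields` are hypothetical and this is not necessarily how the check was implemented in seqr.

```python
import re

# Required sub-field IDs, taken from the list earlier in this thread.
REQUIRED_META = {
    'INFO': {'AC', 'AN', 'AF'},
    'FORMAT': {'AD', 'DP', 'GQ', 'GT'},
}

# Matches the start of lines such as ##INFO=<ID=AF,Number=A,Type=Float,...>
META_ID_RE = re.compile(r'^##(INFO|FORMAT)=<ID=([^,>]+)')


def missing_meta_fields(header_lines):
    """Return {'INFO': [...], 'FORMAT': [...]} listing required IDs with no meta-info entry."""
    found = {field: set() for field in REQUIRED_META}
    for line in header_lines:
        if not line.startswith('##'):
            break  # meta-info lines end at the #CHROM column header
        match = META_ID_RE.match(line)
        if match:
            found[match.group(1)].add(match.group(2))
    return {
        field: sorted(required - found[field])
        for field, required in REQUIRED_META.items()
        if required - found[field]
    }


# Example using a few of the meta-info lines shown above.
example_header = [
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001',
]
missing = missing_meta_fields(example_header)
```

An empty dict means every required sub-field has a meta-info entry. Validating Type would mean extending the regular expression to capture the `Type=` value as well.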
FORMAT (containing AD, DP, GQ, GT)
Do we need to validate the FORMAT fields based on variant type, e.g., VARIANT or SV?
We only support regular VARIANT VCFs in AnVIL loading; we do not support SVs at all.
Header validation has been added. Moving this back to blocked to track future work that requires hail.
Closing out in favor of - https://github.com/broadinstitute/seqr-private/issues/1290
Add some genome build/ sample type/ chromosome validation at loading request time, instead of having it fail in the pipeline itself.
May need to wait for hail backend/ better ability to quickly run hail in seqr