broadinstitute / seqr-loading-pipelines

Hail-based pipelines for annotating variant callsets and exporting them to Elasticsearch
MIT License

Issue in load_dataset_to_es in v0.1 #127

Closed · SimonLi5601 closed this issue 5 years ago

SimonLi5601 commented 5 years ago

I got an index-out-of-bounds error in step2_export_to_elasticsearch, as below:

```
Caused by: java.lang.IndexOutOfBoundsException: 0
  File "/opt/seqr/hail-elasticsearch-pipelines/hail_scripts/v01/utils/elasticsearch_client.py", line 373, in export_kt_to_elasticsearch
    kt.to_dataframe().show(n=5)
```

I debugged it a little and found that this line returns an empty KeyTable. Digging further, I found that domains = va.domains[va.aIndex-1] in site_fields_list prevents the application from generating a KeyTable, so the df ends up empty as well. I had to manipulate the list to skip this field, and then the dataset loaded successfully. I believe split_multi() has already been run, since I saw the message "Hail: WARN: called redundant split on an already split VDS" elsewhere. Any thoughts on this issue?
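
A rough sketch of the skip I mean, assuming site_fields_list is a plain list of expression strings as in the v01 scripts:

```python
# Illustrative only: the real site_fields_list lives in the loading script.
site_fields_list = [
    "chrom = v.contig",
    "domains = va.domains[va.aIndex-1]",  # the entry that triggers the error
]

# Skip the problematic field so the KeyTable can be generated.
site_fields_list = [f for f in site_fields_list if not f.startswith("domains")]
```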

nawatts commented 5 years ago

It looks like generate_vds_make_table_arg assumes that if is_split_vds is true, then every array field in the table contains one value per alt allele. Depending on the dataset, that may not be true. https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/92526f1183e0c7054f6b7bb5ae688da6629a4417/hail_scripts/v01/utils/elasticsearch_utils.py#L206-L207

I think you could work around this by passing is_split_vds=False to export_vds_to_elasticsearch. This would require you to handle any array fields containing per-alt allele values yourself instead of relying on generate_vds_make_table_arg to do it automatically.
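
A hedged sketch of that workaround in Hail 0.1 syntax (the specific field rewritten below, va.info.AC, is illustrative, and the remaining export arguments are omitted):

```python
# Manually reduce the fields that really are per-alt-allele, then export with
# is_split_vds=False so generate_vds_make_table_arg doesn't append
# "[va.aIndex-1]" to every Array field (including domains).
vds = vds.annotate_variants_expr([
    "va.info.AC = va.info.AC[va.aIndex - 1]",  # example per-allele field
])
export_vds_to_elasticsearch(
    vds,
    is_split_vds=False,  # other required arguments omitted in this sketch
)
```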

bw2 commented 5 years ago

This looks like the same issue as https://github.com/macarthur-lab/hail-elasticsearch-pipelines/issues/123 . Converting the Set to an Array triggers this downstream issue.

SimonLi5601 commented 5 years ago

> This looks like the same issue as #123. Converting the Set to an Array triggers this downstream issue.

I suspect these are two different issues, though they both point to the same (similar) domains field. The schema here is still an Array data structure; in #123 it had been converted to a Set, which caused that problem.

SimonLi5601 commented 5 years ago

> It looks like generate_vds_make_table_arg assumes that if is_split_vds is true, then every array field in the table contains one value per alt allele. Depending on the dataset, that may not be true.
>
> hail-elasticsearch-pipelines/hail_scripts/v01/utils/elasticsearch_utils.py, lines 206 to 207 in 92526f1:
>
> ```python
> if is_split_vds and isinstance(field_type, basestring) and field_type.startswith("Array"):
>     expr += "[va.aIndex-1]"
> ```
>
> I think you could work around this by passing is_split_vds=False to export_vds_to_elasticsearch. This would require you to handle any array fields containing per-alt allele values yourself instead of relying on generate_vds_make_table_arg to do it automatically.

I don't know exactly what happens here (I mean va.domains[va.aIndex-1]; I do see the domains field is an Array in the VDS schema). This step didn't throw any exception but silently returned an empty KeyTable. The exception wasn't thrown until the dataframe tried to show the first 5 rows, which is due to the empty key table.

nawatts commented 5 years ago

> The exception wasn't thrown until the dataframe tried to show the first 5 rows, which is due to the empty key table.

I doubt that's what is actually throwing the exception. An empty dataframe can be shown.

```
~$ pyspark
>>> from pyspark.sql.types import *
>>> field = [StructField("field1", StringType(), True)]
>>> schema = StructType(field)
>>> df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> df.count()
0
>>> df.show(5)
+------+
|field1|
+------+
+------+
```

Spark transformations are lazy, so these expressions aren't actually run until something (like show) requires their results. That's why an index-out-of-bounds error won't show up when select or annotate is first called: the select/annotate expression isn't evaluated immediately.
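
A minimal, self-contained illustration of that deferral, using plain PySpark RDDs rather than Hail (so the exact exception type differs):

```python
from pyspark import SparkContext

sc = SparkContext("local[1]", "lazy-eval-demo")
rdd = sc.parallelize([[20], [5, 10]])

# Defining a transformation that indexes past the end of each list
# succeeds; nothing is evaluated yet.
mapped = rdd.map(lambda xs: xs[2])

# The error only surfaces when an action forces evaluation.
try:
    mapped.collect()
except Exception as e:
    print("raised only at action time:", type(e).__name__)
```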

> I don't know exactly what happens here (I mean va.domains[va.aIndex-1]; I do see the domains field is an Array in the VDS schema)

A VDS imported from a VCF may contain multi-allelic variants. In addition, some info fields may contain an array with one value per alt allele.

| chrom | pos | alleles | info.AC |
| --- | --- | --- | --- |
| 1 | 100 | ["A", "C"] | [20] |
| 1 | 200 | ["C", "T", "G"] | [5, 10] |

split_multi splits those multi-allelic variants into multiple rows, but it doesn't split other fields. So after split_multi, that example would look like:

| chrom | pos | alleles | info.AC | aIndex | wasSplit |
| --- | --- | --- | --- | --- | --- |
| 1 | 100 | ["A", "C"] | [20] | 1 | false |
| 1 | 200 | ["C", "T"] | [5, 10] | 1 | true |
| 1 | 200 | ["C", "G"] | [5, 10] | 2 | true |

That's where the va.fields[va.aIndex - 1] annotations come in. They are used to filter the fields containing per alt-allele values down to the one value for that row's variant. So replacing info.AC with info.AC[va.aIndex - 1] results in:

| chrom | pos | alleles | info.AC |
| --- | --- | --- | --- |
| 1 | 100 | ["A", "C"] | 20 |
| 1 | 200 | ["C", "T"] | 5 |
| 1 | 200 | ["C", "G"] | 10 |

So if your original dataset contains an array field (domains) that doesn't have one value per alt-allele, attempting to read domains[va.aIndex - 1] would result in an index out of bounds error.
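
A pure-Python sketch of that failure (values hypothetical; in Hail 0.1 the same read surfaces as the JVM IndexOutOfBoundsException above):

```python
# After split_multi, aIndex is 1-based per alt allele.
domains = ["Pfam:PF00001"]   # one entry total, NOT one entry per alt allele
aIndex = 2                   # second alt allele of a multi-allelic site

# The automatic rewrite to domains[va.aIndex - 1] then reads past the end:
try:
    print(domains[aIndex - 1])
except IndexError as e:
    print("out of bounds, as in the reported failure:", e)
```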

bw2 commented 5 years ago

I agree with Nick; that would be one thing that could cause this error. For load_clinvar_to_es specifically, I was able to reproduce the error with the current ClinVar release downloaded by the script.

When I then changed this line

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/hail_scripts/v01/utils/computed_fields.py#L330

from `.map( x => x.domains )` back to `.map( x => x.domains.map( domain => domain.db+":"+domain.name ) )`, the error went away and the script ran to completion.

@hanars @nawatts What do you think - should we go back to `.map( x => x.domains.map( domain => domain.db+":"+domain.name ) )` or fix this another way?

SimonLi5601 commented 5 years ago

> I agree with Nick; that would be one thing that could cause this error. For load_clinvar_to_es specifically, I was able to reproduce the error with the current ClinVar release downloaded by the script.
>
> When I then changed this line
>
> https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/master/hail_scripts/v01/utils/computed_fields.py#L330
>
> from `.map( x => x.domains )` back to `.map( x => x.domains.map( domain => domain.db+":"+domain.name ) )`, the error went away and the script ran to completion.
>
> @hanars @nawatts What do you think - should we go back to `.map( x => x.domains.map( domain => domain.db+":"+domain.name ) )` or fix this another way?

Hi Ben, sorry, I only just had time to verify this. Your fix did solve the issue in #123, but for this specific load_dataset_to_es script it gives a different complaint. No doubt we will move to 0.2 in the future, so I'm not sure whether this is worth your team's time to look into. BTW, I haven't tried the workaround mentioned above (is_split_vds=False).

```
Hail version: 0.1-105a497
Error summary: HailException: `String' has no field or method `db'
  :3: .map( x => x.domains.map( domain => domain.db+":"+domain.name ))
```

nawatts commented 5 years ago

> @hanars @nawatts What do you think - should we go back to `.map( x => x.domains.map( domain => domain.db+":"+domain.name ) )` or fix this another way?

@knguyen142's comment on #131 identified the issue with the ClinVar script.

get_expr_for_vep_protein_domains_set was originally designed to be called on the unaltered VEP transcript consequences, not the sorted transcript consequences created by get_expr_for_vep_sorted_transcript_consequences_array. That was changed in #87, which updated the call in load_dataset_to_es.py but not the call in load_clinvar_to_es.py.
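
A hypothetical plain-Python illustration of that mismatch, consistent with the error Simon posted (the exact VEP field shapes here are assumptions):

```python
# Unaltered VEP transcript consequences: domains are structs with db/name,
# so domain.db + ":" + domain.name works.
raw = {"domains": [{"db": "Pfam", "name": "PF00001"}]}
print([d["db"] + ":" + d["name"] for d in raw["domains"]])  # ['Pfam:PF00001']

# Sorted transcript consequences: domains were already flattened to "db:name"
# strings, so struct-style access is the analogue of
# `String' has no field or method `db'.
flattened = {"domains": ["Pfam:PF00001"]}
try:
    flattened["domains"][0]["db"]
except TypeError as e:
    print("fails on the flattened form:", e)
```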

@bw2

SimonLi5601 commented 5 years ago

I think the issue came from the VCF file I was using before, and @nawatts was right. After I moved away from that VCF, I don't have the issue any more. But I believe the ClinVar script is a different issue, as @bw2 has already reproduced it. I will close this one. Thanks for all the help!