Closed — SimonLi5601 closed this issue 5 years ago.
It looks like `generate_vds_make_table_arg` assumes that if `is_split_vds` is true, then every array field in the table contains one value per alt allele. Depending on the dataset, that may not be true.

https://github.com/macarthur-lab/hail-elasticsearch-pipelines/blob/92526f1183e0c7054f6b7bb5ae688da6629a4417/hail_scripts/v01/utils/elasticsearch_utils.py#L206-L207

I think you could work around this by passing `is_split_vds=False` to `export_vds_to_elasticsearch`. This would require you to handle any array fields containing per-alt-allele values yourself, instead of relying on `generate_vds_make_table_arg` to do it automatically.
This looks like the same issue as #123. Converting the Set to an Array triggers this downstream issue.
I suspect these are two different issues, though they point to the same (similar) `domains` field. I can see the schema here is still an Array data structure, while the one in #123 was converted to a Set, which caused that problem.
> It looks like `generate_vds_make_table_arg` assumes that if `is_split_vds` is true, then every array field in the table contains one value per alt allele. Depending on the dataset, that may not be true.

hail-elasticsearch-pipelines/hail_scripts/v01/utils/elasticsearch_utils.py, lines 206 to 207 in 92526f1:

```python
if is_split_vds and isinstance(field_type, basestring) and field_type.startswith("Array"):
    expr += "[va.aIndex-1]"
```

> I think you could work around this by passing `is_split_vds=False` to `export_vds_to_elasticsearch`. This would require you to handle any array fields containing per-alt-allele values yourself, instead of relying on `generate_vds_make_table_arg` to do it automatically.
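To make that quoted branch concrete, here is a minimal pure-Python sketch of how it turns each field into a table expression. The `generate_table_expr` helper and field names are illustrative, not the pipeline's actual API, and `basestring` is replaced with `str` since this sketch targets Python 3:

```python
def generate_table_expr(field_name, field_type, is_split_vds=True):
    """Sketch of the quoted check: when the VDS has been split,
    Array-typed fields get indexed by the 1-based alt allele index."""
    expr = "va." + field_name
    # Mirrors: if is_split_vds and ... field_type.startswith("Array")
    if is_split_vds and isinstance(field_type, str) and field_type.startswith("Array"):
        expr += "[va.aIndex-1]"
    return expr

print(generate_table_expr("info.AC", "Array[Int]"))                      # va.info.AC[va.aIndex-1]
print(generate_table_expr("info.AC", "Array[Int]", is_split_vds=False))  # va.info.AC
print(generate_table_expr("rsid", "String"))                             # va.rsid
```

The key point is that the indexing is applied to *every* Array field, whether or not that array actually holds one value per alt allele.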
I don't know what exactly happens here (I mean `va.domains[va.aIndex-1]`; I do see the `domains` field is an Array in the VDS schema). This step didn't throw any exception, but silently returned an empty KeyTable. The exception wasn't thrown until the dataframe tried to show the first 5 rows, which is due to the empty key table.
> And the exception wasn't thrown until the dataframe tried to show the first 5 rows, which is due to the empty key table.
I doubt that's what is actually throwing the exception. An empty dataframe can be shown:
```
~$ pyspark
>>> from pyspark.sql.types import *
>>> field = [StructField("field1", StringType(), True)]
>>> schema = StructType(field)
>>> df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> df.count()
0
>>> df.show(5)
+------+
|field1|
+------+
+------+
```
Spark transformations are lazy, so these expressions aren't actually run until something (like `show`) requires their results. So an index out of bounds error won't show up when `select` or `annotate` is first called, since the select/annotate expression isn't evaluated immediately.
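The same deferred-failure pattern can be reproduced in plain Python with generator expressions, which are lazy in the same way Spark transformations are (the data here is made up for illustration):

```python
data = [[20], [5, 10]]

# Building the lazy pipeline succeeds even though index 2 is out of
# bounds for every row -- nothing has been evaluated yet.
lazy = (row[2] for row in data)

# The IndexError only surfaces once results are actually requested,
# just as Spark only fails when an action like show() runs the plan.
try:
    list(lazy)
except IndexError:
    print("error deferred until the pipeline was consumed")
```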
> I don't know what exactly happens here (I mean `va.domains[va.aIndex-1]`; I do see the `domains` field is an Array in the VDS schema)
A VDS imported from a VCF may contain multi-allelic variants. In addition, some info fields may contain an array with one value per alt allele.
| chrom | pos | alleles | info.AC |
|---|---|---|---|
| 1 | 100 | ["A", "C"] | [20] |
| 1 | 200 | ["C", "T", "G"] | [5, 10] |
`split_multi` splits those multi-allelic variants into multiple rows, but it doesn't split other fields. So after `split_multi`, that example would look like:
| chrom | pos | alleles | info.AC | aIndex | wasSplit |
|---|---|---|---|---|---|
| 1 | 100 | ["A", "C"] | [20] | 1 | false |
| 1 | 200 | ["C", "T"] | [5, 10] | 1 | true |
| 1 | 200 | ["C", "G"] | [5, 10] | 2 | true |
That's where the `va.fields[va.aIndex - 1]` annotations come in. They are used to filter the fields containing per-alt-allele values down to the one value for that row's variant. So replacing `info.AC` with `info.AC[va.aIndex - 1]` results in:
| chrom | pos | alleles | info.AC |
|---|---|---|---|
| 1 | 100 | ["A", "C"] | 20 |
| 1 | 200 | ["C", "T"] | 5 |
| 1 | 200 | ["C", "G"] | 10 |
So if your original dataset contains an array field (`domains`) that doesn't have one value per alt allele, attempting to read `domains[va.aIndex - 1]` would result in an index out of bounds error.
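Here is a small Python sketch of the walkthrough above (the row values are illustrative): a field with one value per alt allele indexes safely with `aIndex`, while an ordinary array like `domains` can go out of bounds:

```python
# Rows as they would look after split_multi (aIndex is 1-based).
rows = [
    {"pos": 100, "aIndex": 1, "AC": [20],    "domains": ["dom1"]},
    {"pos": 200, "aIndex": 1, "AC": [5, 10], "domains": ["dom1"]},
    {"pos": 200, "aIndex": 2, "AC": [5, 10], "domains": ["dom1"]},
]

# AC has one value per alt allele, so AC[aIndex - 1] is always in bounds.
acs = [row["AC"][row["aIndex"] - 1] for row in rows]
print(acs)  # [20, 5, 10]

# domains is not per-alt-allele, so the same indexing fails when aIndex
# exceeds the array length -- the IndexOutOfBoundsException seen here.
try:
    [row["domains"][row["aIndex"] - 1] for row in rows]
except IndexError:
    print("index out of bounds for domains")
```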
I agree with Nick that this is one thing that could cause the error. For load_clinvar_to_es specifically, I was able to reproduce the error with the current ClinVar release downloaded by the script.

When I then changed this line from

```
.map( x => x.domains)
```

back to

```
.map( x => x.domains.map( domain => domain.db+":"+domain.name ) )
```

it got rid of the error and ran to completion successfully.

@hanars @nawatts What do you think - should we go back to `.map( x => x.domains.map( domain => domain.db+":"+domain.name ) )` or fix this another way?
Hi Ben, sorry I only just had time to verify this. Your fix did solve the issue in #123, but for this specific load_dataset_to_es script it gives a different complaint. No doubt we will move to 0.2 in the future, so I'm not sure whether this is worth your team's time to look into. BTW, I haven't tried the workaround mentioned above (`is_split_vds=False`).
```
Hail version: 0.1-105a497
Error summary: HailException: `String' has no field or method `db'
  :3: .map( x => x.domains.map( domain => domain.db+":"+domain.name ))
```
> @hanars @nawatts What do you think - should we go back to `.map( x => x.domains.map( domain => domain.db+":"+domain.name ) )` or fix this another way?
@knguyen142's comment on #131 identified the issue with the ClinVar script.
`get_expr_for_vep_protein_domains_set` was originally designed to be called on the unaltered VEP transcript consequences, not on the sorted transcript consequences created by `get_expr_for_vep_sorted_transcript_consequences_array`. That was changed in #87, which updated the call in load_dataset_to_es.py but not the call in load_clinvar_to_es.py.
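For intuition, here is a minimal Python sketch (the domain values are made up) of why applying the `db`/`name` formatting a second time, to values that are already formatted strings, produces a "`String' has no field or method `db'"-style error:

```python
# Raw VEP consequences carry domains as structs with `db` and `name` fields.
raw_domains = [{"db": "Pfam_domain", "name": "PF00001"}]

# First pass: format each struct into a "db:name" string.
formatted = [d["db"] + ":" + d["name"] for d in raw_domains]
print(formatted)  # ['Pfam_domain:PF00001']

# Second pass over already-formatted values: each element is now a plain
# string, so the struct-style access fails -- analogous to Hail's
# "`String' has no field or method `db'".
try:
    [d["db"] + ":" + d["name"] for d in formatted]
except TypeError:
    print("plain strings have no `db` field")
```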
@bw2 I think my issue came from the VCF file I used before, and @nawatts was right: after I moved away from that VCF, I don't have the issue any more. But I believe the ClinVar script is a different issue, as @bw2 has already reproduced it. I will close this one. Thanks for all the help!
I got a general out-of-index error in step2_export_to_elasticsearch, as below:

```
Caused by: java.lang.IndexOutOfBoundsException: 0
  File "/opt/seqr/hail-elasticsearch-pipelines/hail_scripts/v01/utils/elasticsearch_client.py", line 373, in export_kt_to_elasticsearch
    kt.to_dataframe().show(n=5)
```

I debugged it a little and found that this line returns an empty KeyTable. Digging further, I found that `domains = va.domains[va.aIndex-1]` in `site_fields_list` prevents the application from generating a KeyTable, and thus `df` is empty as well. I had to manipulate the list to skip this field, and it then loaded successfully. I believe `split_multi()` has already been run, since I saw the message "Hail: WARN: called redundant split on an already split VDS" somewhere else. Any thoughts on this issue?