bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Timeout waiting for connection from pool for 1000 genomes vcf on AWS #1951

Open akmorrow13 opened 6 years ago

akmorrow13 commented 6 years ago

val x = sc.loadGenotypes("s3a://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz")

generates the error Unable to execute HTTP request: Timeout waiting for connection from pool with net.fnothaft:jsr203-s3a:0.0.2.

The error reproduces with both Hadoop-BAM 7.9.2 and 7.9.1.
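
A possible mitigation worth trying first (untested here; fs.s3a.connection.maximum is a standard Hadoop S3A property, not anything ADAM-specific): this error usually means the S3A HTTP connection pool is exhausted, so raising the pool size on the SparkContext's Hadoop configuration before loading may help.

// Untested sketch: enlarge the S3A connection pool before the read.
// The default is small (15 in Hadoop 2.x), which a wide read can exhaust.
sc.hadoopConfiguration.set("fs.s3a.connection.maximum", "100")

val x = sc.loadGenotypes("s3a://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz")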

fnothaft commented 6 years ago

Sigh, I am seeing this too...

akmorrow13 commented 6 years ago

@fnothaft how are you running? Are you on EMR, or through Toil on standard AWS instances? Apparently EMR dropped support for s3a. However, I can still loadAlignments from s3a, just not VCFs. Fortunately, s3 works just fine for VCFs (but is slow).
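
A minimal sketch of that workaround, using the same path as in the report (only the scheme changes):

// Same load as above, but over s3:// (EMRFS) instead of s3a://;
// this completes for VCFs on EMR, though noticeably slower.
val x = sc.loadGenotypes("s3://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz")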

heuermh commented 6 years ago

> Apparently EMR dropped support for s3a.

When did that happen? Was it at a specific EMR version?

> Fortunately, s3 works just fine for VCFs (but is slow)

Practically, Conductor is still a good solution for s3 → HDFS, and is faster than s3-dist-cp. Conductor can't upload directories of Parquet+Avro from HDFS → s3, though, so you'd need to fall back to s3-dist-cp for that.

akmorrow13 commented 6 years ago

I'm not sure when s3a support was dropped. @delagoya may know more, as they were my informant.

fnothaft commented 6 years ago

Are you able to use s3n?
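
i.e., the same call with just the scheme swapped (a hypothetical check, untested; note that s3n is deprecated in newer Hadoop releases):

// Hypothetical check: does the legacy s3n:// connector avoid the pool timeout?
val x = sc.loadGenotypes("s3n://1000genomes/phase1/analysis_results/integrated_call_sets/ALL.chr17.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz")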

delagoya commented 6 years ago

I am checking with the EMR team about which URL schemes are supported.

dstockstad commented 6 years ago

Was just passing through. Hopefully everyone has seen this page but linking just in case: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

Interesting that s3:// on EMR is slower than s3a://, considering EMRFS (EMR's proprietary S3 implementation) is one of its selling points. You might be able to use s3a URLs consistently by setting the following parameters:

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  <description>The implementation class of the S3A Filesystem</description>
</property>

<property>
  <name>fs.AbstractFileSystem.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3A</value>
  <description>The implementation class of the S3A AbstractFileSystem.</description>
</property>

Link: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

This is all untested, but I might give it a whirl when I get a moment and post results here if I can get it working.
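
If editing core-site.xml up front isn't convenient, the same two properties could in principle be set programmatically on the SparkContext's Hadoop configuration before any s3a:// access (equally untested here; the class names are the ones from the XML above):

// Untested sketch: set the S3A implementation classes at runtime instead of
// in core-site.xml. Must run before the first s3a:// read.
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")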

heuermh commented 6 years ago

@dstockstad Thanks for the note! Where do those properties need to be specified?

dstockstad commented 6 years ago

You'll want to follow the instructions here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html

The settings go into the core-site classification, so something like this:

[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
      "fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A",
    }
  }
]

Keep in mind that I still haven't actually verified this, so I can't say for sure whether it will work; it might also need additional configuration.