bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Data Sets #2043

Closed ssabnis closed 6 years ago

ssabnis commented 6 years ago

Hello,

I am trying to use ADAM on infrastructure that I have built, and I am looking for a large data set that I can process with ADAM. Are there any open data sets available?

Any info much appreciated.

Thanks

heuermh commented 6 years ago

Hello @ssabnis! You'll have to be a bit more specific: human variation? Cancer? Metagenomics?

ssabnis commented 6 years ago

Hi @heuermh,

I appreciate your quick response, thank you.

I am looking for any large data set to performance-test ADAM on the analytics infrastructure that I have built. This setup is going to be used by biotech companies. Can you recommend data sets that I can use? I am thinking of cancer data.

Thanks

jondeaton commented 6 years ago

Hello @ssabnis

You might consider using one of the standard human genomes made available by the Genome in a Bottle project. I have been using NA12878 for some of my own performance benchmarking of ADAM. You can obtain the whole-genome sequencing data from:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/10XGenomics

The 95 GB BAM file "NA12878_phased_possorted_bam.bam" contains the sequencing records. Although this isn't "cancer data", it is a whole-genome sequencing run of a human genome.

ssabnis commented 6 years ago

Thanks a lot @jondeaton, this is a great help. I will start with this. If you know of any larger files, around 1 TB, that would help too.

ssabnis commented 6 years ago

@jondeaton do I need to uncompress the .gz file to run ADAM?

heuermh commented 6 years ago

> do I need to uncompress the .gz file to run ADAM?

No, but unless the .gz file is in blocked gzip format (BGZF), read performance will not scale with the number of executors, because regular gzip is not splittable.
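
To tell which kind of .gz file you have before launching a job, you can inspect the gzip header. The following is a minimal sketch, not part of ADAM: the function name `is_bgzf` is hypothetical, and it assumes the BGZF layout described in the SAM/BAM specification, where each block is a gzip member whose extra field carries a `BC` subfield of length 2. It only looks at the first block of the file.

```python
import struct


def is_bgzf(path):
    """Return True if the file at `path` starts with a BGZF block.

    Hypothetical helper: checks only the first gzip member, assuming
    the BGZF header layout from the SAM/BAM specification.
    """
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12:
            return False
        # Fixed gzip header fields (RFC 1952) followed by XLEN,
        # the length of the optional extra field.
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack(
            "<BBBBIBBH", header
        )
        # Must be gzip (1f 8b), deflate (CM=8), with FEXTRA (bit 2) set.
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not (flg & 0x04):
            return False
        extra = f.read(xlen)

    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False
```

If this returns False for a large .gz input, recompressing it with a BGZF-aware tool before reading it in Spark is likely to help, since a plain gzip stream has to be read by a single task.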

heuermh commented 6 years ago

You might also want to take a look at the datasets referenced in https://github.com/bcbio/bcbio_validations

Many different applications are discussed there.

heuermh commented 6 years ago

Closing as resolved.