ssabnis closed this issue 6 years ago
Hello @ssabnis! You'll have to be a bit more specific: human variation? Cancer? Metagenomics?
Hi @heuermh,
I appreciate your quick response, thank you.
I am looking for a large data set to performance-test ADAM on the analytics infrastructure that I have built. This setup is going to be used by biotech companies. Can you recommend data sets that I can use? I am thinking of cancer data.
Thanks
Hello @ssabnis
You might consider using one of the standard human genome reference samples made available by the Genome in a Bottle (GIAB) project. I have been using NA12878 for some of my own performance benchmarking of ADAM. You can obtain the whole genome sequencing data from:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/10XGenomics
The 95 GB BAM file "NA12878_phased_possorted_bam.bam" contains the sequencing records. Although this isn't "cancer data", it is a whole genome sequencing run of a human genome.
Thanks a lot @jondeaton, this is a great help. I will start with this. If you know of any larger files, around 1 TB, that would help too.
@jondeaton do I need to uncompress the .gz file to run ADAM?
> do I need to uncompress the .gz file to run ADAM?
No, but unless the .gz file is in blocked gzip format (BGZF), read performance will not scale with the number of executors, because regular gzip is not splittable.
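If it is unclear whether a given .gz file is BGZF or regular gzip, the file header can be inspected before launching a job. The following is a minimal sketch, not part of ADAM: the function name `is_bgzf` is hypothetical, and it only checks the first block, on the assumption that a file starting with a BGZF block is BGZF throughout. It relies on the BGZF convention that each block is a gzip member whose extra field carries a `BC` subfield.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF block header (hypothetical helper)."""
    with open(path, "rb") as fh:
        # Fixed gzip header: ID1 ID2 CM FLG MTIME(4) XFL OS, then XLEN(2) if FEXTRA is set
        header = fh.read(12)
        if len(header) < 12:
            return False
        # Require gzip magic bytes, the DEFLATE method, and the FEXTRA flag bit
        if header[0:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = fh.read(xlen)

    # Walk the extra subfields looking for 'BC' with a 2-byte payload (the block size)
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

Calling `is_bgzf("reads.fastq.gz")` (a placeholder file name) on a regular gzip file should return `False`; in that case one option is to recompress it with a BGZF-aware tool such as htslib's `bgzip`. BAM files are already BGZF-compressed, so this check mainly matters for text formats such as FASTQ or VCF.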
You might also want to take a look at the datasets referenced in https://github.com/bcbio/bcbio_validations.
Many different applications are discussed there.
Closing as resolved.
Hello,
I am trying to use ADAM on infrastructure that I have built, and I am looking for a large data set to process with it. Are there any open data sets available?
Any info is much appreciated.
Thanks