ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

How to support Variant calling #145

Closed. skeenan closed this issue 9 years ago

skeenan commented 10 years ago

Proposed topic for ASHG: How to support variant calling with the API?

ekg commented 10 years ago

If the API can provide a data stream that can be converted into BAM on the fly, most existing variant callers could run on it natively.

lh3 commented 10 years ago

+1 to @ekg. It would be good if we could get a BAM stream directly from the API. A compressed BAM stream should be smaller than the JSON responses by 5-10 fold.

pgrosu commented 10 years ago

This is basically the standard pipelining that I've been suggesting in several posts, but does the FileFormat task team have any preferences? This is to ensure that all the pieces fit together among the teams.

fnothaft commented 10 years ago

I think the discussion here should be more closely tied to #149. Getting a variant caller to run on top of the API should be fairly simple (as @ekg points out, go API->BAM), but doesn't really buy us much unless we're just calling variants on small regions.

Here's a situation where I think we could make an impact:

Essentially, I want to define an analysis pipeline, ship the pipeline to where the data is, run it in situ, and then ship the gVCFs back for final joint calling. This application isn't a good fit for a REST API, but conceptually, we can leverage our consistent data formats to run directly on top of read stores. If we want to do this, we would need to tackle a lot of questions on the execution front, which is why I think this issue is intimately tied to #149.

maximilianh commented 10 years ago

Yes, we could use an API2Bam converter.

fnothaft commented 10 years ago

Perhaps we should rephrase the question that this issue brings up. For me, this evokes more of a meta-issue, asking what direction we see for the API. Specifically, what do we want the API to support? Currently, we've got a data search and access API, which works well for UI driven applications like visualization. However, would we want an API that is more compute geared? I.e., should we enable people to perform geographically distributed processing on top of data stored in GA4GH-compliant stores? If so, is our current API a good fit? If it isn't, how do we revise the API to move forward?

If we don't want to enable geographically distributed processing, a GA4GH REST API-to-BAM shim works just fine if people want to query for small slices of data, and then process it locally using their current pipelines. I think that's a good first step, but I think it's a lackluster long term goal.

In my mind, joint variant calling is just an application that demonstrates the utility of geographically distributed processing on top of this API. The REST API we've designed is not the best approach for this sort of processing. The performance of REST improves if you're running on an incredibly fast network and the cost of moving data over a wire is cheap, but REST still adds protocol and ser/de latency to any I/O operation. I've discussed this a bit with @cassiedoll and @massie, and we've thrown around a few approaches.

Should we choose to tackle this more ambitious goal, there's a large API design space that exists in a continuum between more efficient/less backwards compatible and less efficient/more backwards compatible. There are also a lot of external issues to tackle (e.g., reproducibility, billing, etc.), but I think of this issue as more of "what should the API for executing computation look like?" and #149 as more of "what does the execution environment track?".

ekg commented 10 years ago

@fnothaft Essentially all small variant calling is done in relatively short regions. Right now there is no pattern which lets me efficiently query a remote alignment set and produce a stream of alignments in a particular region. The current approach requires the local download of a BAM index and a high-latency seek on a remote file. If there were a supported and efficient mechanism to get a BAM file for a particular region from the API, I could employ existing techniques to run distributed joint calling or rapid-turnaround, highly-parallel variant calling. If the API could just provide this I would be very grateful.

So I completely disagree that it doesn't get us much. This would be a huge win.

And moreover, I think we should be cautious about trying to go much further, because although there is a lot of assertion about the right way to do things in this space, we are still trying to figure things out.

For example, gVCF appears to suffer from the same problem of the now-deprecated ReduceReads BAMs of sensitivity to non-SNP variants embedded in what might appear to be reference-matching runs of homozygosity. When we consider only a single sample, such regions could appear not to have variation (e.g. large indels) which would be apparent if we merge a lot of samples together (e.g. one sample has clear signal via alignment/assembly, but many others actually have read data supporting the variant). There does not appear to be any mechanism to exchange this information between samples if the distributed calling is done in a single pass as with ReduceReads and gVCF. For SNPs in deep samples this is probably not an issue, but other classes of variation as well as shallower samples seem likely to have issues. The only sure way to maintain sensitivity when doing distributed joint calling is via two passes over the data, the first to obtain candidate variants and the second to genotype them in all samples. This would imply a mechanism to share external data with the server.

My point in this discussion is to remind us that we should be careful about trying to enable a specific use for the API that is embedded in a particular algorithmic approach. The question we should ask here is simply how to expose the data for maximal usefulness to the community. I think it is absolutely clear that we need to be able to convert back and forth with existing designs, or we will turn a lot of high-quality methods into legacy software and risk the usefulness of the API.

fnothaft commented 10 years ago

@ekg

@fnothaft Essentially all small variant calling is done in relatively short regions.

I'm aware of that. My point is that for calling a whole WGS sample:

Right now there is no pattern which lets me efficiently query a remote alignment set and produce a stream of alignments in a particular region. The current approach requires the local download of a BAM index and a high-latency seek on a remote file. If there were a supported and efficient mechanism to get a BAM file for a particular region from the API, I could employ existing techniques to run distributed joint calling or rapid-turnaround, highly-parallel variant calling. If the API could just provide this I would be very grateful.

Our API doesn't actually fix this problem. Yes, you are able to get shards from the API, but if you do GA4GH-API-for-shards + a GA4GH-to-BAM shim:

  1. Our API is still going to be very slow: going across a network, speaking REST, pulling data off disk at the server, serializing data into GA4GH format, coming back, and de-serializing data from GA4GH format into BAM.
  2. If you want shards at a size similar to what people are usually using for small window calling (let's say 10kb, which is larger than reality, IIRC), you need to request ~300k shards, which is not going to work well.
  3. REST is a cruddy serialization format; per @lh3's estimate above (GA4GH JSON is 5-10x larger than compressed BAM), a 60x WGS sample moved via the GA4GH REST API is 1.125-2.25 TB of data moved over the wire.
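
(Back-of-the-envelope for that last number, assuming a roughly 225 GB compressed BAM for a 60x WGS sample, which is an assumption added here: 5 × 225 GB = 1.125 TB and 10 × 225 GB = 2.25 TB.)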

And moreover, I think we should be cautious about trying to go much further, because although there is a lot of assertion about the right way to do things in this space, we are still trying to figure things out.

For example, gVCF appears to suffer from the same problem of the now-deprecated ReduceReads BAMs of sensitivity to non-SNP variants embedded in what might appear to be reference-matching runs of homozygosity. When we consider only a single sample, such regions could appear not to have variation (e.g. large indels) which would be apparent if we merge a lot of samples together (e.g. one sample has clear signal via alignment/assembly, but many others actually have read data supporting the variant). There does not appear to be any mechanism to exchange this information between samples if the distributed calling is done in a single pass as with ReduceReads and gVCF. For SNPs in deep samples this is probably not an issue, but other classes of variation as well as shallower samples seem likely to have issues. The only sure way to maintain sensitivity when doing distributed joint calling is via two passes over the data, the first to obtain candidate variants and the second to genotype them in all samples. This would imply a mechanism to share external data with the server.

No disagreement; that being said, it is reasonable to implement that data pattern without requiring all n samples to sit locally on a single server. You'd have to do an all-to-all transmission of the seen non-reference genotypes before doing a second pass and emitting a final gVCF, but that's (back-of-the-envelope) going to be much smaller than moving all of the gVCFs.
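
A minimal sketch of that two-pass pattern, with hypothetical names and no claim about any particular caller's implementation (pass 2 is only described in comments):

import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Pass 1 discovers candidate variant positions at each site; only those
// positions are exchanged all-to-all (far smaller than shipping gVCFs);
// pass 2 then re-genotypes every sample at the union of candidates.
final class TwoPassJointCallingSketch {

    // Pass 1: collect candidate variant positions discovered at one site.
    static Set<Long> discoverCandidates(Iterable<Long> localVariantPositions) {
        Set<Long> candidates = new TreeSet<>();
        localVariantPositions.forEach(candidates::add);
        return candidates;
    }

    // Exchange: union the per-site candidate sets (the "all-to-all" transmission).
    static Set<Long> unionCandidates(List<Set<Long>> perSiteCandidates) {
        Set<Long> union = new TreeSet<>();
        perSiteCandidates.forEach(union::addAll);
        return union;
    }

    // Pass 2 (not shown): genotype each sample at every position in the union
    // and emit the final per-sample records for joint calling.
}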

My point in this discussion is to remind us that we should be careful about trying to enable a specific use for the API that is embedded in a particular algorithmic approach. The question we should ask here is simply how to expose the data for maximal usefulness to the community. I think it is absolutely clear that we need to be able to convert back and forth with existing designs, or we will turn a lot of high-quality methods into legacy software and risk the usefulness of the API.

Sure, but we can accomplish that with this approach:

  • Supporting a layer that allows people to submit legacy-style jobs (e.g., jobs that accept BAM/CRAM/ADAM/Google Genomics input formats), which then request data shards from the GA4GH-compliant data store and execute locally on top of the store

The big thing is, you need to move the computation local* to the data for this approach to work if you are processing large datasets. The current API is more efficient than FTP if you're only touching a small segment of the dataset, because the indexed retrieval and slicing show large gains even with the REST overhead.

* Caveat: you don't need to be local if you have insanely fat pipes and an extremely optimized serving infrastructure. As I understand it, Google does use the GA4GH API to request shards across multiple datacenters. However, I'd be loath to generalize that what works well inside of Google will work well outside of Google. They've spent incredible amounts of time optimizing both communication between datacenters and their serving architecture (among other things), and we shouldn't pretend that we'll get performance that approaches theirs.

max-biodatomics commented 10 years ago

Hi, I would like to add suggestions from our implementation here.

We are using FUSE (filesystem in userspace) for on-the-fly conversion of data from our internal format to common formats (BAM, SAM, VCF). A FUSE client can be created for any platform (Linux, Mac, Windows). We are using a Java implementation of FUSE. Our current implementation works through SQL queries and connectors to a query engine (we can use Impala or Hive, and are working on a Spark SQL implementation now). This could easily be rewritten to use a RESTful API, so basically you could convert to a required format on the fly using a standard GA API. We are planning to make our implementation open source (Apache 2 license).

Still, Frank made a good point that this is good only for visualization. You cannot analyse all the data due to network bandwidth issues. He suggested moving the analysis to the data as a solution. I think it is the only viable solution.

Developing an API for this would be a good starting point. We have several prototypes for it, but we don't have a final solution yet.

max-biodatomics commented 10 years ago

It could be a good time to start a new project for a tools/workflows distribution API.

adamnovak commented 10 years ago

So it looks like we want two things:

1) Sliced compressed BAM export, for downloading everyone's reads in a small region in a kind-of-efficient way, to support centralized joint variant calling on the client.

2) A way to push variant calling onto the server, to support joint variant calling across distributed data that's too big to move. This is very related to #149.

max-biodatomics commented 10 years ago

1) Sliced compressed BAM export, for downloading everyone's reads in a small region in a kind-of-efficient way, to support centralized joint variant calling on the client.

My point here is that to implement this we only need to build a FUSE client which uses the GA API. Nothing extra is required. JSON could be transferred in compressed form between server and client.

pgrosu commented 10 years ago

Frank, I agree we should discuss #149 thoroughly, but even with huge local bandwidth you have to optimize your design and I/O implementation (e.g. Protocol Buffers, among others). For instance, take a look at Facebook's Presto, which they released for free on GitHub here:

https://github.com/facebook/presto

Below is a link describing their approach to leveraging different file systems while querying petabytes of data:

https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920

http://prestodb.io/

fnothaft commented 10 years ago

@pgrosu we're standardizing an interface, not an implementation.

pgrosu commented 10 years ago

@fnothaft, I understand, but we have to adjust our interface to what is possible; otherwise, if it's too general, we end up going in circles. I agree with the approach of large-scale integration and analysis, since it's my preferred way as well - and it covers the small cases too. But we should explore the what-if scenarios, just as you started above regarding design approaches, bandwidth and impact on our goals (sharing/processing/analysis/etc.).

lh3 commented 10 years ago

For samtools, @ekg's freebayes, and probably GATK, we can call variants with something like:

curl http://url/to/aln.bam | samtools mpileup -Euf ref.fa - | bcftools view -v - > var.vcf

If our APIs can give a BAM stream, we will be able to call variants right now. This method also works for a few samples if they are mixed in one BAM stream and for gVCF-based calling, though it is not suitable for all calling patterns.

A compressed JSON response is probably not as compact as a BAM stream, but is usually fine as long as we have the right tool to generate a BAM stream from the JSON response. By "right", I mean it should be a command-line tool that is portable, easy to install, easy to use and does not require root permissions.
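
As a rough illustration of what such an adapter could look like, here is a minimal sketch assuming htsjdk is on the classpath; it writes a single hard-coded record instead of paging through real reads-search JSON responses, and all read values and names are made up:

import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SAMSequenceRecord;

// Hypothetical GA4GH-JSON-to-BAM adapter skeleton: build a header from the
// read group set's reference metadata, convert each JSON read to a SAMRecord,
// and stream BAM to stdout so it can be piped into samtools/freebayes/etc.
public class Ga4ghToBam {
    public static void main(String[] args) {
        SAMFileHeader header = new SAMFileHeader();
        header.addSequence(new SAMSequenceRecord("20", 63025520));
        header.setSortOrder(SAMFileHeader.SortOrder.coordinate);

        try (SAMFileWriter out =
                 new SAMFileWriterFactory().makeBAMWriter(header, true, System.out)) {
            // A real adapter would loop over paged search-reads responses here.
            SAMRecord rec = new SAMRecord(header);
            rec.setReadName("example-read");
            rec.setReferenceName("20");
            rec.setAlignmentStart(1000000);
            rec.setCigarString("10M");
            rec.setReadString("ACGTACGTAC");
            rec.setBaseQualityString("IIIIIIIIII");
            rec.setMappingQuality(60);
            out.addAlignment(rec);
        }
    }
}

Such a tool could then take the place of curl in the pipeline above, e.g. java Ga4ghToBam | samtools mpileup -Euf ref.fa - | bcftools view -v - > var.vcf.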

diekhans commented 10 years ago

Agreed that if the API could return BAM format via HTTP it would greatly accelerate its acceptance. Support for requesting sub-ranges and returning them in BAM format (aka BAM slicing) would also have great utility.

Mark

dglazer commented 10 years ago

I hear several overlapping discussions in this thread: 1) what does our API need so it's possible to build a variant caller on top of it? 2) what does our API need so it's easy to build a variant caller on top of it? 3) what does our API need so it's efficient to build a variant caller on top of it?

For (1), I believe it's already possible, since the API lets you retrieve all the meaningful information from a BAM file. It might be a hassle (e.g. because you have to patch your legacy variant-calling code) and it might be slow (e.g. because it moves too much data over the wire), but it should work. If anyone knows of any gaps, let's file them as issues.

For (2), @ekg and @lh3 point out that one way to make it easier to retrofit existing tools would be to have an option for the API to return data directly in BAM format. I'm not sure I understand exactly what you have in mind, but what I picture is an option you could pass to POST /search/reads that, instead of returning a JSON GASearchReadsResponse, returns a binary blob that looks like it came from a BAM file. Is that right? If so, I think it should be easy for someone to prototype the interface with a client-side wrapper, and verify that it is actually easier to incorporate into callers than today's interface. (I'm not convinced it would be, since you still have to make sure the logic knows how to pull data in multiple requests, but maybe I'm misunderstanding the suggestion, and I've never built a variant caller.)
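
To make the idea in (2) concrete, a prototype client could look roughly like the sketch below (assuming Java 11+ for java.net.http); the endpoint path, the request fields (including "format": "BAM"), and the content-negotiation header are illustrative assumptions, not part of the current schema:

import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical: POST a reads search but ask the server for a BAM-like binary
// blob instead of a JSON GASearchReadsResponse, then stream it to stdout so it
// can be piped into existing tools.
public class SearchReadsAsBam {
    public static void main(String[] args) throws Exception {
        String body = "{\"readsetIds\":[\"example\"],\"referenceName\":\"20\","
                    + "\"start\":1000000,\"end\":1010000,\"format\":\"BAM\"}";
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://server.example.org/reads/search"))
            .header("Content-Type", "application/json")
            .header("Accept", "application/octet-stream") // ask for binary, not JSON
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<InputStream> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofInputStream());
        response.body().transferTo(System.out);
    }
}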

For (3), I hear two paths being suggested: (3a) choose a more efficient wire format -- when we think that's an important bottleneck, I suggest creating a new issue focused solely on wire format. Off the top of my head, options to discuss include gzipped JSON, BAM-like, Avro's preferred binary encoding (I assume there is one?), and proto.

(3b) don't move the data at all; move the code to the data instead -- I agree with @fnothaft we should keep that discussion in #149.

richarddurbin commented 10 years ago

With respect to (2), I don't think that @ekg and @lh3 are asking for the API to return data in BAM format. I think they are asking for someone to write a client-side wrapper that retrieves data via the API and outputs it in BAM. This would not need to be incorporated into current callers - it could just pipe a stream into them.
So if this were done, then all current callers that read BAM could run. I think it is worthwhile for the GA4GH API community to produce this.

With respect to (3), I think we will want to move to a more efficient transfer protocol once the principles are established and we are using this in anger. I would say we should use Avro binary encoding and/or proto, not BAM. Does gzipped JSON require conversion to text and back? If so, that doesn't seem ideal either.
I agree that moving at least some of the calculation server side is probably the right way to go in the longer term.

Richard

pgrosu commented 10 years ago

It's been some time since I looked at the source code for GATK, but looking at UnifiedGenotyper.java, if I read this correctly, after the initialization step the map phase seems to just call a method on an instance of UnifiedGenotypingEngine:

...
// the calculation arguments
private UnifiedGenotypingEngine genotypingEngine = null;
...
public List<VariantCallContext> map(RefMetaDataTracker tracker, ReferenceContext refContext, AlignmentContext rawContext) {
   return genotypingEngine.calculateLikelihoodsAndGenotypes(tracker, refContext, rawContext);
}
...

In the reduce phase, if one does not want a subset of the variants, then the calls are just a list of VariantCallContext objects generated by map above:

public UGStatistics reduce(List<VariantCallContext> calls, UGStatistics sum) { 
 ...
   writer.add(...);
 ...
}

If so, then we probably don't even need BAM files. In this instance, we just need to format the alignments and reference genome to conform with what AlignmentContext and ReferenceContext are expected to contain. Then the data can be a stream from a URI. This would be the GATK approach, but we should explore other ways as well before making a decision.

lh3 commented 10 years ago

For (2), direct BAM output would be more convenient, but an efficient and easy-to-use API-to-BAM adapter is good enough. We might ask @jrobinso for help. IGV can interact with API v0.1 now. I guess he is reading JSON into SAMRecord java objects internally.

For (3b), it is good to consider close-to-data computation, but supporting local analysis on users' computer is more important in the short term. All we need is a BAM stream, which does not seem difficult technically.

cassiedoll commented 10 years ago

We're actually already working on a direct GATK integration - see https://github.com/samtools/htsjdk/pull/98

ekg commented 10 years ago

@lh3, (3b): If we can support stream-based processing, then we can run our analyses anywhere: on our local machines, on virtual hosts or environments running in the same datacenter, or even on the same server where the data sits.

As an added benefit, this pattern makes it much easier for us to compose procedures built from many, disparate components.

fnothaft commented 10 years ago

@ekg +1 I'd go a step further and present an iterator instead of a stream. It'd be slightly more general, in the sense that you can build both map-reduce style processing and streaming processing on top of iterators.
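
A minimal sketch of that iterator idea, assuming htsjdk's CloseableIterator and leaving all paging/HTTP details out (class and field names are hypothetical):

import java.util.Iterator;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.util.CloseableIterator;

// Wrap paged GA4GH read responses behind a CloseableIterator<SAMRecord>, so
// both streaming consumers and map-reduce style code can share one source.
public final class Ga4ghReadIterator implements CloseableIterator<SAMRecord> {
    private final Iterator<SAMRecord> records; // e.g. the current decoded page

    public Ga4ghReadIterator(Iterator<SAMRecord> records) {
        this.records = records; // a real implementation would fetch later pages lazily
    }

    @Override public boolean hasNext() { return records.hasNext(); }
    @Override public SAMRecord next()  { return records.next(); }
    @Override public void close()      { /* release HTTP/connection resources */ }
}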

haussler commented 10 years ago

Chris, are you following this discussion as well? -D

pgrosu commented 10 years ago

@lh3, that would be great to try out once the API-to-BAM adapter is available for testing.

@cassiedoll, this is great! It would be helpful to compare it with @lh3's adapter in a side-by-side benchmark.

@ekg and @fnothaft, those would be some great approaches! @fnothaft, would avocado be something we could configure to at least perform variant calling using a stream from the API?

lh3 commented 10 years ago

To clarify, I am not the right person to implement the adapter. It is best written in Java, which I am not familiar with. As another note, the GATK developers are moving more and more components to htslib/C++. They are evaluating the possibility of implementing the next major algorithm in C++. Diving too deep into the current GATK might not be a long-term solution.

jrobinso commented 10 years ago

Heng, yes I am using the web service to fetch the JSON, then creating a SAMRecord-type object. I first abstracted IGV's SAM record class to decouple it from the sam-jdk "SAMRecord" object, then implemented a new SAM record type class (Ga4ghAlignment) for the ga4gh service. All the code is available here: https://github.com/broadinstitute/IGV/tree/master/src/org/broad/igv/ga4gh. It was done quickly and is subject to further refactoring, but it works.
This approach could be implemented in htsjdk itself quite easily, returning SAMRecord objects to the user.

cassiedoll commented 10 years ago

Closing in favor of #149

ekg commented 10 years ago

I think that supporting streaming variant calling based on the API isn't the same as #149. Suggest we reopen.

benedictpaten commented 10 years ago

I agree this is a smaller-scope topic that we could address independently of #149

lh3 commented 10 years ago

I agree with @ekg and @benedictpaten.

dglazer commented 10 years ago

I'm fine with a separate topic, but this thread has covered many different things; it would be good to focus. Per my earlier comment (https://github.com/ga4gh/schemas/issues/145#issuecomment-56325997), possible sub-topics include: (2) building an API-to-BAM adaptor, either purely client-side (as @richarddurbin suggests) or with some server-side API enhancements, and (3a) choosing a more efficient wire format.

@ekg @benedictpaten @lh3 -- if either or both of those are what you have in mind, I suggest creating a new topic focused specifically on that. (Since there are other ways to do variant calling against the API, including just adapting today's variant callers as is in-flight in the Picard source, and providing an execution environment as is being discussed in #149.)

lh3 commented 10 years ago

While I would like to see (2) in action first, I do not have only (2) and (3a) in mind. The value of this thread is to have other possibilities on the table, which raises questions such as what we want, how we prioritize, and who will implement. All of these deserve further discussion at ASHG. As for focus, #149 is actually much broader and will cover far more things when it gets expanded.

lh3 commented 10 years ago

I need to add that the GA4GH APIs are mainly (or only?) used for visualization so far. Supporting variant calling in the short term would be an even stronger showcase. This thread is really important.

fnothaft commented 10 years ago

While I would like to see (2) in action first, I do not have only (2) and (3a) in mind. The value of this thread is to have other possibilities on the table, which raises questions such as what we want, how we prioritize, and who will implement.

What do you have in mind beyond 2 and 3a/b? The main reason we marked this as a duplicate topic was that the 3a/b proposals overlap very closely with the execution environment (#149).

cassiedoll commented 10 years ago

I was hoping this could be a sub-topic of #149 to save time at ASHG. However, I agree that the solution to this issue is important, and you are all correct in that we shouldn't forget about it. Reopening now.

But we do need to collapse some of our ASHG topics down a bit, as there are probably too many to cover them all at the meeting. As Frank mentioned, the solution to #149 could go a long way toward solving this issue (and in fact the execution environment should be figured out first so we don't do something too variant-calling-specific here).

cassiedoll commented 10 years ago

(ps to @lh3 - we are using the ga4gh APIs in batch processing, with both spark and dataflow. so they are not only used in visualization.)

fnothaft commented 10 years ago

@cassiedoll This may be a bit more of a discussion for offline, but how are you using the API from Spark? Did you create a GA4GH input format?

lh3 commented 10 years ago

(ps to @lh3 - we are using the ga4gh APIs in batch processing, with both spark and dataflow. so they are not only used in visualization.)

Sounds great! Thank you.

cassiedoll commented 10 years ago

@fnothaft - that spark link above goes to our repo. It's a WIP for sure, but the code is functional. PCA on 1kg currently takes about 10 wall-clock hours with 40 four-core machines. No perf work has been done yet, so that's our next focus. There's a lot of low-hanging fruit, so I expect drastic speed-ups.

Unfortunately, all the dataflow code is currently private; we'll share it just as soon as we can. That code hits reads as well as variants.

pgrosu commented 10 years ago

We should definitely take more advantage of GPU-processing approaches, since they can fill a gap that MPI/Hadoop/etc. can't touch, achieving speed at "low" cost.

delagoya commented 9 years ago

Closing discussion.