bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Apache License 2.0
997 stars 309 forks source link

Consistency on ShuffleRegionJoin returns #1734

Closed devin-petersohn closed 7 years ago

devin-petersohn commented 7 years ago

I would like to make consistent the return types for all ShuffleRegionJoins, specifically to return GenericGenomicRDDs.

fnothaft commented 7 years ago

You mean, GenericGenomicRDD as opposed to GenomicRDD? Can you expand upon your motivations, pros/cons, etc?

devin-petersohn commented 7 years ago

They should be consistent for proper generalization downstream. In 37b971a3afb80598ffecd5275aff6c3ceaa96ef3 it was changed only for inner shuffle join, and I think that change makes sense to propagate through to the remaining shuffle joins.

fnothaft commented 7 years ago

Ah, I see what's going on. Yeah, that change sounds OK to me. Do you know if we can move the inner join back down to GenomicRDD instead? That'd be my preference.

devin-petersohn commented 7 years ago

I actually prefer GenericGenomicRDD more here (or some kind of GenomicPairRDD object) here because of the strange type-casting issue referenced in #1735. It is also a much cleaner abstraction than GenomicRDD[(T, Iterable[X]), Z].

heuermh commented 7 years ago

Still trying to get my head around this.

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

Could the methods in GenericGenomicRDD be moved up to GenomicRDD (would require traitabstract class)?

Would GenomicPairRDD limit joins to two types? It seems a completely reasonable use case to join reads, features, and variants. Or may be more efficient to join twice two of the same type than to union first then join (say variants and several different RDDs of features).

fnothaft commented 7 years ago

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

That's exactly their purpose.

Conceptually, we could also support GenericGenomicRDD as a type that one can transmute to, which I think would be reasonable. We don't support this now, but its just a matter of adding the conversion functions.

Could the methods in GenericGenomicRDD be moved up to GenomicRDD (would require trait → abstract class)?

The ones that call copy require GenericGenomicRDD to be a case class, union requires a concrete constructor/apply method, and the getReferenceRegions implementation doesn't make sense for non-generic GenomicRDDs.

Would GenomicPairRDD limit joins to two types? It seems a completely reasonable use case to join reads, features, and variants. Or may be more efficient to join twice two of the same type than to union first then join (say variants and several different RDDs of features).

Are you proposing to replace GenericGenomicRDD with GenomicPairRDD, which would be a GenomicRDD that enforces that the RDD's value is a tuple? If so, what would be the benefit to this over GenericGenomicRDD?

devin-petersohn commented 7 years ago

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

That's exactly their purpose.

Downstream, we need a way to manage the various set theory operations without modifying the internal records, changing the regionFn to be whatever we need.

Conceptually, we could also support GenericGenomicRDD as a type that one can transmute to, which I think would be reasonable. We don't support this now, but its just a matter of adding the conversion functions.

That would be useful considering BroadcastRegionJoin and ShuffleRegionJoin group by methods do not return the same tuple ordering. Being able to swap key and value is useful.

Are you proposing to replace GenericGenomicRDD with GenomicPairRDD, which would be a GenomicRDD that enforces that the RDD's value is a tuple? If so, what would be the benefit to this over GenericGenomicRDD?

The only additional benefit I see is calling ._1/._2 to get the GenomicRDD from the left/right side of the join (i.e. the AlignmentRecordRDD from the left). The use case would be post processing on the set theory operation (i.e. I only want the alignments that overlapped my variants in the output). I have no strong preference between keeping it as is or adding the GenomicPairRDD.

fnothaft commented 7 years ago

That would be useful considering BroadcastRegionJoin and ShuffleRegionJoin group by methods do not return the same tuple ordering. Being able to swap key and value is useful.

You can already do that with transform, and you can transmute from a GenericGenomicRDD to a non-generic. We are just missing conversion functions for going to a GenericGenomicRDD.

The only additional benefit I see is calling ._1/._2 to get the GenomicRDD from the left/right side of the join (i.e. the AlignmentRecordRDD from the left). The use case would be post processing on the set theory operation (i.e. I only want the alignments that overlapped my variants in the output). I have no strong preference between keeping it as is or adding the GenomicPairRDD.

See above; you can already do this with transform or transmute on a GenericGenomicRDD. GenomicPairRDD wouldn't make this easier.

devin-petersohn commented 7 years ago

You can already do that with transform, and you can transmute from a GenericGenomicRDD to a non-generic. We are just missing conversion functions for going to a GenericGenomicRDD.

I may be mistaken, but don't you need transmute to go from GenericGenomicRDD[(U, T)] to GenericGenomicRDD[(T, U)]? The transform requires a function T => T, or (U, T) => (U, T). Using transform didn't compile when I tried to go from (U, T) => (T, U).

See above; you can already do this with transform or transmute on a GenericGenomicRDD. GenomicPairRDD wouldn't make this easier.

It was meant to be more of a user-facing feature, rather than forcing the user to: 1.) know what type of GenomicRDD they had, and 2.) use the transmute API. I can just add that functionality to Lime if it doesn't belong here.

fnothaft commented 7 years ago

I may be mistaken, but don't you need transmute to go from GenericGenomicRDD[(U, T)]

Yes, you are correct; my apologies for the confusion. My broader point is that we don't need a new class to support this.

It was meant to be more of a user-facing feature, rather than forcing the user to: 1.) know what type of GenomicRDD they had, and 2.) use the transmute API.

Ah, are you saying you'd add a function to GenomicPairRDD that would call ._1 or ._2 on all elements in the RDD? Something like .keys or .values in Spark's PairRDDFunctions? If so, that makes sense, but I'd prefer to go with Spark's approach where the functions are added implicitly to any GenericGenomicRDD[(X,Y)] (and implemented via transmute), instead of replacing GenericGenomicRDD with PairGenomicRDD.

heuermh commented 7 years ago

Can you sketch out the two use cases above? Does having GenericGenomicRDD/GenomicPairRDD as return value help here?

val reads = sc.loadReads("sample.bam")
val variants = sc.loadVariants("sample.vcf.gz")
val features = sc.loadFeatures("ensembl.gff3.gz")

// 1. join all three types

val genes = sc.loadFeatures("gencode.v21.annotation.gff3.gz")
val polyAs = sc.loadFeatures("gencode.v21.polyAs.gff3.gz")
val pseudogenes = sc.loadFeatures("gencode.v21.2wayconspseudos.gff3.gz")
val tRnas = sc.loadFeatures("gencode.v21.tRNAs.gff3.gz")

// 2. join variants with all GENCODE features
fnothaft commented 7 years ago

To what @heuermh is saying, I would agree that it would be preferable to have the one signature change back to GenomicRDD instead of changing all the signatures to GenericGenomicRDD.

This is separate from the question of opening up the constructor on GenericGenomicRDD.

As an aside, if we go the implicit route, then we add the RDD-level _1/_2 functions while keeping the type as GenomicRDD.

devin-petersohn commented 7 years ago

The implicit route makes sense. I

Moving everything GenomicRDD creates problems for two functionalities that I need downstream:

1.) transform after the join creates a Nothing object. This is because the Z type is inferred to be Nothing after upcasting. 2.) I use a predicate on both the joined data and the regionFn. I cannot access the getReferenceRegions from the outside, and I need a way to get them. Performing a predicate on the regionFn from the GenericGenomicRDD after the join makes more sense to me than opening up permissions on getReferenceRegions.

devin-petersohn commented 7 years ago

@heuermh I don't think it affects the joins in any way either way.

devin-petersohn commented 7 years ago

Actually, now that I think of it, why is getReferenceRegions protected?

heuermh commented 7 years ago

Actually, now that I think of it, why is getReferenceRegions protected?

For me, that is part of a defensive API strategy. Keep things private or package private or otherwise hidden and final and immutable unless it becomes necessary to change it. That way you keep the public API as small as possible.

fnothaft commented 7 years ago

+1 @heuermh

devin-petersohn commented 7 years ago

I understand the idea of keeping the public API small, but isn't it reasonable for a user to want to see the regions for a given record? This is a getter, so mutability isn't the issue. As-is you have to build the ReferenceRegion from the contigName, start, and end, which feels messy considering we already have the plumbing to do it. This is a separate issue than GenericGenomicRDD vs GenomicRDD[((T, U), Z)], so we can revisit it later.