Closed devin-petersohn closed 7 years ago
You mean, GenericGenomicRDD
as opposed to GenomicRDD
? Can you expand upon your motivations, pros/cons, etc?
They should be consistent for proper generalization downstream. In 37b971a3afb80598ffecd5275aff6c3ceaa96ef3 it was changed only for inner shuffle join, and I think that change makes sense to propagate through to the remaining shuffle joins.
Ah, I see what's going on. Yeah, that change sounds OK to me. Do you know if we can move the inner join back down to GenomicRDD instead? That'd be my preference.
I actually prefer GenericGenomicRDD
more here (or some kind of GenomicPairRDD
object) here because of the strange type-casting issue referenced in #1735. It is also a much cleaner abstraction than GenomicRDD[(T, Iterable[X]), Z]
.
Still trying to get my head around this.
Is there any purpose for GenericGenomicRDD
other than the return value for the various join methods?
Could the methods in GenericGenomicRDD
be moved up to GenomicRDD
(would require trait
→ abstract class
)?
Would GenomicPairRDD
limit joins to two types? It seems a completely reasonable use case to join reads, features, and variants. Or may be more efficient to join twice two of the same type than to union first then join (say variants and several different RDDs of features).
Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?
That's exactly their purpose.
Conceptually, we could also support GenericGenomicRDD
as a type that one can transmute
to, which I think would be reasonable. We don't support this now, but its just a matter of adding the conversion functions.
Could the methods in GenericGenomicRDD be moved up to GenomicRDD (would require trait → abstract class)?
The ones that call copy
require GenericGenomicRDD
to be a case class, union
requires a concrete constructor/apply method, and the getReferenceRegions
implementation doesn't make sense for non-generic GenomicRDD
s.
Would GenomicPairRDD limit joins to two types? It seems a completely reasonable use case to join reads, features, and variants. Or may be more efficient to join twice two of the same type than to union first then join (say variants and several different RDDs of features).
Are you proposing to replace GenericGenomicRDD
with GenomicPairRDD
, which would be a GenomicRDD
that enforces that the RDD's value is a tuple? If so, what would be the benefit to this over GenericGenomicRDD
?
Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?
That's exactly their purpose.
Downstream, we need a way to manage the various set theory operations without modifying the internal records, changing the regionFn
to be whatever we need.
Conceptually, we could also support GenericGenomicRDD as a type that one can transmute to, which I think would be reasonable. We don't support this now, but its just a matter of adding the conversion functions.
That would be useful considering BroadcastRegionJoin
and ShuffleRegionJoin
group by methods do not return the same tuple ordering. Being able to swap key and value is useful.
Are you proposing to replace GenericGenomicRDD with GenomicPairRDD, which would be a GenomicRDD that enforces that the RDD's value is a tuple? If so, what would be the benefit to this over GenericGenomicRDD?
The only additional benefit I see is calling ._1
/._2
to get the GenomicRDD
from the left/right side of the join (i.e. the AlignmentRecordRDD
from the left). The use case would be post processing on the set theory operation (i.e. I only want the alignments that overlapped my variants in the output). I have no strong preference between keeping it as is or adding the GenomicPairRDD
.
That would be useful considering BroadcastRegionJoin and ShuffleRegionJoin group by methods do not return the same tuple ordering. Being able to swap key and value is useful.
You can already do that with transform
, and you can transmute
from a GenericGenomicRDD
to a non-generic. We are just missing conversion functions for going to a GenericGenomicRDD
.
The only additional benefit I see is calling ._1/._2 to get the GenomicRDD from the left/right side of the join (i.e. the AlignmentRecordRDD from the left). The use case would be post processing on the set theory operation (i.e. I only want the alignments that overlapped my variants in the output). I have no strong preference between keeping it as is or adding the GenomicPairRDD.
See above; you can already do this with transform
or transmute
on a GenericGenomicRDD
. GenomicPairRDD
wouldn't make this easier.
You can already do that with transform, and you can transmute from a GenericGenomicRDD to a non-generic. We are just missing conversion functions for going to a GenericGenomicRDD.
I may be mistaken, but don't you need transmute
to go from GenericGenomicRDD[(U, T)]
to GenericGenomicRDD[(T, U)]
? The transform
requires a function T => T
, or (U, T) => (U, T)
. Using transform
didn't compile when I tried to go from (U, T) => (T, U)
.
See above; you can already do this with transform or transmute on a GenericGenomicRDD. GenomicPairRDD wouldn't make this easier.
It was meant to be more of a user-facing feature, rather than forcing the user to: 1.) know what type of GenomicRDD
they had, and 2.) use the transmute
API. I can just add that functionality to Lime if it doesn't belong here.
I may be mistaken, but don't you need transmute to go from GenericGenomicRDD[(U, T)]
Yes, you are correct; my apologies for the confusion. My broader point is that we don't need a new class to support this.
It was meant to be more of a user-facing feature, rather than forcing the user to: 1.) know what type of GenomicRDD they had, and 2.) use the transmute API.
Ah, are you saying you'd add a function to GenomicPairRDD
that would call ._1
or ._2
on all elements in the RDD? Something like .keys
or .values
in Spark's PairRDDFunctions
? If so, that makes sense, but I'd prefer to go with Spark's approach where the functions are added implicitly to any GenericGenomicRDD[(X,Y)]
(and implemented via transmute
), instead of replacing GenericGenomicRDD
with PairGenomicRDD
.
Can you sketch out the two use cases above? Does having GenericGenomicRDD
/GenomicPairRDD
as return value help here?
val reads = sc.loadReads("sample.bam")
val variants = sc.loadVariants("sample.vcf.gz")
val features = sc.loadFeatures("ensembl.gff3.gz")
// 1. join all three types
val genes = sc.loadFeatures("gencode.v21.annotation.gff3.gz")
val polyAs = sc.loadFeatures("gencode.v21.polyAs.gff3.gz")
val pseudogenes = sc.loadFeatures("gencode.v21.2wayconspseudos.gff3.gz")
val tRnas = sc.loadFeatures("gencode.v21.tRNAs.gff3.gz")
// 2. join variants with all GENCODE features
To what @heuermh is saying, I would agree that it would be preferable to have the one signature change back to GenomicRDD
instead of changing all the signatures to GenericGenomicRDD
.
This is separate from the question of opening up the constructor on GenericGenomicRDD
.
As an aside, if we go the implicit
route, then we add the RDD-level _1
/_2
functions while keeping the type as GenomicRDD
.
The implicit
route makes sense. I
Moving everything GenomicRDD
creates problems for two functionalities that I need downstream:
1.) transform
after the join creates a Nothing
object. This is because the Z
type is inferred to be Nothing
after upcasting.
2.) I use a predicate on both the joined data and the regionFn
. I cannot access the getReferenceRegions
from the outside, and I need a way to get them. Performing a predicate on the regionFn
from the GenericGenomicRDD
after the join makes more sense to me than opening up permissions on getReferenceRegions
.
@heuermh I don't think it affects the joins in any way either way.
Actually, now that I think of it, why is getReferenceRegions
protected
?
Actually, now that I think of it, why is getReferenceRegions protected?
For me, that is part of a defensive API strategy. Keep things private or package private or otherwise hidden and final and immutable unless it becomes necessary to change it. That way you keep the public API as small as possible.
+1 @heuermh
I understand the idea of keeping the public API small, but isn't it reasonable for a user to want to see the regions for a given record? This is a getter, so mutability isn't the issue. As-is you have to build the ReferenceRegion
from the contigName
, start
, and end
, which feels messy considering we already have the plumbing to do it. This is a separate issue than GenericGenomicRDD
vs GenomicRDD[((T, U), Z)]
, so we can revisit it later.
I would like to make consistent the return types for all
ShuffleRegionJoins
, specifically to returnGenericGenomicRDD
s.