Open mmokrejs opened 6 months ago
Hi @mmokrejs, Thank you for opening this. I am not sure I follow the entire issue, and I'll try to answer from what I understand. Please correct me if I misunderstood something.
IRanges
is to specify intervals. Its irrelevant to this class if the interval is on the positive strand or the negative strand. and yes in R one can specify intervals using a combination of (start, width) or (start, end) pairs. But in the Python implementation this is limited to (start, width) pairs.
So the first step for your usecase is to parse the rows and extract the intervals
data_raw = [('NC_000011.10', 'DA340864;', '(complement)', [(49200441, 49200301), (49206172, 49206067), (49207868, 49208160)], [(1, 144), (145, 250), (251, 549)], ['97%', '99%', '97%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA394358;', '(complement)', [(49200441, 49200294), (49206172, 49206067), (49208292, 49208636)], [(1, 148), (149, 254), (255, 599)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208292, 49208559)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA259004;', '(complement)', [(49192894, 49192792), (49200441, 49200255), (49206172, 49206067), (49208292, 49208453)], [(6, 108), (109, 295), (296, 401), (402, 563)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24]),
('NC_000011.10', 'DA282176;', '(complement)', [(49200441, 49200267), (49206172, 49206067), (49208292, 49208559)], [(1, 175), (176, 281), (282, 549)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA299779;', '(complement)', [(49200441, 49200318), (49206172, 49206067), (49208292, 49208636)], [(1, 124), (125, 228), (229, 573)], ['100%', '98%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208292, 49208559)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA308616;', '(complement)', [(49192894, 49192799), (49200441, 49200255), (49206172, 49206067), (49208292, 49208466)], [(1, 96), (97, 283), (284, 389), (390, 564)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24])]
regions = []
exon_names = []
widths = []
starts = []
for dr in data_raw:
widths.extend([abs(x[0] - x[1]) for x in dr[3]])
starts.extend([x[0] if x[1] > x[0] else x[1] for i, x in enumerate(dr[3])])
exon_names.extend([dr[1]] * len(dr[3]))
regions.extend([dr[0]] * len(dr[3]))
Now that we have some of the information extracted, lets create a GenomicsRanges
object,
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
gr = GenomicRanges(
seqnames=[
"chr11"
] * len(starts),
ranges=IRanges(start=starts, width=widths),
strand=["-"] * len(starts),
mcols=BiocFrame(
{
"exon": exon_names,
"region": regions
}
),
)
print(gr)
## OUTPUT
GenomicRanges with 26 ranges and 26 metadata columns
seqnames ranges strand exon region
<str> <IRanges> <ndarray[int8]> <list> <list>
[0] chr11 49200301 - 49200441 - | DA340864; NC_000011.10
[1] chr11 49206067 - 49206172 - | DA340864; NC_000011.10
[2] chr11 49207868 - 49208160 - | DA340864; NC_000011.10
... ... ... | ... ...
[23] chr11 49200255 - 49200441 - | DA308616; NC_000011.10
[24] chr11 49206067 - 49206172 - | DA308616; NC_000011.10
[25] chr11 49208292 - 49208466 - | DA308616; NC_000011.10
------
seqinfo(1 sequences): chr11
Lets reduce these intervals into non-overlapping regions,
print(gr.reduce(with_reverse_map=True))
## OUTPUT
GenomicRanges with 5 ranges and 5 metadata columns
seqnames ranges strand revmap
<str> <IRanges> <ndarray[int8]> <list>
[0] chr11 49192792 - 49192894 - | [9, 22]
[1] chr11 49200255 - 49200441 - | [10, 23, 6, 19, 13, 3, 0, 16]
[2] chr11 49206067 - 49206172 - | [1, 4, 7, 11, 14, 17, 20, 24]
[3] chr11 49207868 - 49208160 - | [2]
[4] chr11 49208292 - 49208636 - | [12, 25, 8, 15, 21, 5, 18]
------
seqinfo(1 sequences): chr11
revmap
contains indices from the original GenomicRanges
object that map to the reduced intervals.
By default the reduce
method computes non-overlapping intervals for every unique (seqname, strand) combination in the object, similar to bioconductor's implementation. Let me know if this is incorrect to what you were expecting. It would also help if you can provide R snippets, so it helps me with debugging and tests.
Hi @jkanche , thank you for your excellent answer, I edited the code to match better biological terms and I fixed/swapped the coordinates of the trailing exons (my mistake in the initial example).
data_raw = [
('NC_000011.10', 'DA340864;', '(complement)', [(49200441, 49200301), (49206172, 49206067), (49208160, 49207868)], [(1, 144), (145, 250), (251, 549)], ['97%', '99%', '97%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA394358;', '(complement)', [(49200441, 49200294), (49206172, 49206067), (49208636, 49208292)], [(1, 148), (149, 254), (255, 599)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208559, 49208292)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA259004;', '(complement)', [(49192894, 49192792), (49200441, 49200255), (49206172, 49206067), (49208453, 49208292)], [(6, 108), (109, 295), (296, 401), (402, 563)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24]),
('NC_000011.10', 'DA259004;', None, [(32091250, 32091265), (81783019, 81783034), (89639236, 89639342), (89645003, 89645189), (89652496, 89652598)], [(116, 131), (145, 160), (162, 268), (269, 455), (456, 558)], ['100%', '100%', '100%', '98%', '99%'], ['==', '==', '->', '->', None], [None, None, '(GT/AG)', '(GT/AG)', None], [None, None, 24, 24, 24]),
('NC_000011.10', 'DA282176;', '(complement)', [(49200441, 49200267), (49206172, 49206067), (49208559, 49208292)], [(1, 175), (176, 281), (282, 549)], ['100%', '100%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA299779;', '(complement)', [(49200441, 49200318), (49206172, 49206067), (49208636, 49208292)], [(1, 124), (125, 228), (229, 573)], ['100%', '98%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [24, 24, 24]),
('NC_000011.10', 'DA301520;', '(complement)', [(49200441, 49200263), (49206172, 49206067), (49208559, 49208292)], [(1, 179), (180, 285), (286, 553)], ['99%', '99%', '100%'], ['->', '->', None], ['(GT/AG)', '(GT/AG)', None], [23, 24, 24]),
('NC_000011.10', 'DA308616;', '(complement)', [(49192894, 49192799), (49200441, 49200255), (49206172, 49206067), (49208466, 49208292)], [(1, 96), (97, 283), (284, 389), (390, 564)], ['100%', '100%', '100%', '100%'], ['->', '->', '->', None], ['(GT/AG)', '(GT/AG)', '(GT/AG)', None], [24, 24, 24, 24])
]
chromosome_accessions = []
mRNA_accessions = []
widths = []
starts = []
for dr in data_raw:
widths.extend([abs(x[0] - x[1]) for x in dr[3]])
starts.extend([x[0] if x[1] > x[0] else x[1] for i, x in enumerate(dr[3])]) # this will break if a gene spans origin (zero) in a circular chromosome
mRNA_accessions.extend([dr[1] for splice_var in data_raw])
chromosome_accessions.extend([dr[0] for splice_var in data_raw])
from genomicranges import GenomicRanges
from iranges import IRanges
from biocframe import BiocFrame
from random import random
import string
def generate_exon_names(count):
names = []
for letter1 in string.ascii_uppercase:
names.append('exon_%s' % letter1)
if count > len(string.ascii_uppercase):
for letter1 in string.ascii_uppercase:
for letter2 in string.ascii_uppercase:
names.append('exon_%s%s' % (letter1, letter2))
return names[:count]
splice_variants = [(i, splice_var[3]) for i, splice_var in enumerate(data_raw)]
strands = []
for splice_variant in splice_variants:
#print("splice variant: %s" % str(splice_variant))
i, exons = splice_variant
#print("All exons %s" % str(exons))
for exon in exons:
#print("Individual exon %s" % str(exon))
if data_raw[i][2] == '(complement)':
strands.append('-')
elif data_raw[i][2] is None:
strands.append('+')
else:
raise("Unexpected strand: %s" % str(data_raw[i][2]))
gr = GenomicRanges(
seqnames=[
"chr11"
] * len(starts),
ranges=IRanges(start=starts, width=widths),
strand=strands,
mcols=BiocFrame(
{
"accession": mRNA_accessions,
"exon": generate_exon_names(len(starts)),
"chr_accession": chromosome_accessions
}
),
)
print(gr)
Here I show which ot the splice variants can be merged:
[0] (49200441, 49200301), (49206172, 49206067), (49208160, 49207868)
[1] (49200441, 49200294), (49206172, 49206067), (49208636, 49208292)
[2] (49200441, 49200263), (49206172, 49206067), (49208559, 49208292) can be merged with data_raw[6]
[3] (49192894, 49192792), (49200441, 49200255), (49206172, 49206067) cannot be merged with data_raw[7] because 49192792 != 49192799
[4] (49200441, 49200267), (49206172, 49206067), (49208559, 49208292)
[5] (49200441, 49200318), (49206172, 49206067), (49208636, 49208292)
[6] (49200441, 49200263), (49206172, 49206067), (49208559, 49208292)
[7] (49192894, 49192799), (49200441, 49200255), (49206172, 49206067), (49208466, 49208292)
>>> print(gr)
GenomicRanges with 31 ranges and 31 metadata columns
seqnames ranges strand accession exon chr_accession
<str> <IRanges> <ndarray[int8]> <list> <list> <list>
[0] chr11 49200301 - 49200441 - | DA340864; exon_A NC_000011.10
[1] chr11 49206067 - 49206172 - | DA340864; exon_B NC_000011.10
[2] chr11 49207868 - 49208160 - | DA340864; exon_C NC_000011.10
... ... ... | ... ... ...
[28] chr11 49200255 - 49200441 - | DA308616; exon_AC NC_000011.10
[29] chr11 49206067 - 49206172 - | DA308616; exon_AD NC_000011.10
[30] chr11 49208292 - 49208466 - | DA308616; exon_AE NC_000011.10
------
seqinfo(1 sequences): chr11
>>> print(gr.reduce(with_reverse_map=True))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/foo/genomicranges/lib/python3.11/site-packages/genomicranges/GenomicRanges.py", line 1561, in reduce
output._mcols.set_column("revmap", rev_map, in_place=True)
File "/foo/genomicranges/lib/python3.11/site-packages/biocframe/BiocFrame.py", line 936, in set_column
return self.set_columns({column: value}, in_place=in_place)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/foo/genomicranges/lib/python3.11/site-packages/biocframe/BiocFrame.py", line 964, in set_columns
raise ValueError(
ValueError: Length of `value`, does not match the number of the rows,need to be 10 but provided 15.
>>>
@mmokrejs I fixed this bug yesterday, you may need to update the package to 0.4.14.
OK, 0.4.14 fixed the 0.4.12 issue. So now I have:
>>> print(gr)
GenomicRanges with 31 ranges and 31 metadata columns
seqnames ranges strand accession exon chr_accession
<str> <IRanges> <ndarray[int8]> <list> <list> <list>
[0] chr11 49200301 - 49200441 - | DA340864; exon_A NC_000011.10
[1] chr11 49206067 - 49206172 - | DA340864; exon_B NC_000011.10
[2] chr11 49207868 - 49208160 - | DA340864; exon_C NC_000011.10
... ... ... | ... ... ...
[28] chr11 49200255 - 49200441 - | DA308616; exon_AC NC_000011.10
[29] chr11 49206067 - 49206172 - | DA308616; exon_AD NC_000011.10
[30] chr11 49208292 - 49208466 - | DA308616; exon_AE NC_000011.10
------
seqinfo(1 sequences): chr11
>>> print(gr.reduce(with_reverse_map=True))
GenomicRanges with 10 ranges and 10 metadata columns
seqnames ranges strand revmap
<str> <IRanges> <ndarray[int8]> <list>
[0] chr11 32091250 - 32091265 + | [13]
[1] chr11 81783019 - 81783034 + | [14]
[2] chr11 89639236 - 89639342 + | [15]
[3] chr11 89645003 - 89645189 + | [16]
[4] chr11 89652496 - 89652598 + | [17]
[5] chr11 49192792 - 49192894 - | [9, 27]
[6] chr11 49200255 - 49200441 - | [10, 28, 6, 24, 18, 3, 0, 21]
[7] chr11 49206067 - 49206172 - | [1, 4, 7, 11, 19, 22, 25, 29]
[8] chr11 49207868 - 49208160 - | [2]
[9] chr11 49208292 - 49208636 - | [12, 30, 8, 20, 26, 5, 23]
------
seqinfo(1 sequences): chr11
>>>
But, I did not want to have isolated list of exons but to retain compatible (actually existing) range combinations . Assumably the 8 items from data_raw
would shrink down to 7 items. Let me think what the reduce
actually did.
Basically, it does not do what I need. It should act on the lists of ranges (read on the fragments of splice variants defined by a combination of exonic ranges). The very first and very last exon might be incomplete, hence why I initially wrote that start1
and end5
positions are not important for the decision if the two listings of ranges can be merged (in the virtual example I initially spotted both splice variants consist of 5 exons). If we speak of this 5-exonic example and being on minus strand of the genome, the start1
being greater and endX
being smaller should be incorporated after the union
. All the internal exon junctions must be exactly same (read the start[2,3,4,5]
must match exactly in the pair of splice variants to be eventually unified together and likewise, end[1,2,3,4]
must match exactly. In brief, the inner start
/end
values must match exactly while the very first start
and very last end
values may differ and the widest result of the union should be output as a result. Am I clearer now?
The abs(x[0] - x[1])
is asking for a trouble. It would be much better if python crashed somewhere on a negative value. Moreover I think x[0] - x[1] + 1
should be there if x[0] > x[1]
else x[1] - x[0] + 1
. This will break anyway if the range crosses zero (bacterial circular chromosome origin). The (start, width)
definition is bad and error-prone that it should be dismounted.
Chiming in here.
I find it very unusual to see start positions to be greater than the end position, irrespective of the strand. Certainly this is not the case for any standard formats (GFF, BED), nor is it the case for the R GRanges
(aside from the corner case of zero-width ranges due to the closed nature of the interval definition). We will not be supporting this.
As for the (start, width)
choice; well, you either have to specify the width or the end to define an interval, and one is as good as the other. Your background may suggest to you that end
is a "obvious" choice, but this opinion is not necessarily shared. For example, I regularly use width
instead of end
when handling interval data, because the use of unsigned integer widths guarantees the validity of the interval without requiring a check for end >= start. Both the R and Python implementations also use width
under the hood to represent IRanges
, so it is natural for the constructor to accept width
rather than end
if the former is already available. You could argue that it may be more user-friendly to support end
in the IRanges
constructor, seeing as how all of the standard formats use end
rather than width
; I wouldn't disagree with that, but that would be in addition to the existing width=
option.
As for the question of what you're trying to do: I don't think the GRanges
generic functions, either here or in the R package, will give you an easy "one-click" solution. Off the top of my head; it would be something like:
find_overlaps
with type="equal"
(note: not yet supported in the Python version) on the GRanges
containing the exons, to find all groups of shared exons across splice variants.Seems fairly complex.
Hi @LTLA , thank you for joining in.
I find it very unusual to see start positions to be greater than the end position, irrespective of the strand. Certainly this is not
About half of the genes in a genome are on a minus strand.
the case for any standard formats (GFF, BED), nor is it the case for the R
GRanges
(aside from the corner case of zero-width ranges due to the closed nature of the interval definition). We will not be supporting this.
Please realize the first exon (exon A) is on the far right, while last exon is on far left. Start of an exon will always be where it actually starts, that is where a ribosome starts to synthesize a protein. You cannot change it hoping that biological reality will adjust to broken programmers assumptions. There is no way to enumerate the exons from "left" nor number them from lower to higher coordinate positions. Take the below as authoritative example of a gene annotated on a minus strand. Notably, the exonic start
positions are >
those of end
and it cannot be any other way.
ExonsSpreadsheet-Homo_sapiens_Transcript_Exons_ENST00000356696.csv ExonsSpreadsheet-Homo_sapiens_Transcript_Exons_ENST00000340334.csv ExonsSpreadsheet-Homo_sapiens_Transcript_Exons_ENST00000256999.csv
As for the
(start, width)
choice; well, you either have to specify the width or the end to define an interval, and one is as good as the other. Your background may suggest to you thatend
is a "obvious" choice, but this opinion is not necessarily shared.
When you parse any annotation format like GBK or GFF, the coordinates of exons and of genes in general are start
and end
positions. The width
must be calculated as an extra work and moreover, is prone to mistakes if there is uncertainty whether the width
should be applied rightwards or leftwards from the start
.
For example, I regularly use
width
instead ofend
when handling interval data, because the use of unsigned integer widths guarantees the validity of the interval without requiring a check for end >= start. Both the R and Python implementations also
This is exactly why I think start, width
should be deprecated. Those asking for a trouble may continue using it but assumingly those working with minus strand annotation will burn their hands sooner or later.
use
width
under the hood to representIRanges
, so it is natural for the constructor to acceptwidth
rather thanend
if the
Propagation of broken underlying design does not justify anything. This leads to nowhere, like if we discuss 0-based numbering still being retained in python as it is compatible with the slicing syntax and in BED format.
former is already available. You could argue that it may be more user-friendly to support
end
in theIRanges
constructor, seeing as how all of the standard formats useend
rather thanwidth
; I wouldn't disagree with that, but that would be in addition to the existingwidth=
option.As for the question of what you're trying to do: I don't think the
GRanges
generic functions, either here or in the R package, will give you an easy "one-click" solution. Off the top of my head; it would be something like:1. `find_overlaps` with `type="equal"` (note: not yet supported in the Python version) on the `GRanges` containing the exons, to find all groups of shared exons across splice variants. 2. For each variant, iterate across its exons, extract the set of variants with the same exon, and find the intersection. Each variant in the intersection contains a superset of the current variant's exons. 3. Decide which ones you want to keep.
Seems fairly complex.
Each genome annotation/prediction website tries to merge compatible splice variants to a minimum. It is not difficult, just checking if the inner exon junction have exactly same values and once they differ, the two variants in question cannot be merged. Alternatively, if all the inner exon junctions are same then union of the inner exons and (pre/ap)pend those wider of the first and last exons.
Hi, I thought I can use
GenomicRanges
to create its objects based on[(start1,end1), (start2,end2), (start3,end3), (start4,end4), (start5,end5)]
exon positions and then unify these splice variants into less "rows" while ensuringstart[2,3,4,5]
must match exactly andend[1,2,3,4]
must match exactly. The examples in README.md but even docs/tutorial.md are simply comfusing. I understand it is easy to programmatically smash in some number forstart
s andend
s (orwidth
s, huh?) but neither of the examples seems to carefully demonstrate how one can represent several splice variants sharing several exons.Moreover, the https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.union seems to do something what I do not want.
Then https://biocpy.github.io/GenomicRanges/api/genomicranges.html#genomicranges.GenomicRanges.GenomicRanges.reduce does something what I do not understand.
What is scary for me that IRanges accept
start
andwidth
instead ofstart
andend
. I fear if I specify on minus strand start=21, end=10 some unknown code breaks becausestart > end
and I fearstart=21
,width=9
will result instart=21, end=30
.The documentation is so large that I do not even want to study if this originates from
IRanges
orR
implementation ofgenomicranges
else.Please make the examples realistic.
These shoukd be a helper method so one does not need to learn the internals of the implementation and simply pass-in a list of tuples and be done with instantiation of a
GenomicRanges
object.My data are in the following format, some examples:
I will have thousand of those and want to meaningfully merge the splice variants if they have exactly same exons except
start1
andend5
values which do not have to be exactly same. The resulting unified splice variant(s) should retain these two values to accommodate the widest splice variant from the input.Ideally I want to numsort by start and end positions, left to right. Please note this gene is on minus strand.
Thank you for any effort to make the package and its documentation more usefull to biologists.