Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings
52 stars 16 forks source link

Encoding ASCII up to 255 #65

Open gevro opened 2 years ago

gevro commented 2 years ago

How do I use BString / BStringSet for values > 127?

PacBio data uses 0:255 ranges for fp/rp/fi/ri tags. I am trying to use BioStrings to work with these tags.

But I ran into trouble, for values > 127.

Technically, ASCII shoudl go up to 255, so I'm not sure why BString shouldn't be able to handle things like this: BStringSet(rawToChar(as.raw(135)))

It seems to just return '?' for any value > 127.

gevro commented 2 years ago

Also, how do I pass ASCII (raw) value 0 to BStringSet?

This works:

> BStringSet(rawToChar(as.raw(70)))
BStringSet object of length 1:
    width seq
[1]     1 F

This doesn't work

> BStringSet(rawToChar(as.raw(0)))
BStringSet object of length 1:
    width seq
[1]     0 
hpages commented 2 years ago

Hi,

rawToChar(as.raw(0)) is an empty string so no surprise that BStringSet(rawToChar(as.raw(0))) returns what it does.

The '?' you see when displaying BStringSet(rawToChar(as.raw(135))) is only a displaying issue (more precisely it's a show() method issue), but the content of the BStringSet object is as expected:

x <- BString(rawToChar(as.raw(135)))
x
# 1-letter BString object
# seq: �

as.integer(x)
# [1] 135

You want to avoid using the character string intermediate representation when going from raw to BString. One way to do this is:

x <- as(as(as.raw(c(0L, 135L, 0L, 255L)), "XRaw"), "BString")

Again, display is broken:

x
# 4-letter BString object
# Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
#   embedded nul in string: '\0\x87\0\xff'

But content is correct:

length(x)
# [1] 4

as.integer(x)
[1]   0 135   0 255

length(c(x, reverse(x)))
# [1] 8

as.integer(c(x, reverse(x)))
# [1]   0 135   0 255 255   0 135   0

2 things that would improve the current situation: (1) make it easier to go from raw to BString or BStringSet, and (2) fix display.

Unfortunately I don't have the resources to work on these improvements at the moment but PRs are welcome.

Thanks, H.

FelixErnst commented 5 months ago

Hi. quite late to the conversation.

What the rational for these values? I am not familiar with what the values represent. I assume some kind of quality values. In addition to the PR as suggest, you might want to consider a dedicate class as extension of Biostrings package.

In Modstrings, currently 166 integer values are used for representing RNA modifications. Internally, this these integers are used and just converted for 'show' to actual letters. Both parts are totally independent, so you are not constricted to a 1:1 ASCII to int conversion.

Felix

gevro commented 5 months ago

These ascii values are encoding kinetic information for PACbio sequencing data. they are a standard part of the Pacbio data format. So it would be useful to have ability to convert integers from 0 to 255.

FelixErnst commented 5 months ago

From experience, representing that with Biostrings should not be a problem. However, displaying these values to console is separate issue. What you in console, is not necessarily what is stored.

Example: For DNA, the values 1,2,4,8 (4 bits) are used to store letter G, A, T, C (at least 8 bit; in case of ASCII 7 bit). Only during display to console, does the conversion take place. The following snippet throws an error, but this is just because the lookup lkup is NULL

> dss <- DNAStringSet(c("G","A","T","C"))
> XVector:::extract_character_from_SharedRaw_by_positions(dss[[1]]@shared,1L,1L,collapse = FALSE)
Error in XVector:::extract_character_from_SharedRaw_by_positions(dss[[1]]@shared,  : 
  'x' contains an invalid byte (4 = char '') at position 1

So you need to be careful when debugging your code, that you not mistake "data stored" and "representation shown". Your example does actually work, because you put in an empty string.

> BStringSet(rawToChar(as.raw(0)))
BStringSet object of length 1:
    width seq
[1]     0 
# This is the same as this
> BStringSet("")
BStringSet object of length 1:
    width seq
[1]     0 
# Because zero for R is actually an empty character
> rawToChar(as.raw(0))
""

So you might have the problem, that without realizing, you did an integer to character conversion, which you actually don't want to do. You want to keep the value as is, if I understand you correctly. The BString is the big string class in Biostrings, which means "a big string (a long sequence of characters)" (quote from the man page ?BString)

I could image, that you need something more like this, since quality values can be stored using the XStringQuality class from Biostrings.

> qs <- QualityScaledDNAStringSet(x = DNAStringSet(c("G","A","T","C")),quality = PhredQuality(IntegerList(1,2,3,4)))
> qs
  A QualityScaledDNAStringSet instance containing:

DNAStringSet object of length 4:
    width seq
[1]     1 G
[2]     1 A
[3]     1 T
[4]     1 C

PhredQuality object of length 4:
    width seq
[1]     1 "
[2]     1 #
[3]     1 $
[4]     1 %

The interesting bit is, that you need to use the IntegerList class as far as I know, since you cannot create a XStringQuality from a list and using c which turns the values during BString construction into actual characters.

> il <- IntegerList(0:255)
> il
IntegerList of length 1
[[1]] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
> pq <- PhredQuality(il)
> pq
PhredQuality object of length 1:
    width seq
[1]   256 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijk...~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

However, also in pq not the integer values are stored, since during object creation the are converted as stated in the manual "PhredQuality objects store characters that are interpreted as [0 - 99] quality measures by subtracting 33 from their ASCII decimal representation (e.g. ! = 0, " = 1, # = 2, ...)"

> as.integer(pq[[1]])
  [1]  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
 [40]  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110
 [79] 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126
[118] 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126
[157] 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126
[196] 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126
[235] 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126 126

So you might want to consider at least three steps:

Personally, I would start from the last step.

FelixErnst commented 5 months ago

These ascii values are encoding kinetic information for PACbio sequencing data. they are a standard part of the Pacbio data format. So it would be useful to have ability to convert integers from 0 to 255.

Hint: this is a bit confusing. Do you want to store ASCII or integer values from 0 to 255? It is either or. Both cannot be true at the same time.

ASCII also just has 128 values and only some are human readable: https://en.wikipedia.org/wiki/ASCII

gevro commented 5 months ago

Sorry, I by mistake wrote ASCII. What I mean is simply integer values from 0 to 255.

Here is an example tag from a read in a BAM file: ip:B:C,30,29,31,34,34,34,27,28,27,30,28,31,29,46,36,29,29,38,27,33,33,30,34,27,39,26,37,46,34,27,29,29,34,32,33,36,34,29,32,30,25,28,30,30,34,31,28,31,33,25,37,34,22,27,32,45,26,23,25,26,25,30,35,30,28,31,33,37,23,28,31,41,30,27,25,30,29,30,30,40,30,29,31,28,31,45,67,29,34,34,36,23,27,25,33,36,35,33,27,33,30,30,26,27,26,38,82,26,36,33,47,28,25,30,34,32,31,38,27,24,42,29,30,27,28,25,31,26,30,36,25,26,27,33,31,34,49,32,32,28,24,24,33,27,26,26,34,31,30,29,29,22,25,29,30,25,39,32,29,28,26,26,27,42,28,32,28,28,32,34,23,28,28,31,34,32,34,31,39,28,39,26,30,28,26,33,41,32,24,24,34,25,33,29,27,42,31,40,33,28,27,34,27,22,22,27,25,29,29,36,35,47,29,26,29,26,24,25,28,33,25,26,40,26,25,23,29,26,68,49,29,27,25,33,27,31,31,23,38,31,35,36,31,34,42,24,25,26,44,26,30,33,30,29,30,31,24,28,26,25,43,30,27,32,26,25,30,25,30,28,30,31,43,28,27,24,32,29,23,26,25,23,22,31,24,30,27,42,36,23,26,26,28,28,36,34,26,27,36,27,34,29,35,45,22,32,38,32,32,28,36,38,31,27,31,20,25,28,31,31,27,33,37,69,25,28,29,32,28,28,42,35,29,29,37,73,68,49,25,25,31,38,26,26,35,23,40,53,30,37,30,25,30,26,31,59,31,66,32,30,40,28,26,37,23,25,34,44,23,33,32

Per BAM specifications, the 'B:C' type of tags encode an array of unsigned 8-bit integers, i.e. 0 to 255.

Because I need to use the sequenceLayer command on these arrays, I need to load them as BStringSet objects.

What I do for these arrays, where x is the array is: BStringSet(as(as(as.raw(x),"XRaw"),"BString"))

So the question is if there is an easier / better way to do this conversion.

hpages commented 5 months ago

So the question is if there is an easier / better way to do this conversion.

Not at the moment but the good news is that we finally might have the resources to work on this soon.

gevro commented 5 months ago

That would be great - thanks!

FelixErnst commented 5 months ago

What I do for these arrays, where x is the array is: BStringSet(as(as(as.raw(x),"XRaw"),"BString"))

This will convert it to a character nonetheless instead of retaining the original integer value. In addition you are not covering the standard, since you cannot represent 0 that way

> x <- 0:255
> BStringSet(as(as(as.raw(x),"XRaw"),"BString"))
BStringSet object of length 1:
    width seq
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
  embedded nul in string: '\0\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ'
gevro commented 5 months ago

Ok. I'm assuming per @hpages that you all will be able to create a better way to do this soon.

hpages commented 5 months ago

@FelixErnst BStringSet(as(as(as.raw(x),"XRaw"),"BString")) actually does contain the original values:

> bstring <- BStringSet(as(as(as.raw(0:255),"XRaw"),"BString"))
> identical(as.integer(bstring[[1]]), 0:255)
[1] TRUE

In addition you are not covering the standard, since you cannot represent 0 that way

The strings in XString/XStringSet objects are not 0-terminated like C strings. So 0 can actually be part of an XString/XStringSet object. As explained earlier in this thread, the error you are showing above is an issue with the show() method. But the internal representation of the object is correct.

FelixErnst commented 5 months ago

Yes, you are right.

With the conversion the constructor is avoided, which either would call as.character (BStringSet) or would not work at all with integer values (BString). Nice!

  • Implement your own XStringQuality. which does not convert any integer values.
  • Check that you can still use the BStringSet class as backend to represent the values for your XStringQuality, Otherwise implement your own XStringSet class.
  • figure out a meaningful way, how to print those values to console.

So I guess this is a yes for the second point.

ahl27 commented 2 months ago

Finally getting around to this--I'm working on adding an optional parameter to display extended ASCII values for BStrings. It would be ideal for to maintain the "1 position = 1 character" display type for other strings, meaning that all values in 0:255 map to exactly one character.

One option to do that is to use some standard set of single character codes that supports values in 0-255. It happens that the Braille character set has exactly 256 characters (encoded in unicode 0x2800-28FF).

This would look like this:

> x <- as(as(as.raw(0:255),"XRaw"),"BString")
> x
256-letter BString object
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
  embedded nul in string: '\0\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#'
> options(Biostrings.showRaw=TRUE)
> x
256-letter BString object
seq: ⠀⠁⠂⠃⠄⠅⠆⠇⠈⠉⠊⠋⠌⠍⠎⠏⠐⠑⠒⠓⠔⠕⠖⠗⠘⠙⠚⠛⠜⠝⠞⠟⠠⠡⠢⠣...⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿
> 

It's not super readable, but our options are relatively limited when it comes to maintaining readability with a 256 character set...it's at least more readable than the current implementation. Definitely open to other suggestions on that front, for now I'm just looking to have something that can display at all. Most of the BString functionality works perfectly well aside from the printing to the console bit, so it seems like that's the main task to solve here.

Currently working on a more robust version, I'll update with a link when it's in a semi-ready state.

ahl27 commented 2 months ago

Example implementation is available at https://github.com/ahl27/Biostrings/commit/494d81af7e087f6c59a7bfa0a2f05243a4199973

Cons of this approach are that it only supports ASCII values from 0:254. There's a limitation of the internal value not supporting embedded 0s anywhere, so I'm having trouble being able to translate every character without losing some. One option is just to overload 256 and 0 to the same value purely for display purposes, which isn't the worst tradeoff for this sort of thing. Again, this is just for display--internally the values are all the same.

Edit: all values are now supported. Examples below.

> x <- as(as(as.raw(0:255),"XRaw"),"BString")
> x
256-letter BString object
Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
  embedded nul in string: '\0\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !"#'

> options(Biostrings.showRaw=TRUE)                                                                                             
> x
256-letter BString object
seq: ⠀⠁⠂⠃⠄⠅⠆⠇⠈⠉⠊⠋⠌⠍⠎⠏⠐⠑⠒⠓⠔⠕⠖⠗⠘⠙⠚⠛⠜⠝⠞⠟⠠⠡⠢⠣...⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿
> x[1:10]
10-letter BString object
seq: ⠀⠁⠂⠃⠄⠅⠆⠇⠈⠉
> x[1:110]                                                                                                                     
110-letter BString object
seq: ⠀⠁⠂⠃⠄⠅⠆⠇⠈⠉⠊⠋⠌⠍⠎⠏⠐⠑⠒⠓⠔⠕⠖⠗⠘⠙⠚⠛⠜⠝⠞⠟⠠⠡⠢⠣...⡊⡋⡌⡍⡎⡏⡐⡑⡒⡓⡔⡕⡖⡗⡘⡙⡚⡛⡜⡝⡞⡟⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭
> extract_character_from_XString_by_ranges(x, 1L, 5L, FALSE)                                                                   
[1] "⠀⠁⠂⠃⠄"
> extract_character_from_XString_by_ranges(x, 10L, 10L, FALSE)                                                                  
[1] "⠉⠊⠋⠌⠍⠎⠏⠐⠑⠒"
> extract_character_from_XString_by_ranges(x, c(10L, 20L), c(10L, 10L), FALSE)                                                  
[1] "⠉⠊⠋⠌⠍⠎⠏⠐⠑⠒" "⠓⠔⠕⠖⠗⠘⠙⠚⠛⠜"
> extract_character_from_XString_by_ranges(x, c(10L, 20L), c(10L, 10L), TRUE)                                                  
[1] "⠉⠊⠋⠌⠍⠎⠏⠐⠑⠒⠓⠔⠕⠖⠗⠘⠙⠚⠛⠜"

Again, it's not the most beautiful thing in the world, but it does display all possible 8-bit integer values in a consistent format. Still a WIP, definitely open to any suggestions on better character sets we could use. Long term it's going to need a little bit more thought to ensure that things like as.character(x) won't be negatively affected. I'll revisit this later after unit testing is all fleshed out (timeline for that is here).

Long-term, it definitely makes sense to make a more robust raw <-> BString set of functions to facilitate this kind of stuff...running as(as(as.raw(x),"XRaw"),"BString") is obviously not ideal...

hpages commented 2 months ago

Thanks @ahl27. My suggestion would be to map only non-printable ASCII codes to these special characters. But for printable ASCII codes (i.e. codes >= 32 and <= 126) I think we should just stick to their usual representation.

gevro commented 2 months ago

Hi all, As the original poster of this issue - thank you for the efforts!

I also wanted to point out another situation / consideration: doing sequenceLayer on a BStringSet, the sequenceLayer command injects custom ASCII characters for D, N, I, S, H CIGAR operations specified by the D.letter, N.letter, I.letter, S.letter, and H.letter parameters. However, if the input BStrings already span the full ASCII character range, then there is no way to distinguish the input ASCII characters from these new injected ASCII characters.

I think sequenceLayer should have a way to somehow inject ASCII characters that are outside the range of the BString range. Not sure how that might be possible, but this is a use case that I think is relevant to the above. Or perhaps if you want I can post it as a separate issue.

ahl27 commented 2 months ago

To Hervé's comment--That's a good point...I'll take a longer look at it tomorrow. Maybe there's a cleaner solution than this first pass.

Maybe it would be easier to just build a new codec for BString objects that only gets used if getOption(Biostrings.showRaw) == TRUE. That solution would be significantly simpler, but due to everything using utf8, we'd have to map multiple byte values to the same output. If that isn't an issue, it would be really simple to build a codec that just maps everything outside 32-126 to just ? or something, and only use it for decoding. As I'm writing this, I'm thinking this may be the best option (along with a note/warning added to the documentation). Could also consider throwing a warning when as.character() is called on a BString with the showRaw option set to TRUE.

If people want to be able to compare strings following an as.character conversion, using Unicode is probably the cleanest option, but it does make things a little more clunky than just following the established conventions. I'm not sure how common a use case that is, though...I don't see a huge problem with encouraging usage of the XString comparators, and I'm nervous about building a lot of extra stuff for this one issue. Maybe I stash this solution and down the line (if necessary) we add a RawString class that can support multibyte/unicode character codecs.

gevro commented 2 months ago

Just to add - I think that sequenceLayer also would need to be changed to allow assignment of characters for special CIGAR operations outside the range of 0 to 255, so that these operations can be distinguished from input data that spans 0 to 255.

ahl27 commented 2 months ago

@gevro thats an interesting use-case...I'm unfortunately not sure if it's possible without a significant overhaul of the entire XVector class, which is somewhat beyond the scope of Biostrings. The BString input range is unsigned 8-bit integers, and iirc supporting values in the complete 0:255 range is a common request.

The issue is that the internal calls to pull subsequences of a XString expects each element of the string to map to a single char, which are uint8_t values. If we were to be able to distinguish the codes you're mentioning from standard BString characters, we'd have to either reserve some values in the range as illegal for BString input, or support values larger than 255...the former would be impossible while satisfying the complete 0:255 range support ask, but supporting values larger than 255 would mean we have to change the internal calls to support non-char arguments, which would require a lot of changes across multiple packages.

Maybe @hpages can confirm that that assessment is correct, he may have other ideas that make sense.

Otherwise, I'd suggest the best option is likely:

gevro commented 2 months ago

Thanks for your thoughts. I defer to @hpages of course on potential future ways of addressing this additional issue. Or if there is a work-around (I can't think of one).

ahl27 commented 2 months ago

Just some more further thoughts, since I've been thinking about this for a while now...

For completeness sake, I should also mention that there is an argument to be made that non-ASCII values aren't something we should support. XString objects are meant to be containers for storing sequences of characters. If you're working with byte values or larger values, XRaw and XInteger values (resp.) are more likely what you should be using.

It doesn't seem like the information you're trying to load is a sequence at all, and measures like quality scores are already supported for arbitrary values using things like XStringQuality or even just general metadata in mcols. Are you sure these numbers need to be loaded in a BString? I'd almost recommend https://github.com/Bioconductor/Biostrings/issues/65#issuecomment-2016737525 as a more viable solution than trying to force these into a BString or BStringSet.

Maybe I'm misunderstanding your previous comments, so perhaps more information on what the input data is would help. "Kinetic information for PacBio sequencing reads" indicates sequencing metadata, and according to the PacBio datastandard, BC tags are just barcoding information (which would also be metadata, not BString objects). Could you clarify a little more why you're trying to run sequenceLayer on this input?

That said, @FelixErnst has already mentioned that there's a usage for nonstandard ASCII values in Modstrings, and the internals do already support inputs in the 0:255 range. My preferred solution would be to allow an option to at least not throw errors when we call show on an object with invalid ASCII values, since I don't think it's that much of an edge case that we can't support it. I'll definitely work on incorporating that in the near future (see #112).

Sorry for all the back and forth, I was excited about working on this and jumped into it a little too fast--should've reread the discussion a few more times first.

gevro commented 2 months ago

Thanks very much for trying to help here. Hopefully I can clarify why I'm stuck a bit better.

I'm analyzing kinetic PacBio data, from the tags 'ip' and 'pw' in PacBio BAM files. These are encoded as 'B,C' type tags: https://pacbiofileformats.readthedocs.io/en/13.0/BAM.html . Per SAM specifications, 'B,C' type tags are unsigned 8-bit integers, i.e. values of 0 to 255: https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf.

These ip and pw tags have one value per query base. My goal is to then convert these from query to reference coordinates, using the 'sequenceLayer' function, which is the only function I know of that can take a string/array of values and convert from query to reference coordinates given a CIGAR alignment string.

Since 'sequenceLayer' only takes XStringSet objects as input, I'm forced to store the ip and pw values for all the reads as a BStringSet, which is then input into sequenceLayer.

But the issue is that sequenceLayer injects deletions as an ASCII character between 0 to 255 in the new reference-coordinate transformed string. And I can't differentiate the injected deletion character from any identical values in the original ip / pw tags.

Hopefully that makes sense now? Given that, is there any feasible work-around?

gevro commented 2 months ago

Just a thought I had: maybe a solution for this situation would be if 'sequenceLayer' could also accept and return XRaw or another type of object that allows multi-byte values (i.e. range bigger than 0 - 255)? That way I don't have to convert to BString/BStringSet before running 'sequenceLayer', and then the injected insertion/deletion/etc characters can be set to values outside of the 0-255 range.

Just note, this is a separate issue from the above visualization issue.

ahl27 commented 2 months ago

I'm analyzing kinetic PacBio data, from the tags 'ip' and 'pw' in PacBio BAM files. These are encoded as 'B,C' type tags: https://pacbiofileformats.readthedocs.io/en/13.0/BAM.html . Per SAM specifications, 'B,C' type tags are unsigned 8-bit integers, i.e. values of 0 to 255: https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf.

ah, I see--my mistake, I was looking at BC tags, not B,C type tags.

This is making more sense, I just want to clarify a little more since this isn't an analysis I'm super familiar with. Thanks for bearing with me haha

These ip and pw tags have one value per query base. My goal is to then convert these from query to reference coordinates, using the 'sequenceLayer' function...

Are the ip/pw tags what you're actually using to map query->reference coordinates? I'm envisioning one of two scenarios:

(Scenario 1) The mapping is based on the query sequence but we can't lose the kinetics info, e.g.

Reference: AAAAAATTTTT

  Query:  A  A  T  T  T  T
ip vals: ff f0 cc c0 88 80

# Mapping
Ref:  A  A  A  A  A  A  T  T  T  T  T
  Q:  -  -  -  -  A  A  T  T  T  T  -
 IP:  -  -  -  - ff f0 cc c0 88 80  -

Here there isn't any need to load these values as a BString, we can just store them as metadata associated with each BString in the set, and then subset them with the returned subsetting of sequenceLayer. If you're mapping based on strings but want to also weight by the ip/pw tags, then that would be the same as making a new XStringQuality class.

(Scenario 2) Both query/reference are just the B,C tag info:

Reference: 00 04 08 0a 0c 0e 40 44 48 4a 4c 4e 80 84 88 8a 8c 8e
Query: 08 0c 0e 40 44

# Mapping
Ref: 00 04 08 0a 0c 0e 40 44 48 4a 4c 4e 80 84 88 8a 8c 8e
  Q: -- -- 08 -- 0c 0e 40 44 -- -- -- -- -- -- -- -- -- --

And in this scenario we definitely need to load both sequences as BString objects before passing to sequenceLayer, or come up with some kind of other workaround.

Would you mind commenting on which of these scenarios are closer to correct? Or if neither are correct, maybe like a very small contrived example to illustrate would be helpful--I saw your previous posts, but it wasn't 100% clear.

Some other quick comments:

maybe a solution for this situation would be if 'sequenceLayer' could also accept and return XRaw or another type of object that allows multi-byte values (i.e. range bigger than 0 - 255)?

XRaw wouldn't be feasible here, since it also expects a single byte per position...XInteger is probably closer to what you'd want, since you need the full 0:255 range and additional values outside that range. If that indeed turns out to be the best and most feasible solution, that would happen in the GenomicAlignments package itself (which Hervé maintains).

Just note, this is a separate issue from the above visualization issue.

Yep--I'm in agreement with you that adding a visualization option for ""invalid"" values should be incorporated in some way. That implementation is prototyped, but I need to finish building out some testing suites before it's added.

gevro commented 2 months ago

Thanks so much.

The alignment of the ip or pw tags from query -> reference does not depend or require the actual read or reference sequences. I am just giving sequenceLayer a BStringSet that contains the ip/pw tags and the CIGAR string.

In other words, the mapping relies on the CIGAR string of the read, and it does not require any info about the read or reference sequences. And it is not using the ip / pw values themselves for the mapping.

Your suggestion of adding the ip and pw values as extra metadata to the read sequence, so that ip and pw do not need to be loaded as BString is interesting. Then doing sequenceLayer, and then extracting the corresponding locations from the metadata -- I'm not sure how that would work though? Specifically this step you wrote "then subset them with the returned subsetting of sequenceLayer".

How would I use the result of sequenceLayer be used to extract ip/pw values in reference coordinates, and also doing so without confusing between injected 'insertion' or 'deletion' ASCII characters?

See toy example:

library(GenomicAlignments)
DNA1 <- DNAString("ATTTCTAGAGA")
DNA2 <- DNAString("CGGGCTGACGC")
DNASet <- DNAStringSet(list(DNA1,DNA2))

ip1 <- BString("2flsd9f0slm")
ip2 <- BString("mkwi9v0-m31")
ipSet <- BStringSet(list(ip1,ip2))

cigar1 <- "5M1X2I1M1D1M"
cigar2 <- "5M1I2D1M2M"
cigars <- c(cigar1,cigar2)

metadata(DNASet)$ip <- ipSet

DNASet.layered <- sequenceLayer(DNASet,cigars)

metadata(DNASet.layered) # EMPTY
ahl27 commented 2 months ago

Sorry for the slow reply--I've been thinking about this more over this week and looking into the codebase.

First, a temporary workaround:

subset_arbitrary_values_by_cigar <- function(xv, cigar){
    # query -> reference mapping
    ops_to_remove <- c("I", "S")
    ops_to_inject <- c("D", "N")
    inject_at <- cigarRangesAlongQuerySpace(cigar, ops=ops_to_inject)[[1]]
    inject_w <- rev(explodeCigarOpLengths(cigar, ops=ops_to_inject)[[1]])

    filler_regions <- as(rep(0, max(inject_w)), class(xv))
    filler_starts <- rev(attr(inject_at, "start"))

    # inject 0 filler to insertion regions
    # subseq isn't vectorized, it seems, so have to loop
        # going in reverse so the positions don't change from insertion
    for(i in seq_along(filler_starts))
        subseq(xv, start=filler_starts[i], width=0L) <- subseq(filler_regions, start=1L, width=inject_w[i])

    # Delete regions
    # transform the cigar string now that we've inserted
    ops_to_inject <- paste(ops_to_inject, collapse='')
    cigar <- gsub(paste0("[", ops_to_inject, "]"), "M", cigar)
    delete_at <- cigarRangesAlongQuerySpace(cigar, ops=ops_to_remove)[[1]]

    total_range <- IRanges(1, width=length(xv))
    xv <- xv[setdiff(total_range, delete_at)]

    xv
}

This function takes in an XVector object and a cigar string, and subsets it by (1) removing regions that should be removed, and (2) inserting 0 values in insertion points. To demonstrate with your example (and another quick one I mocked up):

DNA1 <- DNAString("ATTTCTAGAGA")
DNA2 <- DNAString("CGGGCTGACGC")
DNA3 <- DNAString("ATGCCAAGGAT")
DNASet <- DNAStringSet(list(DNA1,DNA2,DNA3))

cigar1 <- "5M1X2I1M1D1M"
cigar2 <- "5M1I2D1M2M"
cigar3 <- "3M2I1M2D1M1D1I3M"
cigars <- c(cigar1, cigar2, cigar3)

# You could use any character/raw string with something like:
# xv1 <- as(as.integer(charToRaw(ip1)) + 1L, "XInteger")
# Adding 1 because 0 is reserved for gap characters. XInteger supports >255, so this isn't a problem.

# I'm going to just use integers to make the output more easily verifiable
ip_codes <- as(1:11, "XInteger")

sequenceLayer(DNASet,cigars)
## DNAStringSet object of length 3:
##    width seq
## [1]    10 ATTTCTA-GA
## [2]    12 CGGGC--GACGC
## [3]    11 ATGA--A-GAT

subset_arbitrary_values_by_cigar(ip_codes, cigar1)
## XInteger of length 10
##  [1]  1  2  3  4  5  6  9  0 10 11

subset_arbitrary_values_by_cigar(ip_codes, cigar2)
## XInteger of length 12
##  [1]  1  2  3  4  5  0  0  7  8  9 10 11

subset_arbitrary_values_by_cigar(ip_codes, cigar3)
> subset_arbitrary_values_by_cigar(ip_codes, cigar3)
## XInteger of length 11
##  [1]  1  2  3  6  0  0  7  0  9 10 11

It's not perfect (would be nice if it operated on XVectorList objects instead of just XVector objects), but it does seem to get the job done. I'm assuming you're mapping query -> reference here, but you could probably change around the ops_to_remove and ops_to_inject characters depending on usecase.


Next, a discussion on implementation.

This is unfortunately a bit of a tough spot regarding implementation. On the one hand, it's not super feasible to support values above 255 for XStringSet objects. On the other, these data are part of the PacBio standard, so it's definitely a valid request to have a way to parse that data as part of sequenceLayer and associated functions.

What are the possible solutions? As far as I can tell, there's three potentially viable options:

  1. don't support this usecase.
  2. support this via allowing sequenceLayer to take XVector as input rather than XStringSet
  3. build a custom XStringQuality class that can support this usecase.

Of those, I think (2) is likely not feasible. sequenceLayer is...relatively complicated, and would require a substantial amount of overhaul to work with XVector objects. I'm not working on GenomicAlignments, and afaik Hervé has a lot of higher priority tasks on his plate.

(1) is obviously the easiest, but doesn't really solve the issue.

I think (3) is likely the best solution. If I were to implement it, I'd probably do something like this:

For an example implementation, consider storing 2 BString objects per XString. The tags you're using range from 0-255, so they have 8 bits of information. Store the lower 4 bits in the first BString and the upper 4 bits in the second BString. With this, you could have representable BStrings (either use hex values or map them to ascii as bitwAnd(value, 0x0F) + 64L) for the lower bits, 0xF0 for the upper bits) without running into this issue of not having spare characters to map:

## Very quick implementation, this is likely not the most efficient
doubleBStringLayer <- function(ip_vals, cigars, ...){
    ip_vals <- lapply(ip_vals, charToRaw)
    lower_vals <- lapply(ip_vals, \(x) x & as.raw(0x0F))
    upper_vals <- lapply(ip_vals, \(x) x & as.raw(0xF0))

    # use ASCII representation
    #lower_vals <- vapply(lower_vals, \(x) intToUtf8(as.integer(x) + 64L), character(1L))
    #upper_vals <- vapply(upper_vals, \(x) intToUtf8(as.integer(rawShift(x,-4)) + 64L), character(1L))

        # use hexadecimal representation
    lower_vals <- vapply(lower_vals, \(x) paste(substring(as.character(x), 2, 2), collapse=''), character(1L))
    upper_vals <- vapply(upper_vals, \(x) paste(substring(as.character(x), 2, 2), collapse=''), character(1L))

    lowerBString <- BStringSet(lower_vals)
    upperBString <- BStringSet(upper_vals)

        # pass through any further args to sequenceLayer
    lowerBString <- sequenceLayer(lowerBString, cigars, ...)
    upperBString <- sequenceLayer(upperBString, cigars, ...)

    BStringSetList(upperBString, lowerBString)
}

ip1 <- "2flsd9f0slm"
ip2 <- "mkwi9v0-m31"
cigar1 <- "5M1X2I1M1D1M"
cigar2 <- "5M1I2D1M2M"
doubleBStringLayer(c(ip1, ip2), c(cigar1, cigar2))                                                                                    
## BStringSetList of length 2
## [[1]] 3667637-66 66763--32633
## [[2]] 26c3493-cd db799--0dd31

## these translate back to the original values when read vertically:
##   32, 66, 6C, 73, 64, ...
## =  2,  f,  l,  s,  d, ...

That approach would be extensible in the event that future sequencing runs encode data larger than 255, since it would be possible to encode values as arbitrarily many BString objects per nibble necessary.

Implementing that class would be slightly more challenging than just this function--you'd probably need additional methods to abstract the multiple strings into a cleaner user experience (e.g., custom show methods that translate the multiple BString objects into a single value, as.integer methods to map between this quality and the raw integer representation, etc.). That's probably slightly out of scope for Biostrings, but I think that implementing this extension would be a worthwhile Bioconductor package, if you're interested in making one.

So that brings my (current) final recommendation to a combo of (1) and (3)--leaving Biostrings and GenomicAlignments as-is, and implementing a custom XStringQuality class in a separate Bioconductor package for people interested in working with kinetics data on sequencing reads. I'm suggesting the secondary package route because it would be a lot of added code for a relatively specific usecase, but if you were to make an implementation and PR it to Biostrings I'd be willing to consider it. I'm not sure I have the time in the near future to add this functionality, but it may be feasible once everything in my roadmap is done.

gevro commented 2 months ago

HI Aidan, Thanks so much - very thoughtful and creative! From my user perspective, I think a separate class for kinetic data that internally splits the data into smaller BStrings is reasonable, and then adding a sequenceLayer method that can process those new kinetic data objects.

I agree this is a very specific use case. I anticipate it will grow with time, since this is the basis for methylation and other calls in PacBio data but still a small group of users. So putting this in a separate package makes sense to avoid interfering with GenomicAlignments and Biostrings.

My own coding level is not up to par to be able to implement this myself, but I'm happy to beta test in detail and stress test any implementation.

In the meantime, I will try your work-around, though I suspect it may be quite slow when run on millions of reads, since it is not fully vectorized. I will try at least.

ahl27 commented 2 months ago

No worries! Since I've been thinking about it, try just using this solution based on the doubleBString prototype that should scale much better and have your desired input format:

sequenceLayer_bctags <- function(values, cigars, ...){
    ## Input:
    ##   - values: either a list of integer or raw OR character strings
        ##   - cigars: cigar strings
        ##   - ... : further arguments to be passed to `sequenceLayer`
        ##
        ## Output: list of length(values) containing integers
        ##    - -1 is gap character, all other values converted to integers
    # values are integer numbers
    if(is(values, 'character'))
        values <- lapply(values, \(x) as.character(charToRaw(x)))
    else
        values <- lapply(values, \(x) as.character(as.raw(x)))

    lower_vals <- vapply(values, \(x) paste(substring(x, 2, 2), collapse=''), character(1L))
    upper_vals <- vapply(values, \(x) paste(substring(x, 1, 1), collapse=''), character(1L))

    lowerBString <- BStringSet(lower_vals)
    upperBString <- BStringSet(upper_vals)

    lowerBString <- sequenceLayer(lowerBString, cigars, ...)
    upperBString <- sequenceLayer(upperBString, cigars, ...)

    upper_bits <- strsplit(as.character(upperBString), '')
    lower_bits <- strsplit(as.character(lowerBString), '')

    retval <- vector('list', length(upper_bits))
    for(i in seq_along(upper_bits)){
        v <- paste0(upper_bits[[i]], lower_bits[[i]])
        outv <- numeric(length(v))
                 outv[] <- -1L
        p_valid <- v!='--'
        outv[p_valid] <- as.integer(paste0("0x", v[p_valid]))
        retval[[i]] <- outv
    }

    retval
}

Example Usage:

ip1 <- "2flsd9f0slm"
ip2 <- "mkwi9v0-m31"
ip_values <- c(ip1, ip2)
cigar1 <- "5M1X2I1M1D1M"
cigar2 <- "5M1I2D1M2M"
cigars <- c(cigar1,cigar2)

sequenceLayer_bctags(ip_values, cigars)
## [[1]]
##  [1]  50 102 108 115 100  57 115  -1 108 109
##
## [[2]]
##  [1] 109 107 119 105  57  -1  -1  48  45 109  51  49

## can also provide a list of ints
set.seed(123L)
ip_values <- list(sample(11), sample(11))
ip_values
## [[1]]
##  [1]  3 11  2  6 10  5  4  9  8  1  7
##
## [[2]]
##  [1] 11  5  3  9  1  4  7 10  6  8  2

# also able to change sequenceLayer args
sequenceLayer_bctags(ip_values, cigars, from='reference', to='query')
## [[1]]
##  [1]  3 11  2  6 10  5 -1 -1  4  8  1  7
##
## [[2]]
##  [1] 11  5  3  9  1 -1 10  6  8  2

Only note is that this assumes gap characters will be "-", if that's not a reasonable assumption you'd just have to modify the stuff in the for{...} loop.

Maybe down the line I can get around to making this a real class with better error checking etc. (or if anyone else sees this and wants to, that would be even better 😂)

gevro commented 2 months ago

Thanks so much! This is super helpful and is a great workaround.

I just tested it and it seems to work, but I'm curious if there is any way to speed up this part?

retval <- vector('list', length(upper_bits))
    for(i in seq_along(upper_bits)){
        v <- paste0(upper_bits[[i]], lower_bits[[i]])
        outv <- numeric(length(v))
                 outv[] <- -1L
        p_valid <- v!='--'
        outv[p_valid] <- as.integer(paste0("0x", v[p_valid]))
        retval[[i]] <- outv
    }

    retval
gevro commented 2 months ago

Actually it works decently fast. I can't figure out a faster (more vectorized) way to do this.

ahl27 commented 2 months ago

I'm not sure if it's faster, but a more "R" way to do it would be:

sequenceLayer_bctags <- function(values, cigars, ...){
    # values are integer numbers
    if(is(values, 'character'))
        values <- lapply(values, \(x) as.character(charToRaw(x)))
    else
        values <- lapply(values, \(x) as.character(as.raw(x)))
    lower_vals <- vapply(values, \(x) paste(substring(x, 2, 2), collapse=''), character(1L))
    upper_vals <- vapply(values, \(x) paste(substring(x, 1, 1), collapse=''), character(1L))

    lowerBString <- BStringSet(lower_vals)
    upperBString <- BStringSet(upper_vals)

    lowerBString <- sequenceLayer(lowerBString, cigars, ...)
    upperBString <- sequenceLayer(upperBString, cigars, ...)

    upper_bits <- strsplit(as.character(upperBString), '')
    lower_bits <- strsplit(as.character(lowerBString), '')

    res <- mapply(paste0, "0x", upper_bits, lower_bits, USE.NAMES=FALSE)
    res <- lapply(res, \(x) {
        v <- suppressWarnings(as.integer(x))
        v[is.na(v)] <- -1L
        v})
    res
}
gevro commented 2 months ago

Thanks! Yes I wrote something similar, but it's about the same speed as the for loop.

I also decided to keep the 'deletion' positions as NA instead of -1. This is better for downstream calculations that for example need to calculate a mean kinetics value. This allows removal of this line: v[is.na(v)] <- -1L