elm-toulouse / cbor

🤖 An Elm library implementing: RFC 7049 Concise Binary Object Representation (CBOR)
http://cbor.io/
MIT License

Expressive power #1

Closed simonh1000 closed 3 years ago

simonh1000 commented 3 years ago

Hi, I was excited to find your library as I investigate the EU covid passport, as it uses CBOR. However, I am unable to find a way to extract the data I wanted. Here is the code for node-cbor that does work:

const cbor = require("cbor");

// `decompressed` is the inflated payload (a Buffer) obtained earlier
const results = cbor.decodeAllSync(decompressed);
const result = results[0];
const [headers1, headers2, cbor_data, signature] = result.value;

const greenpassData = cbor.decodeAllSync(cbor_data);
const userData = greenpassData[0].get(-260).get(1);

The issue (or at least my first one) is that result.value looks like

[
    <Buffer a2 01 26 0...>,
    {},
    <Buffer a4 01 62 4... 179 more bytes>,
    <Buffer 7e a4 63 4f 96  ... 14 more bytes>
  ]

Note that it is not homogeneous. In practice, I want index 2, but there is neither an index nor a oneOf decoder that I could use.

Can you see a way to do this?

mpizenberg commented 3 years ago

@KtorZ if you have some time to look at this

mpizenberg commented 3 years ago

@simonh1000 from reading a bit of the CBOR spec, it seems to me that your node code takes advantage of the fact that CBOR does not, in theory, need a schema: it can be decoded as a succession of "stuff", where each "stuff" is one of a few major types (numbers, strings, bytes, arrays and maps). So decodeAllSync generates a list of "stuff" with different types, and you only care about the stuff at a given position.

From reading the decoding API of elm-toulouse/cbor, it seems to me there is no type defined to cover everything that may appear in a CBOR-encoded byte sequence. Something like

type CborData
  = CborInt Int
  | CborFloat Float
  | ...

And since there does not seem to be such a type, a function like decodeAllSync is currently not possible. I think the current API was designed for scenarios where users know the schema of the data and want to decode all of it. Your use case, instead, seems to be that you only know a "partial schema", or at least that you are only interested in part of the data.
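To make the schema-less reading concrete, here is a minimal sketch of a generic CBOR decoder in plain JavaScript. It is not the API of node-cbor or elm-toulouse/cbor: the function name decodeItem, the shape of its return value and the sample bytes are all made up for illustration, and it only covers a subset of RFC 7049 (no floats, no indefinite lengths, no 64-bit integers).

```javascript
// Hypothetical helper: decodes ONE data item starting at `offset` and
// returns { value, next }, where `next` is the offset just after the item.
function decodeItem(bytes, offset = 0) {
  const initial = bytes[offset];
  const major = initial >> 5;   // top 3 bits: major type
  const info = initial & 0x1f;  // low 5 bits: additional information
  let pos = offset + 1;

  // Read the argument (a small value, a length, or a tag number).
  let arg;
  if (info < 24) {
    arg = info;
  } else if (info === 24) {
    arg = bytes[pos];
    pos += 1;
  } else if (info === 25) {
    arg = (bytes[pos] << 8) | bytes[pos + 1];
    pos += 2;
  } else if (info === 26) {
    arg = bytes[pos] * 2 ** 24 +
      ((bytes[pos + 1] << 16) | (bytes[pos + 2] << 8) | bytes[pos + 3]);
    pos += 4;
  } else {
    throw new Error(`unsupported additional information: ${info}`);
  }

  switch (major) {
    case 0: // unsigned integer
      return { value: arg, next: pos };
    case 1: // negative integer
      return { value: -1 - arg, next: pos };
    case 2: // byte string
      return { value: bytes.slice(pos, pos + arg), next: pos + arg };
    case 3: // text string (UTF-8)
      return {
        value: new TextDecoder().decode(bytes.slice(pos, pos + arg)),
        next: pos + arg,
      };
    case 4: { // array: elements may have different types
      const items = [];
      for (let i = 0; i < arg; i++) {
        const item = decodeItem(bytes, pos);
        items.push(item.value);
        pos = item.next;
      }
      return { value: items, next: pos };
    }
    case 5: { // map
      const map = new Map();
      for (let i = 0; i < arg; i++) {
        const key = decodeItem(bytes, pos);
        const val = decodeItem(bytes, key.next);
        map.set(key.value, val.value);
        pos = val.next;
      }
      return { value: map, next: pos };
    }
    case 6: { // tag wrapping another item
      const inner = decodeItem(bytes, pos);
      return { value: { tag: arg, value: inner.value }, next: inner.next };
    }
    default: // major type 7: only false, true and null are handled here
      if (info === 20) return { value: false, next: pos };
      if (info === 21) return { value: true, next: pos };
      if (info === 22) return { value: null, next: pos };
      throw new Error("floats and other simple values are not supported");
  }
}

// Hand-made sample (NOT real certificate data): tag 18 around an array of
// [bytes, empty map, bytes, bytes], where the byte strings hold nested CBOR.
const envelope = Uint8Array.from([
  0xd2, 0x84,
  0x43, 0xa1, 0x01, 0x26,             // bytes: CBOR for {1: -7}
  0xa0,                               // empty map
  0x45, 0xa1, 0x01, 0x62, 0x42, 0x45, // bytes: CBOR for {1: "BE"}
  0x42, 0x7e, 0xa4,                   // bytes: stand-in signature
]);

const top = decodeItem(envelope).value;        // { tag, value: [...] }
const header = decodeItem(top.value[0]).value; // CBOR-in-CBOR
const payload = decodeItem(top.value[2]).value;
```

Because every item carries its own type in its first byte, the decoder never needs to know in advance that the array mixes byte strings with a map. That is exactly what a typed, schema-driven decoder cannot express without either a generic value type or a decoder for heterogeneous arrays.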

Am I correct in my understanding of your issue?

simonh1000 commented 3 years ago

I don't specifically need generic decoding. The top level data is - I think - a tagged list, of which I want the element at index 2 (the third one). After thinking some further I thought I could get the following to work, but something is still not right (on re-reading the docs, perhaps I should not have expected maybe to do what I needed?).

    -- CD.tagged (CDTag.Unknown 18) (CD.list <| CD.succeed ())
    CD.tagged (CDTag.Unknown 18) (CD.list (CD.maybe CD.bytes))
        |> CD.andThen
            (\( _, lst ) ->
                CD.succeed lst
             --case lst of
             --    _ :: _ :: gpData :: _ ->
             --        CD.succeed gpData
             --
             --    _ ->
             --        CD.fail
            )

The commented-out code 'works' (it reports ()), which seems to confirm I'm getting some of the shape correct, but the uncommented code fails, and I'm not yet sure why.

I suspect that CDTag.Unknown 18 is also too specific for data in general, but this is what my (Belgian) data uses.

KtorZ commented 3 years ago

@simonh1000 do you have by any chance a little excerpt of the CBOR-encoded string you're trying to decode?

In your excerpt above, I find the use of andThen suspicious, for serialized structures like that are rarely nested, though they sometimes do embed CBOR-in-CBOR.

simonh1000 commented 3 years ago

I'd rather not share my data, but you can extract something similar from your EU 'pass sanitaire' - it's the data in the QR code. You have to remove the "HC1:" at the beginning to get a base45 string that you need to decode and then inflate.
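For readers who want to try the first two steps, here is a rough sketch of the prefix stripping and Base45 decoding (as specified in RFC 9285) in plain JavaScript. The function names stripPrefix and base45Decode are hypothetical, and the input is the short example string from the RFC behind a made-up "HC1:" prefix, not a real certificate.

```javascript
// Base45 alphabet: a character's position in this string is its digit value.
const ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:";

// Hypothetical helper: drop the "HC1:" prefix if present.
function stripPrefix(qrText) {
  return qrText.startsWith("HC1:") ? qrText.slice(4) : qrText;
}

// Hypothetical helper: decode a Base45 string into bytes.
// Every 3 characters encode 2 bytes; a trailing group of 2 encodes 1 byte.
function base45Decode(text) {
  const out = [];
  for (let i = 0; i < text.length; i += 3) {
    const chunk = text.slice(i, i + 3);
    if (chunk.length === 1) {
      throw new Error("invalid base45 length");
    }
    // Digits are little-endian: value = c0 + c1 * 45 + c2 * 45 * 45.
    let n = 0;
    for (let j = chunk.length - 1; j >= 0; j--) {
      const digit = ALPHABET.indexOf(chunk[j]);
      if (digit < 0) {
        throw new Error(`invalid base45 character: ${chunk[j]}`);
      }
      n = n * 45 + digit;
    }
    if (chunk.length === 3) {
      if (n > 0xffff) throw new Error("base45 group out of range");
      out.push(n >> 8, n & 0xff);
    } else {
      if (n > 0xff) throw new Error("base45 group out of range");
      out.push(n);
    }
  }
  return Uint8Array.from(out);
}

// "BB8" is the RFC's encoding of the two bytes "AB".
const demo = base45Decode(stripPrefix("HC1:BB8"));

// With real data, the resulting bytes are zlib-compressed; in Node the next
// step would be something like require("zlib").inflateSync(bytes).
```

The output of the inflate step is the tagged CBOR structure discussed in the rest of this thread.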

Here's my full code:

module Main exposing (..)

import Base45
import Cbor.Decode as CD
import Cbor.Tag as CDTag
import Inflate

raw =
    "NCF...TFB-D"

init _ =
    let
        _ =
            raw
                |> Base45.decode
                |> Result.andThen (Inflate.inflateZLib >> Result.fromMaybe "inflate failed")
                |> Result.andThen (CD.decode decCbor >> Result.fromMaybe "cbor failed")
                |> Debug.log ""
    in
    ( (), Cmd.none )

decCbor =
    --CD.dict CD.string <| CD.succeed "something"
    --CD.list CD.bytes
    CD.tagged (CDTag.Unknown 18) (CD.list <| CD.succeed ())
        --CD.tagged (CDTag.Unknown 18) (CD.list (CD.maybe CD.bytes))
        |> CD.map
            (\( _, lst ) ->
                lst
             --case lst of
             --    _ :: _ :: gpData :: _ ->
             --        CD.succeed gpData
             --
             --    _ ->
             --        CD.fail
            )

--CD.tag
--    |> CD.andThen
--        (\tag ->
--            let
--                _ =
--                    Debug.log "tag" tag
--            in
--            CD.fail
--        )

update _ m =
    ( m, Cmd.none )

subscriptions _ =
    Sub.none

main : Program () () msg
main =
    Platform.worker
        { init = init, update = update, subscriptions = subscriptions }

KtorZ commented 3 years ago

Arf. I got my hands on some data and I see the issue now. The data is a tagged array, with heterogeneous elements: a bytestring, a map, another bytestring and another bytestring.

However, the library only allows decoding lists whose elements all have the same type, not arbitrary arrays. That'd be a nice feature to add.

simonh1000 commented 3 years ago

Correct. As I only need specific indices of the array, an index function would work in this case too, but it is perhaps a less general fix. I suspect I will have a related ask for dictionaries once I get through the top level of the data.

KtorZ commented 3 years ago

Hey, I had a quick stab at it this morning. Looking at the EU Digital Green Certificates, we can see that the outer-most structure is a tagged COSE envelope. Using the new primitives introduced in #2, you should be able to decode it as such:

type alias CoseEnvelope =
    { protected : Bytes
    , unprotected : ()
    , payload : Bytes
    , signature : Bytes
    }

decoder =
    D.tagged (Tag.Unknown 18) <|
        D.array <|
            D.map4 CoseEnvelope
                D.bytes
                (D.record <| D.succeed ())
                D.bytes
                D.bytes

(Note that the unprotected field really is an empty map in the specs.) The payload and protected fields are themselves CBOR-encoded structures, which can also be decoded: you could use D.bytes |> D.andThen ... if you wanted to do it in one go. I'll maybe also provide a nice primitive for that, like 'nested' or something similar.

If you're interested, I also found some nice test data in the official repository. For example, the second QR code, once base45-decoded and inflated, gives you the following encoded COSE/CBOR bytestring: https://github.com/eu-digital-green-certificates/dgc-testdata/blob/main/FR/2DCode/raw/DCC_Test_0002.json#L27

Let me know if #2 helps, I'll take the time to make it a proper release sometime this week.

simonh1000 commented 3 years ago

Sweeet - here's the final result

type alias CoseEnvelope =
    { protected : Bytes
    , unprotected : ()
    , payload : Bytes
    , signature : Bytes
    }

decCoseEnvelope =
    CD.tagged (CDTag.Unknown 18) <|
        CD.array <|
            CD.map4 CoseEnvelope
                CD.bytes
                (CD.record <| CD.succeed ())
                CD.bytes
                CD.bytes

type alias GreenPass =
    { country : String
    , d1 : Int
    , d2 : Int
    , passData : PassData
    }

decGreenPass =
    CD.record <|
        CD.map4 GreenPass
            (CD.pair CD.int CD.string |> CD.map Tuple.second)
            (CD.pair CD.int CD.int |> CD.map Tuple.second)
            (CD.pair CD.int CD.int |> CD.map Tuple.second)
            (CD.pair CD.int decodePassOuter |> CD.map Tuple.second)

decodePassOuter : CD.Decoder PassData
decodePassOuter =
    CD.record <|
        CD.map Tuple.second <|
            CD.pair CD.int decodePass

type alias PassData =
    { vaccine : List Vaccine
    , dob : String
    , user : User
    }

decodePass =
    CD.record <|
        CD.map3 (\( _, v ) dob ( _, u ) -> PassData v dob u)
            (CD.pair CD.string <| CD.list decodeVaccine)
            ds
            (CD.pair CD.string decodeUser)

type alias Vaccine =
    { dose : Int
    , make : String
    }

decodeVaccine : CD.Decoder Vaccine
decodeVaccine =
    let
        dec1 =
            CD.map3 (\ci co dn -> ( co, dn )) ds ds di

        dec2 =
            CD.map3 (\dt is ma -> ()) ds ds ds

        dec3 =
            CD.map4 (\mp sd tg vp -> ()) ds di ds ds
    in
    CD.record <|
        CD.map3 (\( co, dn ) _ _ -> Vaccine dn co) dec1 dec2 dec3

type alias User =
    { given : String
    , family : String
    }

decodeUser : CD.Decoder User
decodeUser =
    CD.record <|
        CD.map4 (\fn gn _ _ -> User gn fn) ds ds ds ds

ds =
    CD.pair CD.string CD.string |> CD.map Tuple.second

di =
    CD.pair CD.string CD.int |> CD.map Tuple.second

KtorZ commented 3 years ago

I believe this is now fixed in

https://package.elm-lang.org/packages/elm-toulouse/cbor/1.1.0

See the CHANGELOG on: https://github.com/elm-toulouse/cbor/releases/tag/1.1.0

Thanks for the feedback, it's heartwarming to see that this was of any use to someone :pray: Have a great one!

simonh1000 commented 3 years ago

And thank you very much, too.