CajuM / fb-graphql-schema

Decoding Facebook's GraphQL API schema

flatc segmentation fault with FB APK version 338 #1

Open visualsayed opened 3 years ago

visualsayed commented 3 years ago

Hi,

Thank you very much for your effort developing this lib.

I always get the following error when running `flatc --json --strict-json --raw-binary graph_metadata.fbs -- graph_metadata.bin`:

```
zsh: segmentation fault  flatc --json --strict-json --raw-binary graph_metadata.fbs --
```

I think that the graph_metadata.bin file is corrupted in APK version 338.0.0.13.118: it ships with the extension graph_metadata.bin.xzs, and decompressing it resulted in what looks like a corrupted file.

Could you please try it with the latest version if you have time, or give me a hint here?

Thanks, Sayed

CajuM commented 3 years ago

Hello,

I've encountered this issue as well. I think it's due to Facebook having changed their FlatBuffers schema in the meantime. I don't think graph_metadata.bin is corrupted; it extracted successfully with xz, which includes a checksum of the compressed contents.

graph_metadata.fbs will have to be re-engineered for this version of the APK. That said, a quick look with `strings` at the uncompressed graph_metadata.bin does not show significant modifications to the GraphQL API.

visualsayed commented 3 years ago

Hi CajuM,

Thank you very much for your reply.

Yes, you are right, it's not a corrupted file. I found out that the last working version with the old FlatBuffers schema is v293.

Thanks a lot for your great support. I will wait for your updated version of graph_metadata.fbs if you have enough time to work on it; it would be greatly appreciated.

visualsayed commented 3 years ago

Hi CajuM,

Could you please clarify how to re-engineer graph_metadata.fbs?

Thanks

CajuM commented 3 years ago

I wrote a blog post once; I'm not sure if it explains things well enough: https://cajum.github.io/fbgraphql/

harsh-im commented 1 year ago

Hey @CajuM, hope you are well.

Have you updated or re-engineered the graph_metadata.fbs file? If not, could you please explain how to reconstruct it?

Thanks

CajuM commented 1 year ago

Hello, I don't plan on reverse-engineering it again. I tried explaining how to do so in the blog post I mentioned; should I rewrite it to be more explicit about the decoding step?

harsh-im commented 1 year ago

Hello @CajuM ,

Thanks for the reply.

I have just started exploring the field of reverse engineering. If you have some time, please rewrite the blog post and explain the decoding step explicitly. It would be very helpful for me and for people to come.

Thanks in advance!

harsh-im commented 1 year ago

Hey @CajuM ,

Any updates?

Aziz-code commented 1 year ago

Hey guys,

Can you please help me understand the decoding process? It is really important for me to know this.

Please any help/hints will be appreciated.

Thanks

CajuM commented 1 year ago

Ok, until I find the time to document this properly... https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md and https://flatbuffers.dev/md__internals.html

We'll be using the `fbs.py` module provided in this repo. `fbs` assumes that the binary is stored in a buffer; we'll call it `data`.
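For concreteness, a minimal way to get such a buffer (the file name is the one from this thread; it assumes the `.xzs` payload has already been extracted from the APK and decompressed with xz):

```python
# Load the decompressed FlatBuffers blob into memory as one bytes buffer.
with open('graph_metadata.bin', 'rb') as f:
    data = f.read()
```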

As per the documentation, a FlatBuffers buffer begins with an offset to the root table. So we know that at address 0 there is a uint32_t which we must de-reference to get the offset of the root table. We introduce the following function: `def deref_offset(data, offset, reverse=False):`. As previously mentioned, `data` is our buffer, `offset` points to the datatype of interest in our buffer, and `reverse` is used when we de-reference in the opposite direction.
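A minimal sketch of what such a helper could look like (this is my reading of the description above, not necessarily the exact code in `fbs.py`):

```python
import struct

def deref_offset(data, offset, reverse=False):
    # Relative offsets in FlatBuffers are little-endian 32-bit integers.
    # A forward reference (e.g. the root offset at address 0) points
    # ahead of its own location; reverse=True handles references that
    # point behind it, such as a table's offset to its vtable.
    (rel,) = struct.unpack_from('<i', data, offset)
    return offset - rel if reverse else offset + rel
```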

So, to get the root table's offset we do `root = deref_offset(data, 0)`. Every table has a piece of metadata associated with it called a vtable; its offset is stored in an int32_t at offset 0 relative to the table start, which in our case is `root`. We'll introduce a function `def get_table_vt(data, offset):` to decode the vtable. It returns the vtable's length, the table's length, and the table's entries, including optional ones. It is important to decode the vtable because a table can have optional elements and can also contain padding. The entries array is composed of offsets inside the table pointing to inline elements; together with the table length, this helps our heuristic for deducing their sizes and types. In our case we call the function like so: `vt_len, tbl_len, entries = get_table_vt(data, root)`, and we could get, say, `entries = [4, 8, None, 12]` with `tbl_len = 16`. Take note: the entry offsets are always greater than or equal to 4, since the first four bytes of a table hold the vtable offset, and fields can be padded. We can assume that scalar types are always aligned to their size.
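Under the vtable layout described in the FlatBuffers docs linked above (a uint16 vtable size, a uint16 table size, then one uint16 slot per field, where 0 marks an absent field), a sketch of the decoder could look like this:

```python
import struct

def get_table_vt(data, table_pos):
    # The first 4 bytes of a table are a signed offset to its vtable,
    # pointing backwards: vtable_pos = table_pos - soffset.
    vt_pos = deref_offset(data, table_pos, reverse=True)
    # A vtable starts with two uint16 values: its own byte length and
    # the byte length of the table it describes (padding included).
    vt_len, tbl_len = struct.unpack_from('<HH', data, vt_pos)
    # The remaining (vt_len - 4) / 2 uint16 slots hold each field's
    # offset relative to table_pos; 0 means the field is absent.
    entries = []
    for i in range((vt_len - 4) // 2):
        (off,) = struct.unpack_from('<H', data, vt_pos + 4 + 2 * i)
        entries.append(off if off else None)
    return vt_len, tbl_len, entries
```

With the example values above, a four-field table with one absent field would come back as `vt_len = 12`, `tbl_len = 16`, `entries = [4, 8, None, 12]`.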

The simplest assumption here would be that we have three uint32_t fields, but this need not be the case: we could also have structs, arrays, vectors, strings or other tables. For the latter three, we can test our hypothesis by attempting to decode the data type; if decoding fails, or we get absurd values such as offsets that exceed the buffer or non-UTF-8 strings, we can assume the hypothesis is false. Structs can sometimes be identified by a length that is not a power of two, but a struct can also have the same length as a scalar. This ambiguity can be resolved by confronting it with the alternative hypotheses: instead of a struct A {a1: uint16_t, a2: uint16_t} we could have a uint32_t or a padded uint16_t. One heuristic we can apply is to check the values under each variant: if A.a2 is always zero, or the uint32_t is always a multiple of 65536, we likely have a padded uint16_t. Checks on values need not be limited to the stored data type; we can also verify them semantically, that is, whether in the given context a certain type and purpose would fit better. For example, we can check whether a uint16_t is an offset inside a vector.
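As an illustration of that value-based heuristic, here is a hypothetical check over many samples of the same 4-byte slot (the function name and its exact decision rules are my own, not part of `fbs.py`):

```python
def classify_u32_slot(values):
    # `values` holds the same 4-byte field read as uint32 from many
    # table instances. On a little-endian buffer, the low 16 bits are
    # the first member of a would-be struct {a1: uint16, a2: uint16}.
    if all(v < 0x10000 for v in values):
        # High half always zero: A.a2 is always zero, so this is more
        # plausibly a padded uint16 than a genuine uint32.
        return 'padded uint16 (data in the low half)'
    if all(v % 0x10000 == 0 for v in values):
        # Low half always zero: every value is a multiple of 65536.
        return 'uint16 in the high half, or A.a1 unused'
    return 'plausibly a real uint32, or a struct of two live uint16s'
```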

So far we have discussed only table ambiguities, but these apply to vectors as well. A vector begins with an int32_t element count, but this does not tell us its element type or size. Here we may test various data types, though structs or arrays may still pose ambiguities; as such, looking at a hex dump of the data may help. One can also verify that the vector does not overlap with other data structures. Strings are vectors that are null-terminated and most likely contain UTF-8 data.
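For example, the string hypothesis can be tested directly, since every violated invariant falsifies it. A sketch, reusing the `deref_offset` helper from above:

```python
import struct

def try_decode_string(data, field_pos):
    # A string field is a forward offset to a payload that starts with
    # a uint32 length, followed by `length` bytes and a NUL terminator.
    pos = deref_offset(data, field_pos)
    if pos < 0 or pos + 4 > len(data):
        return None                  # offset runs outside the buffer
    (length,) = struct.unpack_from('<I', data, pos)
    end = pos + 4 + length
    if end >= len(data) or data[end] != 0:
        return None                  # absurd length or missing NUL
    try:
        return data[pos + 4:end].decode('utf-8')
    except UnicodeDecodeError:
        return None                  # not UTF-8: likely not a string
```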

A top-down approach can also be employed: if one knows the data type of an element, one can test whether there are any offsets pointing to it.
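A brute-force version of that test could simply scan the buffer for forward offsets landing on the known element (4-byte alignment is assumed here for simplicity):

```python
import struct

def find_references_to(data, target):
    # Every aligned uint32 that, read as a forward relative offset,
    # lands exactly on `target` is a candidate reference: a table
    # field, a vector slot, or the root offset itself.
    hits = []
    for pos in range(0, len(data) - 3, 4):
        (rel,) = struct.unpack_from('<I', data, pos)
        if rel and pos + rel == target:
            hits.append(pos)
    return hits
```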

CajuM commented 1 year ago

I'm curious, what is it that you need this so badly for?