Do you ignore the first block?

lud commented 3 years ago

Hi,

I'm trying to do an implementation in Elixir, following the HACKING document, but I get unknown values for the first byte of some chunks. I tried to follow the C code but I have never learned C and it is hard!

I found this piece of code, and it looks to me like the first block is ignored, as you start on block index 1. Block 0 is not the header block.

    int next_block = 2;
    int *blocks_visited = calloc(file->num_blocks, sizeof(int));
    do {
        fmp_block_t *block = file->blocks[next_block-1];

Can you confirm?

Thank you

evanmiller commented 3 years ago

@lud Yes, I ignore the first block in FP5 files.

https://github.com/evanmiller/fmptools/blob/ea1df02c2e6b2504c800888ed4fc621ce4f07b48/HACKING#L68-L73

lud commented 3 years ago

Hi !

I saw that you use fseek with two times the sector size in case of fp5 file format.

But in my case I have to find data buried in an obscure fp7 file, not fp5. I changed your code to initialize next_sector to 1 instead of 2 and I get an error of an unknown chunk code with the same value (first chunk byte) as with my Elixir code.

So it seems to me that you start handling sectors from the file->blocks[2 - 1] (aka sector 1 and not 0). The fact that the first sector is ignored in fp5 files is another matter (using fseek with 2 * sector size before loading blocks from fp5 files). But obviously I may be wrong here.

I don't know C lang but when I see this:

fmp_error_t process_blocks(fmp_file_t *file,
        block_handler handle_block,
        chunk_handler handle_chunk,
        void *user_ctx) {
    fmp_error_t retval = FMP_OK;
    /*
     ...
     */
    int next_block = 2;
    int *blocks_visited = calloc(file->num_blocks, sizeof(int));
    do {
        fmp_block_t *block = file->blocks[next_block-1];
        retval = process_block(file, block);
        blocks_visited[next_block-1] = 1;

I guess that the first block that is passed to process_block is file->blocks[2-1] a.k.a file->blocks[1] right?

Anyway, I should try to code against your sample files in the test directory. I compiled and successfully ran the sqlite version, but the data I need to retrieve was not available in the sqlite database. It was just NULL. The data are images. I don't even know how they are stored haha, probably as raw binary but declared as text. I just know they are there.

How did you know about the file structure described in the HACKING file? I added printf calls in your C version and found that if I took one more byte from the file for each chunk I got the same first-chunk-byte (I call them "tags") as the C code. So my understanding of the HACKING file is wrong in some way but I don't know exactly how.

evanmiller commented 3 years ago

If you're looking at fp7 files, you'll want to follow the _v7 logic in the code and the FMP12 notes in the HACKING file. The chunk structures are very different between the fp3/fp5 and fp7/fmp12 formats.

For debugging, try running the included fmpdump file on the file that you're interested in.

lud commented 3 years ago

The _v7 logic is called from process_block(), but my problem comes earlier, in process_blocks (plural).

I am not sure, reading the HACKING document, of the global structure of the fp7 file.

When running the dump on the test file data.fp7 I can see the following output:

== 0 -> [ BLOCK 2 ] -> 8 ==
== 2 -> [ BLOCK 8 ] -> 9 ==
== 8 -> [ BLOCK 9 ] -> 10 ==
== 9 -> [ BLOCK 10 ] -> 11 ==
== 10 -> [ BLOCK 11 ] -> 7 ==
== 11 -> [ BLOCK 7 ] -> 5 ==
== 7 -> [ BLOCK 5 ] -> 16 ==
== 5 -> [ BLOCK 16 ] -> 17 ==
== 16 -> [ BLOCK 17 ] -> 12 ==
== 17 -> [ BLOCK 12 ] -> 13 ==
== 12 -> [ BLOCK 13 ] -> 18 ==
== 13 -> [ BLOCK 18 ] -> 14 ==
== 18 -> [ BLOCK 14 ] -> 3 ==
== 14 -> [ BLOCK 3 ] -> 6 ==
== 3 -> [ BLOCK 6 ] -> 15 ==
== 6 -> [ BLOCK 15 ] -> 4 ==
== 15 -> [ BLOCK 4 ] -> 0 ==

As you can see, we start on block 2. Block 1 does not seem to be read.

That seems to be normal, reading the code:

fmp_error_t process_blocks(fmp_file_t *file,
        block_handler handle_block,
        chunk_handler handle_chunk,
        void *user_ctx) {
    fmp_error_t retval = FMP_OK;
    /*
    fmp_block_t *block = file->blocks[0];                                        <-- block index 0 is commented
    process_block(file, block);
    if (!handle_block || handle_block(block, user_ctx))
        process_chunk_chain(file, block->chunk, handle_chunk, user_ctx);
        */
    int next_block = 2;                                                          <-- start with next_block = 2
    int *blocks_visited = calloc(file->num_blocks, sizeof(int));
    do {
        fmp_block_t *block = file->blocks[next_block-1];                         <-- starting at index 1  (next_block - 1)
        retval = process_block(file, block);
        blocks_visited[next_block-1] = 1;
        if (retval != FMP_OK) {
            /*
            fprintf(stderr, "ERROR processing block, reporting partial results...\n");
            block->this_id = next_block;
            if (!handle_block || handle_block(block, user_ctx))
                process_chunk_chain(file, block->chunk, handle_chunk, user_ctx);
                */
            break;
        }
        block->this_id = next_block;                                             <-- assign this_id to 2
        if (!handle_block || handle_block(block, user_ctx))                      <-- handle_block will call start_block for dumps, and this will print "BLOCK 2"
            retval = process_chunk_chain(file, block->chunk, handle_chunk, user_ctx);

The block with index 0 is created here:

    if (!fread(sector, file->sector_size, 1, file->stream)) {
        retval = FMP_ERROR_READ;
        goto cleanup;
    }

    printf("%s\n", "first_block");
    first_block = new_block_from_sector(file, sector, &retval);
    if (!first_block)
        goto cleanup;

    if (first_block->next_id == 0 ||
        (first_block->next_id + 1 + (file->version_num < 7)) * file->sector_size != file->file_size) {
        retval = FMP_ERROR_BAD_SECTOR_COUNT;
        goto cleanup;
    }

    file = realloc(file, sizeof(fmp_file_t) + first_block->next_id * sizeof(fmp_block_t *));
    if (!file) {
        retval = FMP_ERROR_MALLOC;
        goto cleanup;
    }
    file->num_blocks = first_block->next_id;
    file->blocks[0] = first_block;

I read the documentation for C's fread, and if I understood correctly, each call advances the file position. Since read_header has already called fread, the position has moved past the header, so this block with index 0 is not the header.

In the data.fp7 file, the block with index 0 has 18 as the next ID, just like the "BLOCK 13" (of index 12 I guess). It also has 0 as the previous ID. But we never see == 0 -> [ BLOCK 1 ] -> 18 == in the output.

So, to me it looks like there is the header block (4096), then the block of index 0 (BLOCK 1), then the remaining blocks, starting from index 1. And process_blocks starts with this index 1 and never goes to the block with index 0.
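The layout guessed above can be written down as arithmetic. This is only a sketch under that guess: the helper names are hypothetical and not part of fmptools, and the extra skipped sector for pre-fp7 files is inferred from the `(file->version_num < 7)` term in the sector count check quoted earlier.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of fmptools. They encode the layout
 * guessed in this thread: one header sector, plus one extra skipped
 * sector for formats older than fp7, followed by the blocks in ID order. */

/* Number of sectors that precede the block with ID 1 (blocks[0]). */
static size_t leading_sectors(int version_num) {
    return 1 + (version_num < 7 ? 1 : 0);
}

/* Byte offset of the block with 1-based ID block_id. */
static size_t block_offset(int version_num, size_t sector_size, size_t block_id) {
    return (leading_sectors(version_num) + block_id - 1) * sector_size;
}

/* File size implied by the block count taken from the first block's
 * next ID; mirrors the FMP_ERROR_BAD_SECTOR_COUNT check quoted above. */
static size_t expected_file_size(int version_num, size_t sector_size, size_t num_blocks) {
    return (num_blocks + leading_sectors(version_num)) * sector_size;
}
```

If the sectors of data.fp7 are 4096 bytes, as assumed above, then with 18 blocks `expected_file_size(7, 4096, 18)` gives 19 * 4096 = 77824 bytes, and BLOCK 2 would start at `block_offset(7, 4096, 2)`, i.e. byte 8192.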

Now I don't want to bother you with debugging, I just hope that it will remind you of something and maybe you could confirm that my understanding is correct, and maybe explain why you commented out the "block index 0" section.

Thank you.

evanmiller commented 3 years ago

The first block that is skipped appears to contain file metadata of some kind. I haven't fully reverse-engineered it, but in my work it's been safe to skip. The "next ID" in that first block (18 in your case) indicates the total number of blocks – so if you started parsing there, you'd go to the last block in the file (0 => [ BLOCK 18 ] => 14) and miss half of the chain.
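A small sketch makes the difference between the two starting points concrete. The `next_id` table is copied by hand from the data.fp7 dump earlier in this thread; `count_chain` and `MAX_BLOCKS` are hypothetical names for illustration, not fmptools functions.

```c
#include <stddef.h>

#define MAX_BLOCKS 64

/* next_id[i] is the "next ID" of block ID i, taken from the data.fp7
 * dump in this thread. Entry 1 holds 18, the value reported for the
 * skipped first block; entry 0 is unused because 0 ends the chain. */
static const int next_id[19] = {
    0, 18, 8, 6, 0, 16, 15, 5, 9, 10, 11, 7, 13, 18, 3, 4, 17, 12, 14
};

/* Follow next IDs from start until the terminator 0 and return how many
 * blocks were visited. The visited table guards against cycles, in the
 * same spirit as blocks_visited in process_blocks. */
static int count_chain(const int *next, int num_blocks, int start) {
    int visited[MAX_BLOCKS] = {0};
    int count = 0;
    int id = start;
    while (id > 0 && id <= num_blocks && id < MAX_BLOCKS && !visited[id]) {
        visited[id] = 1;
        count++;
        id = next[id];
    }
    return count;
}
```

With this table, `count_chain(next_id, 18, 2)` walks all 17 blocks shown in the dump, while `count_chain(next_id, 18, 1)` visits only 7: the first block itself, then 18, 14, 3, 6, 15 and 4.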

If there's important data in the first block of one of your files, I'd be happy to add a file to the test collection and try to figure out what's happening.

lud commented 3 years ago

Hi @evanmiller thank you for confirming what I thought.

I was able to work a little on that project, and I can now successfully parse all the chunk structures from the test file, indeed ignoring the first block.

I now have to rebuild the data from the different chunks before I can test on my own huge file. I hope that the first block will be insignificant here too, but otherwise I will let you know.

By the way, I think there is an error in the HACKING file for simple data. Those descriptors seem off:

    Offset  Length  Value
    0       1       (0x19 | 0x23)
    1       1       Value (Bytes)        

    Offset  Length          Value
    0       1               0x1A <= C <= 0x1D
    1       2*(C-0x19)      Value (Bytes)

I tried to understand the C code and found that for 0x19 .. 0x1D the data length is given at the second position in the chunk, but the start of the next chunk has to be moved further past the data: by 2 * (C - 0x19) bytes when C > 0x19, and by 1 byte when C == 0x19.

It seems to work, as my code continues and finds the same next chunks as yours.

Now I did not dig too deep into your code to see how you read the data from chunk->data.len, but I guess that you take data.len bytes starting from the data.bytes reference. So there are skipped bytes.

        } else if (c >= 0x19 && c <= 0x1D) {
            chunk->type = FMP_CHUNK_DATA_SIMPLE;
            p++;
            if (p >= end) {
                retval = FMP_ERROR_DATA_EXCEEDS_SECTOR_SIZE;
                free(chunk);
                break;
            }
            chunk->data.len = *p++;
            chunk->data.bytes = p;
            p += chunk->data.len + (c == 0x19) + 2*(c-0x19);     <-- moving p further than data length
        }

That would translate to the following Elixir code. I did not encounter a 0x23.

  # Offset  Length  Value
  # 0       1       (0x19 | 0x23)
  # 1       1       Value (Bytes)
  ##
  #
  # defp parse_chunk(<<tag, byte, rest::binary>>) when tag in [0x19, 0x23] do
  #   chunk = %{c: tag,t: :simple_data, value: byte}
  #   {chunk, rest}
  # end

  defp parse_chunk(<<0x23, _rest::binary>>) do
    raise "todo handle 0x23"
  end

  defp parse_chunk(<<0x19, len, rest::binary>>) do
    tag = 0x19
    buffer_len = 1
    <<value::binary-size(len), _buffer::binary-size(buffer_len), rest::binary>> = rest

    chunk = %{c: tag, t: :simple_data, value: value}
    {chunk, rest}
  end

  # Offset  Length          Value
  # 0       1               0x1A <= C <= 0x1D
  # 1       2*(C-0x19)      Value (Bytes)
  #
  #
  # The length seems to be given before at byte 1, but we remove a larger chunk
  # from the "rest"
  defp parse_chunk(<<tag, len, rest::binary>>) when tag in 0x1A..0x1D do
    buffer_len = 2 * (tag - 0x19)

    <<value::binary-size(len), _buffer::binary-size(buffer_len), rest::binary>> = rest

    chunk = %{c: tag, t: :simple_data, value: value}
    {chunk, rest}
  end

So to me the docs would look like the following:

Offset  Length  Value
0       1       0x19
1       1       N = Length (Integer)
2       N       Value (Bytes)
2+N     1       Unknown

Offset  Length          Value
0       1               0x1A <= C <= 0x1D
1       1               N = Length (Integer)
2       N               Value (Bytes)
2+N     2*(C-0x19)      Unknown (Bytes)

0x23 seems to be the basic tag/length/data without buffer
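For comparison with the Elixir version, here is a minimal C sketch of the same reading of tags 0x19 to 0x1D. It follows the pointer arithmetic of the C snippet quoted above (tag, one length byte, N value bytes, then the unexplained trailing bytes). `simple_chunk_t` and `parse_simple_chunk` are hypothetical names, not fmptools code, and the bounds check on the trailing bytes is an addition of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; not an fmptools structure. */
typedef struct {
    const uint8_t *value; /* points at the N value bytes */
    size_t value_len;     /* N, read from the byte after the tag */
    size_t total_len;     /* bytes consumed, including tag and trailer */
} simple_chunk_t;

/* Parse one chunk whose tag is in 0x19..0x1D from buf[0..len).
 * Returns 0 on success, -1 if the tag is out of range or the chunk
 * would run past the end of the buffer. */
static int parse_simple_chunk(const uint8_t *buf, size_t len, simple_chunk_t *out) {
    if (len < 2 || buf[0] < 0x19 || buf[0] > 0x1D)
        return -1;
    uint8_t tag = buf[0];
    size_t n = buf[1];
    /* Trailer size from the C expression (c == 0x19) + 2*(c-0x19):
     * 1 byte for 0x19, otherwise 2 * (tag - 0x19) bytes. */
    size_t trailer = (tag == 0x19) ? 1 : 2 * (size_t)(tag - 0x19);
    size_t total = 2 + n + trailer;
    if (total > len)
        return -1;
    out->value = buf + 2;
    out->value_len = n;
    out->total_len = total;
    return 0;
}
```

The next chunk would then start at `buf + total_len`. What the trailing bytes mean is still unknown here; the sketch only skips them.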

evanmiller commented 3 years ago

@lud It's possible/likely the descriptors are off - feel free to open a PR with a proposed correction. It might be helpful to run it on some files to make sense of the "Unknown" part. Possibly it's a key-value pair rather than a "simple data" structure. Basically I just called anything that I couldn't make sense of "simple data" and skipped it.

lud commented 3 years ago

Hi @evanmiller, this is still on my todo list, but I have been working on another project since my last comment. I am not sure I will be able to find out what the data is for (I am still astonished that you figured out all the structures just by inspecting files). But if my doc amendments (based on your C code) work, I will submit a PR.

Cheers :)

evanmiller commented 3 years ago

Sounds good, just cleaning up old discussions.
