byuccl / bfat

Bitstream Fault Analysis Tool
Apache License 2.0

`find_fault_bits.py` doesn't work with designs that have the letter 'b' in them #10

Closed bunnie closed 2 years ago

bunnie commented 2 years ago

Steps to reproduce:

Example:

Design name is 'betrusted_soc'. Header looks like this:

00000000  00 09 0f f0 0f f0 0f f0  0f f0 00 00 01 61 00 2f  |.............a./|
00000010  62 65 74 72 75 73 74 65  64 5f 73 6f 63 3b 55 73  |betrusted_soc;Us|
00000020  65 72 49 44 3d 30 58 46  46 46 46 46 46 46 46 3b  |erID=0XFFFFFFFF;|
00000030  56 65 72 73 69 6f 6e 3d  32 30 32 30 2e 32 00 62  |Version=2020.2.b|
00000040  00 0c 37 73 35 30 63 73  67 61 33 32 34 00 63 00  |..7s50csga324.c.|
00000050  0b 32 30 32 32 2f 30 38  2f 30 32 00 64 00 09 31  |.2022/08/02.d..1|
00000060  34 3a 31 35 3a 33 38 00  65 00 21 72 8c ff ff ff  |4:15:38.e.!r....|
00000070  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
00000080  ff ff ff ff ff ff ff ff  ff ff ff ff ff 00 00 00  |................|
00000090  bb 11 22 00 44 ff ff ff  ff ff ff ff ff aa 99 55  |..".D..........U|

This line:

https://github.com/byuccl/bfat/blob/42c4a788721a6729a65585d0cc3ae371e69ed580/bitread.py#L98

Finds the first letter 'b' (byte 0x62, decimal 98) in the file and treats it as the start of the part number record. Here that byte is the first character of the design name, not the record key.

Workaround:

dd if=betrusted_soc.bit of=betrusted_soc_trunc.bit skip=30 bs=1

This simply lops off the design name so that the script can run on the resulting _trunc.bit file.

A more permanent solution might be to look for a more robust sentinel when parsing the bitstream. I'm not familiar enough with the .bit format to recommend what that would be, but searching for the 'b' together with its leading and trailing '00', i.e. the sequence [0x00, 0x62, 0x00], might be robust, since the design name is terminated with a ; character and not a null. See also http://www.pldtool.com/pdf/fmt_xilinxbit.pdf. This sequence would work for any part number shorter than 255 characters (a longer one would put a 0x01 after the 0x62), but I don't know of any Xilinx part numbers that are that long.
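A minimal sketch of both ideas, using only the bytes visible in the hexdump above. The layout it assumes (a 2-byte big-endian length after each of the keys 'a' to 'd', and a 4-byte length after 'e') is inferred from that dump and the linked format note, not taken from bitread.py. The function names are hypothetical.

```python
import struct

# Header bytes transcribed from the hexdump above, up to and including
# the 4-byte length that follows the 'e' key.
SAMPLE_HEADER = (
    bytes.fromhex("00090ff00ff00ff00ff000000161002f")
    + b"betrusted_soc;UserID=0XFFFFFFFF;Version=2020.2\x00"
    + b"b\x00\x0c" + b"7s50csga324\x00"
    + b"c\x00\x0b" + b"2022/08/02\x00"
    + b"d\x00\x09" + b"14:15:38\x00"
    + b"e\x00\x21\x72\x8c"
)


def find_part_key_by_sentinel(data: bytes) -> int:
    """Offset of the 'b' key byte, located via the 00 62 00 sequence."""
    idx = data.find(b"\x00\x62\x00")
    if idx < 0:
        raise ValueError("part-number sentinel not found")
    return idx + 1  # skip the null that ends the previous field


def parse_header_fields(data: bytes) -> dict:
    """Walk the length-prefixed fields instead of searching for a byte value."""
    (magic_len,) = struct.unpack_from(">H", data, 0)
    pos = 2 + magic_len + 2  # magic block, then the 2-byte 00 01 word
    fields = {}
    while pos < len(data):
        key = chr(data[pos])
        pos += 1
        if key == "e":
            # Assumed: the payload length is 4 bytes, unlike the other fields.
            fields["e_length"] = struct.unpack_from(">I", data, pos)[0]
            break
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        fields[key] = data[pos:pos + length].rstrip(b"\x00").decode("ascii")
        pos += length
    return fields


if __name__ == "__main__":
    print(hex(SAMPLE_HEADER.find(b"b")))                  # naive search
    print(hex(find_part_key_by_sentinel(SAMPLE_HEADER)))  # sentinel search
    print(parse_header_fields(SAMPLE_HEADER))
```

On this sample header the naive search stops at offset 0x10, inside the design name, while the sentinel search and the field walk both reach the real key byte at offset 0x3f. Walking the fields does not depend on any particular byte value being absent from the design name, which is why it may be the safer of the two.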

ethanrcampbell02 commented 2 years ago

I fixed that oversight, thanks for letting us know!

bunnie commented 2 years ago

Still trying to get the tool to run. How long does it usually take to generate output on a medium-sized design (XC7S50 at 70% utilization)? I've let the script run for 24 hours now; I'm guessing I've configured something wrong and this is not the expected runtime.

ethanrcampbell02 commented 2 years ago

You are correct, that is not the expected runtime. I am aware of some performance issues that might occur on smaller designs (funnily enough) when running find_fault_bits.py. I am working on fixing that right now, but again, that has only really applied to a tiny design with a single utilized CLB tile. I will let you know when I push those fixes, as they rework some of the functions a bit, which may solve your issue.

It will also include a debug flag which will print some timing information whenever important functions finish, so we could potentially narrow down where the issue is occurring if it is not fixed in my next commit.

In the meantime, are there any optional flags that you set on the command line when running find_fault_bits.py? Have you sourced a version of Vivado?

EDIT: I forgot to mention that typical runtime on my machine has been 90-120 seconds, so if it has taken 24 hours something has definitely gone wrong.

mithro commented 2 years ago

@bunnie - it might be helpful to share your design files with them.

wirthlin commented 2 years ago

Is there a verbose mode that gives feedback that BFAT is doing something?

It could be that there are too many bits to evaluate. BFAT is designed to evaluate a handful of bits and we haven't looked at large bit lists yet.

We do know that the Vivado interface is very slow and we are working on adding a RapidWright interface to speed things up.

ethanrcampbell02 commented 2 years ago

@wirthlin - There is somewhat of a verbose mode for the main BFAT tool (outputs timing details after major functions return), and I have made one for find_fault_bits.py as well but it has not been made public yet.

Though, I doubt that the issue is too many bits, as they are running find_fault_bits.py which only generates a list of 7 fault bits. The issue is likely within find_fault_bits.py itself.

ethanrcampbell02 commented 2 years ago

@bunnie - Commit 415fab1 should improve the performance of the script. I can't say for sure whether it will fix your problem without running your design through it myself, though. I also added a flag -d which will print timing details after major functions in the tool return. So, if you still run into issues and do not want to send your design files, I would appreciate it if you could send the console log so I can determine which function you are getting stuck in.
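For readers unfamiliar with this kind of debug output, here is a rough sketch of how a flag like -d could print timing details after each major function returns. It is not BFAT's implementation: the `timed` decorator, the `DEBUG` switch, and `load_design` are hypothetical names used only for illustration.

```python
import argparse
import functools
import time

DEBUG = False  # hypothetical module-level switch, set from the command line


def timed(func):
    """Print how long func took, but only when DEBUG is enabled."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            if DEBUG:
                elapsed = time.perf_counter() - start
                print(f"[debug] {func.__name__} returned after {elapsed:.3f} s")
    return wrapper


@timed
def load_design(n):
    # Stand-in for an expensive step, such as querying Vivado.
    return sum(range(n))


def main(argv=None):
    global DEBUG
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--debug", action="store_true",
                        help="print timing details after major functions")
    args = parser.parse_args(argv)
    DEBUG = args.debug
    return load_design(1000)


if __name__ == "__main__":
    main()
```

With output like this in a console log, the last function name printed before the tool stalls points at the step that is stuck.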

bunnie commented 2 years ago

Having the latest Vivado helped a lot. I was hoping it could run under 2020.1, but it couldn't, so I bit the bullet and did a 2022.1 install. The tool is running now and generating output; it takes about a minute to run.

I'm using Python 3.8...I noticed your docs say to use Python 3.9. Usually I'm able to get away with a slightly older Python tho... but, could this explain why every time I run the tool, I get a different answer that seems to consist of exactly one line of JSON?

For example, on one run I get this:

[
    [
        [
            "00000a1a",
            "096",
            "31"
        ]
    ],
    [
        [
            "00001487",
            "065",
            "06"
        ]
    ],
    [
        [
            "0000148e",
            "065",
            "07"
        ]
    ],
    [
        [
            "0000148a",
            "065",
            "07"
        ]
    ],
    [
        [
            "0000002a",
            "000",
            "00"
        ]
    ],
    [
        [
            "00020220",
            "097",
            "15"
        ],
        [
            "00001595",
            "093",
            "07"
        ]
    ]
]

and then the next run -- on the exact same bitstream and dcp file -- I get this:

[
    [
        [
            "00000da2",
            "068",
            "31"
        ]
    ],
    [
        [
            "00400415",
            "070",
            "09"
        ]
    ],
    [
        [
            "00020105",
            "030",
            "04"
        ]
    ],
    [
        [
            "0002010e",
            "030",
            "27"
        ]
    ],
    [
        [
            "0000002a",
            "000",
            "00"
        ]
    ],
    [
        [
            "0040039a",
            "000",
            "15"
        ],
        [
            "00020215",
            "095",
            "07"
        ]
    ]
]

I'd like to avoid upgrading Python on this machine if I can, just because it's going to affect everything else I do with it. Upgrading Vivado was more about freeing up the disk space; setting the version is easy enough with a local script. Maybe there is a way to specify a second Python version for a specific environment and I just don't know that trick, but mostly I just run with what Ubuntu packages and ships in 20.04 LTS.

I can also upload the bitstream and DCP if it helps.

bunnie commented 2 years ago

Ah. I have just discovered you can install Python 3.9 on Ubuntu 20.04 and it doesn't overwrite the system default, and you can just run it as python3.9.

So now I'm running Vivado 2022.1 and Python 3.9 on a design compiled using Vivado 2022.1, and I still see the same behavior: exactly one failure is reported in certain categories, but the exact failure changes from run to run, even though the inputs are the same.

I'll open a new issue for this.

ethanrcampbell02 commented 2 years ago

That is actually the intended output of the script. It is supposed to generate an arbitrary list of fault bits that create some of the most common types of faults you will see when running BFAT, such as a LUT initialization bit flip or an open within a net. It will not generate the same output every time.
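One way to sanity-check this behavior is to compare the shape of two runs rather than their contents. The sketch below uses the two outputs quoted earlier in the thread. It assumes that each outer group corresponds to one fault type and that each innermost triple is a frame address, a word index, and a bit index; that reading is an interpretation, not something stated in this thread. `shape` and `parse_bit` are hypothetical helpers.

```python
import json

# The two outputs quoted earlier in this thread, compacted.
RUN_1 = json.loads("""[
  [["00000a1a", "096", "31"]],
  [["00001487", "065", "06"]],
  [["0000148e", "065", "07"]],
  [["0000148a", "065", "07"]],
  [["0000002a", "000", "00"]],
  [["00020220", "097", "15"], ["00001595", "093", "07"]]
]""")

RUN_2 = json.loads("""[
  [["00000da2", "068", "31"]],
  [["00400415", "070", "09"]],
  [["00020105", "030", "04"]],
  [["0002010e", "030", "27"]],
  [["0000002a", "000", "00"]],
  [["0040039a", "000", "15"], ["00020215", "095", "07"]]
]""")


def shape(run):
    """Number of bits in each group, ignoring which bits they are."""
    return [len(group) for group in run]


def parse_bit(triple):
    """Assumed layout: hex frame address, decimal word, decimal bit."""
    addr, word, bit = triple
    return int(addr, 16), int(word), int(bit)


if __name__ == "__main__":
    print(shape(RUN_1), shape(RUN_2))
    print(RUN_1 == RUN_2)
    print(parse_bit(RUN_1[0][0]))
```

Both runs have six groups of sizes 1, 1, 1, 1, 1, 2, even though most of the individual bits differ. That matches a script that picks arbitrary example bits for a fixed set of fault categories.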