ARM-software / tf-issues

Issue tracking for the ARM Trusted Firmware project
37 stars 16 forks source link

libfdt: Import version v1.4.7 has bad impacts on performances #643

Closed nlebayon closed 5 years ago

nlebayon commented 5 years ago

On stm32mp1 platform, we are using Device tree and we observed a performance issue after the last rebase on libfdt 1.4.7 (https://github.com/ARM-software/arm-trusted-firmware/commit/630b011ffd6a83436581af71d770ddd9fe01f747).

For information, here are total times we spend in the FDT functions in both cases:

After a deeper analysis, we deduced that it was due to fdt32_ld / fdt64_ld function implementations, and more precisely to the memcpy operations inside, added for alignment reasons.

Note that on U-Boot community, as on Kernel one, they did not upgrade their FDT library on this 1.4.7 release, they are presently using libfdt 1.4.6.9. And even if we did not found any discussion on this subject on the dedicated github, we can suppose that they didn't switch to the latest release due to this performance degradation.

On our side, we had to modify these functions in order to remove the memcpy call thanks to a preprocessor conditional: static inline uint32_t fdt32_ld(const fdt32_t *p) { fdt32_t __unused v;

if FDT_NO_ALIGNMENT_CHECK

assert((*p & 0x3) == 0);
return fdt32_to_cpu(*p);

else

memcpy(&v, p, sizeof(v));
return fdt32_to_cpu(v);

endif

}

And the equivalent in fdt64_ld with a 0x7 mask. With this modification, we recover more or less libfdt 1.4.2 performances (only +6% gap).

Did anyone else observed this kind of issue? We tried to give as much information as possible.

ghost commented 5 years ago

Ok, this is quite unfortunate. I think that the right thing to do then is to downgrade libfdt. Do you know if 1.4.6.9 has the same problem?

nlebayon commented 5 years ago

libfdt 1.4.6.9 is fine, and presently used by master of KERNEL and U-BOOT. I made the downgrade test and I observed on my setup a 4% gap of performances, so even less that the patch I proposed above. Indded this is the best thing to do for the moment, you're right.

ghost commented 5 years ago

How do you know the patch version in libfdt? The repository only has tags for 1.4.6 and 1.4.7. Is it the number of patches on top of the tag?

nlebayon commented 5 years ago

I looked the last libfdt on kernel and u-boot repositories, and they stayed on this commit: https://github.com/dgibson/dtc/commit/aadd0b65c987d21776354800fdf24ad35c0ffe68 They aren't using an official tag. But if it can be useful, I can test with official 1.4.6 if you prefer to keep an official tag

ghost commented 5 years ago

Yes, I think that would be better.

ghost commented 5 years ago

And the patch you are pointing at is exactly 9 commits after 1.4.6, which makes sense.

ghost commented 5 years ago

Actually, wait, 1.4.6.9 has one commit that we want to have: https://github.com/dgibson/dtc/commit/65893da4aee04882affc30f3ff3265fa483bdb8e I guess we can use 1.4.6.9.

nlebayon commented 5 years ago

The commit introducing delays arrived later: https://github.com/dgibson/dtc/commit/6dcb8ba408ec18009083acaa9f85429afa39e453 So either we use the same as kernel, or we choose the commit just before this one.

nlebayon commented 5 years ago

I created an issue on DTC, let's wait for their answer also.

ghost commented 5 years ago

It's probably better to stick to the one that the kernel and U-Boot are using. As a matter of fact, we can probably start doing that as a rule (unless we need to update because of a bugfix).

nlebayon commented 5 years ago

I'm okay with your proposal yes.

ghost commented 5 years ago

I've raised this PR: https://github.com/ARM-software/arm-trusted-firmware/pull/1657

By the way, we were wondering how the functions can take that a long time to be done. Are you talking about milliseconds or microseconds?

nlebayon commented 5 years ago

Thanks for your PR, I'm gonna test it on my platform, in order to confirm that I obtain the expected results, and will do a +1 on it. Just before, I read your description, and you said "a big performance hit to functions that modify the FDT". In our case, we don't modify the DT, it is in read-only mode. So this performance issue affects all kinds of FDT accesses.

nlebayon commented 5 years ago

Regarding your second question, timing unit in the top message is milliseconds. You will find in the attached file a summary of all measures I did last week, with different libfdt releases. For each function, I measured each call and I did the sum. At the end, you'll find the overall timings I indicated earlier.

As you can see, some functions are quite long, this is due to device tree parsing time. Searching a compatible, a phandle or a parent can be very long indeed.

fdt_overall_timings.txt

ghost commented 5 years ago

I didn't notice that you were only reading the DT, I imagined that it was due to modifications, so this is actually quite surprising. Thanks for your detailed breakdown.

ghost commented 5 years ago

Also, could you point me at the DT issue that you have created?

nlebayon commented 5 years ago

When looking deeply inside the source code, we can observe that the fdt32_ld inline function (which introduces additional memcpy operations) is only used in fdt_ro.c, which contains read-only access functions. Same for fdt64_ld.