Closed Nathan-Stohs closed 10 years ago
Yes. Is there a proposal?
From: Nathan Stohs [mailto:notifications@github.com] Sent: Thursday, July 10, 2014 12:50 PM To: Samraksh/MF Subject: [MF] emote: use of floating point in MF (#164)
The other day I noticed that even in our basic Hello World app, the interpreter in TinyCLR was trying to call floating point library functions. Would be good to get a better understanding of the situation. Again this is with the interpreter, not one of our drivers or other implementations.
My feeling is that FP should not be something that just happens...
— Reply to this email directly or view it on GitHubhttps://github.com/Samraksh/MF/issues/164.
This is a straight-forward characterization... somebody just needs to trace through some executions and gain an understanding of when, why, where, and how often TinyCLR is trying to use floating point. From there we determine if action is needed. Or maybe the result is that I'm wrong and FP in fact does not happen.
At this point I haven't actually determined it is a problem, I just don't think this should happen and the issue merits investigation into why. Otherwise there is no smoking gun that this is for example causing a performance issue or anything like that.
I put the task up to keep it on the company radar and it is otherwise open. If somebody wishes to pick this up they can assign themselves. Otherwise I would presume to pick this up myself at some point in the future.
If you see floating point instructions inside libc functions, this may be due to how Mentor Graphics chose to compile newlib libc.a. Newlib may be configured to almost fully avoid using any floating point variables.
Otherwise, TinyCLR uses floating point variables when it manipulates the HeapBlock, and also whenever managed code does certain type conversions involving floating point numbers. For debug builds, CLR_RT_HeapBlock::StoreToReference calls the conversion function. It should be possible to set TINYCLR_EMULATED_FLOATINGPOINT and avoid all CLR calls to floating point math library functions.
I don't see anything in a vanilla HelloWorld that would trigger a conversion. Disassembly of HelloWorld:
.method public hidebysig static void Main() cil managed
{
.entrypoint
// Code size 20 (0x14)
.maxstack 1
.locals init ([0] bool CS$4$0000)
IL_0000: nop
IL_0001: br.s IL_0010
IL_0003: nop
IL_0004: ldstr "program2 Hello World!"
IL_0009: call void [Microsoft.SPOT.Native]Microsoft.SPOT.Debug::Print(string)
IL_000e: nop
IL_000f: nop
IL_0010: ldc.i4.1
IL_0011: stloc.0
IL_0012: br.s IL_0003
} // end of method HelloWorld::Main
So I compiled a different custom HelloWorldMF and opened HelloWorldMF.exe inside ildasm.exe (Intermediate Language Disassembler). (Unfortunately ildasm does not directly recognize the packed MSpotV1 PE files). Here is HelloWorldMF:
public static void Main() {
long counter = 0;
while (true) {
Debug.Print(
Resources.GetString(Resources.StringResources.String1) + (counter++).ToString());
System.Threading.Thread.Sleep(5000);
}
}
Here is the output of HellowWorldMF Main(), with "Show source lines":
.method public hidebysig static void Main() cil managed
{
.entrypoint
// Code size 58 (0x3a)
.maxstack 4
.locals init ([0] int64 counter,
[1] int64 CS$0$0000,
[2] bool CS$4$0001)
.language '{3F5162F8-07C6-11D3-9053-00C04FA302A1}', '{994B45C4-E6E9-11D2-903F-00C04FA302A1}', '{5A869D0B-6611-11D3-BD2A-0000F80849BD}'
// Source File 'c:\Users\researcher\Documents\Visual Studio 2012\Projects\HelloWorldMF\HelloWorldMF\HelloWorldMF.cs'
//000011: public static void Main() {
IL_0000: nop
//000012: long counter = 0;
IL_0001: ldc.i4.0
IL_0002: conv.i8
IL_0003: stloc.0
//000013: while (true) {
//000014: Debug.Print(
//000015: Resources.GetString(Resources.StringResources.String1) + (counter++).ToString());
//000016: System.Threading.Thread.Sleep(5000);
//000017: }
//000018: }
//000019:
//000020: }
//000021: }
IL_0004: br.s IL_0036
//000013: while (true) {
IL_0006: nop
//000014: Debug.Print(
IL_0007: ldc.i4 0x4cc
IL_000c: call string HelloWorldMF.Resources::GetString(valuetype HelloWorldMF.Resources/StringResources)
IL_0011: ldloc.0
IL_0012: dup
IL_0013: ldc.i4.1
IL_0014: conv.i8
IL_0015: add
IL_0016: stloc.0
IL_0017: stloc.1
IL_0018: ldloca.s CS$0$0000
IL_001a: call instance string [mscorlib]System.Int64::ToString()
IL_001f: call string [mscorlib]System.String::Concat(string,
string)
IL_0024: call void [Microsoft.SPOT.Native]Microsoft.SPOT.Debug::Print(string)
IL_0029: nop
//000015: Resources.GetString(Resources.StringResources.String1) + (counter++).ToString());
//000016: System.Threading.Thread.Sleep(5000);
IL_002a: ldc.i4 0x1388
IL_002f: call void [mscorlib]System.Threading.Thread::Sleep(int32)
IL_0034: nop
//000017: }
IL_0035: nop
//000013: while (true) {
IL_0036: ldc.i4.1
IL_0037: stloc.2
//000014: Debug.Print(
//000015: Resources.GetString(Resources.StringResources.String1) + (counter++).ToString());
//000016: System.Threading.Thread.Sleep(5000);
//000017: }
//000018: }
//000019:
//000020: }
//000021: }
IL_0038: br.s IL_0006
} // end of method HelloWorldMF::Main
It looks to me like there could be scaling going on at IL_0002 and IL0014. However, converting i4 to i8 does not trigger execution of the floating point section of Convert_Internal.
This does bring up a separate issue. While at OSU, I recall being frustrated that UInt16 C# variables were being converted before different operations, wasting even more cycles. In the above program, it looks like using a long variable causes MSIL instruction conv to be called even though the long is only printed out and incremented.
So, Nathan, I can't replicate this issue with the Interpreter calling floating point functions while executing Hello World. Were you able to use gdb's backtrace to find the offending CLR library?
The problem is the latter. A build with -mfloat-abi=softfp
will hard fault due to a bogus FP instruction. You mentioned HeapBlock, and that in fact is where we hit the skids, specifically from my backtrace:
CLR_INT32 CLR_RT_HeapBlock::Compare_Values( const CLR_RT_HeapBlock& left, const CLR_RT_HeapBlock& right, bool fSigned )
So it sounds like your intuition may be correct.
I too have pondered if we should just go ahead and use the fixed point mode if this is actually causing a performance issue. However I'm not sure of what improvement to expect, if any. But that is a fix or workaround, the real question here is why.
TINYCLR_EMULATED_FLOATING_POINT does not actually build at all. Setting this option just enables a fixed point engine however it won't actually build without additional pre-processor parameters (fixed point width, etc.). It's not even clear if it is actually supported or is some orphaned feature.
EDIT: Just wanted to point out that of course softfp
is not supposed to work because it requires an FPU. This was happening accidentally in master builds for a few days before it was fixed. But if there are no FP calls, then this bug should be moot (or should it? Maybe this is not true). So that trivial bug has potentially exposed something more subtle.
EDIT2: The hard fault occurs before GDB gets to a particular line of code, I suspect the fault hits early on as soon as the instruction gets in the pipeline. But I recall from the asm dump that the first thing that jumped out at me were the presense of VPOP or VPUSH... which now that I think about it may mean that we don't actually have a problem (see edit 3 below).
EDIT3: One possibility of course is that the build error is the only real error and this is all harmless. I want it to remain clear that there is no smoking gun or evidence of a crime here, just something that needs understanding. It could be that the bad build option was including completely unrelated boilerplate code.
I bet the compiler output FPA instructions and the HeapBlock debug code called a floating point conversion routine that triggered the hard fault.
"bad linking or bad redefinition theory" When you say "bogus FP instruction," do you mean a library call, or an ARM ISA opcode? If library call, then this sounds like a conflict between what is passed to the linker as already-defined and what is defined in GCC header files. I think library call problem is that when we were using an older compiler, we were compiling with GCC 4.2 but linking to 4.4 or 4.5's libraries, so there were API mismatches that were defined in TinyCLR header files to allow linking to complete with the assumption these would never be called. Another example of this is TinyCLR including old RVDS libraries and linking against them. I have a vague recollection of running into the problem of a TinyCLR header file defining a missing FP pow or division function instead of calling the math library -- this shows up as an redefinition error when using the next CodeSourcery compiler version.
"bad cleaning theory and SymDef mystery" If you saw a bogus FP instruction when using -mfloat-abi=softfp on the DotNow, then there should have been software linkage to show what the FP library call was. I wonder if there is some combination of target options that did not use -mcpu=cortex-m3, or whether our "clean" build target is missing the removal of a library that would have been built with the Cortex-A8. I get a warning related to creating the ARM SymDefs when switching between compiling for SOC_ADAPT and EmoteDotNow, so perhaps the target that creates the SymDefs doesn't clean properly (although I am under the impression that the SymDefs target is related to compatibility with the ARM linker).
"note on presence of FP types in code"
So HeapBlock + Debug = code called with type double if !defined(TINYCLR_EMULATED_FLOATING_POINT).
Defining TINYCLR_EMULATED_FLOATING_POINT means no CLR code should use type double (for tool chains that do not have a type "double"). In certain cases, this makes code simpler by using INT64 instead of double, including type casts where there aren't any math operations that would require proper fixed point configuration. When TINYCLR_EMULATED_FLOATING_POINT is not defined (as in our builds), then there do exist floating point types in TinyCLR code. Then the question becomes whether the compiler emits floating point operations or the linker connects libraries to softfp calls.
"precedence of compiler flags and default fpu" I recently noticed that the placement of our "-mcpu=Cortex-M3" GCC flag changed. The docs say -mcpu and -march flags make the right-most definition take higher precedence. I don't know about precedence between mcpu and other flags on the command line, although my guess is that other flags override mcpu flags. Look at the output of "arm-none-eabi-gcc -mcpu=cortex-m3 -Q --help=target":
The following options are target specific:
-mabi= aapcs
-mabort-on-noreturn [disabled]
-mapcs [disabled]
-mapcs-float [disabled]
-mapcs-frame [disabled]
-mapcs-reentrant [disabled]
-mapcs-stack-check [disabled]
-march= armv2
-marm [enabled]
-mbig-endian [disabled]
-mcallee-super-interworking [disabled]
-mcaller-super-interworking [disabled]
-mcirrus-fix-invalid-insns [disabled]
-mcpu= cortex-m3
-mfix-cortex-m3-ldrd [enabled]
-mfloat-abi= soft
-mfp16-format= none
-mfp=2
-mfp=3
-mfpe [disabled]
-mfpe=2
-mfpe=3
-mfpu= fpa
-mhard-float
-mlittle-endian [enabled]
-mlong-calls [disabled]
-mpic-register=
-mpoke-function-name [disabled]
-msched-prolog [enabled]
-msingle-pic-base [disabled]
-msoft-float
-mstructure-size-boundary= 0x20
-mthumb [disabled]
-mthumb-interwork [enabled]
-mtls-dialect= gnu
-mtp= auto
-mtpcs-frame [disabled]
-mtpcs-leaf-frame [disabled]
-mtune= [default]
-munaligned-access [enabled]
-mvectorize-with-neon-double [disabled]
-mvectorize-with-neon-quad [enabled]
-mword-relocations [disabled]
-mwords-little-endian [disabled]
Known ARM ABIs (for use with the -mabi= option):
aapcs aapcs-linux apcs-gnu atpcs iwmmxt
Known ARM architectures (for use with the -march= option):
armv2 armv2a armv3 armv3m armv4 armv4t armv5 armv5e armv5t armv5te armv6
armv6-m armv6j armv6k armv6s-m armv6t2 armv6z armv6zk armv7 armv7-a armv7-m
armv7-r armv7e-m ep9312 iwmmxt iwmmxt2 native
Known __fp16 formats (for use with the -mfp16-format= option):
alternative ieee none
Known ARM FPUs (for use with the -mfpu= option):
fpa fpe2 fpe3 fpv4-sp-d16 maverick neon neon-fp16 neon-vfpv4 vfp vfp3 vfpv3
vfpv3-d16 vfpv3-d16-fp16 vfpv3-fp16 vfpv3xd vfpv3xd-fp16 vfpv4 vfpv4-d16
Valid arguments to -mtp=:
auto cp15 soft
Known floating-point ABIs (for use with the -mfloat-abi= option):
hard soft softfp
Known ARM CPUs (for use with the -mcpu= and -mtune= options):
arm1020e arm1020t arm1022e arm1026ej-s arm10e arm10tdmi arm1136j-s
arm1136jf-s arm1156t2-s arm1156t2f-s arm1176jz-s arm1176jzf-s arm2 arm250
arm3 arm6 arm60 arm600 arm610 arm620 arm7 arm70 arm700 arm700i arm710
arm7100 arm710c arm710t arm720 arm720t arm740t arm7500 arm7500fe arm7d
arm7di arm7dm arm7dmi arm7m arm7tdmi arm7tdmi-s arm8 arm810 arm9 arm920
arm920t arm922t arm926ej-s arm940t arm946e-s arm966e-s arm968e-s arm9e
arm9tdmi cortex-a15 cortex-a5 cortex-a7 cortex-a8 cortex-a9 cortex-m0
cortex-m1 cortex-m3 cortex-m4 cortex-r4 cortex-r4f cortex-r5 ep9312 fa526
fa606te fa626 fa626te fa726te fmp626 generic-armv7-a iwmmxt iwmmxt2 mpcore
mpcorenovfp native strongarm strongarm110 strongarm1100 strongarm1110 xscale
TLS dialect to use:
gnu gnu2
Note -mfpu=fpa, so it would be nice to check the instruction Nathan encountered to see if it was an FPA opcode that would normally be implemented (by the OS?) in an illegal instruction fault handler, assuming the -mfloat-abi=softfp overrode the -mcpu=Cortex-M3's -mfloat-abi=soft but left the -mfpu=fpa. Likewise, it would be nice to know if the -mcpu=Cortex-M3 could override earlier flags if placed to the right of the conflicting flags (doubt it but could also be related to transient nature of this issue).
mcpu
does not imply mfpu
or the float-abi
. But I could very well be wrong. But in any case I think this would only apply if we had a library issue, which we hadn't seen yet (but may be present).Lets not worry about the library issue unless we actually encounter it, so far we have not, but we might. Exactly how the bad opcodes were included in the first place is perhaps not so important either, except for the question of if this whole thing is a red herring. This could simply be a case of garbage in -- garbage out.
The real question is if a normal build is trying to do soft
calls. I will trace an execution and find out.
Below is the actual g++ configuration output for EmoteDotNow's target.
VPUSH
and VPOP
may be indicative of FP instructions. Note -marm
is disabled but -mthumb-interwork
is enabled, so FP may be allowed anyway with the interwork option.mcpu
does not need to imply mfpu. My point is that -mfpu
is set by default to fpa
in gcc and g++. So GCC has a default configuration for fpu, even if the flag is not specified. So if you used -mfloat-abi=softfp
, then the toolchain may have used FPA instructions. (Although I think FPA is only compatible with ATPCS, not AAPCS. And VFP should be the default with recent arm architectures.)-mthumb
enabled with -marm
disabled just means the compiler will only output thumb instructions. But the -mthumb-interwork
means the thumb code may call ARM library functions. Although THUMB cannot access VFP registers, the -mfloat-abi=softfp
links against ARM library functions that do call FP instructions.
Also note that we're linking with --no-warn-mismatch
which I recommend we remove as this might have caught the issue. I also recommend we remove the -mthumb-interwork
for Cortex-M3 and verify the assembler is provided a proper cpu configuration. -mthumb-interwork
should be meaningless for architecture arm-v7m, so there's no reason to enable it and invite trouble on the chance that GCC doesn't enforce armv7-m addressing in presence of -mthumb-interwork
.
"c:\CodeSourcery-2013-05\bin\arm-none-eabi-g++.exe" -xc++ -Wno-invalid-offsetof -fcheck-new -fno-threadsafe-statics -ffunction-sections -fdollars-in-identifiers -fno-exceptions -march=armv7-m -mcpu=cortex-m3 -funsigned-char -mstructure-size-boundary=8 -DTINYCLR_ENABLE_SOURCELEVELDEBUGGING -mthumb -DCOMPILE_THUMB2 -Wno-psabi -DVERSION_MAJOR=4 -DVERSION_MINOR=3 -DVERSION_BUILD=0 -DVERSION_REVISION=0 -DOEMSYSTEMINFOSTRING="\"Samraksh Copyright (C) 2014 Samraksh and Microsoft.\"" -DPLATFORM_ARM_STM32F10x -DPLATFORM_ARM_EmoteDotNow -DTARGETLOCATION_FLASH -DLITTLE_ENDIAN -DGCC -mfloat-abi=soft -mlittle-endian -g3 -O0 -DSAM_FORCE_GCC_CMDLINE_OPTS -DDEBUG -D_DEBUG -Ic:\repo\DotNet-MF\MicroFrameworkPK_v4_3\CLR\Core\HeapPersistence -Ic:\repo\DotNet-MF\MicroFrameworkPK_v4_3\DeviceCode\include -Ic:\repo\DotNet-MF\MicroFrameworkPK_v4_3\DeviceCode\Targets\Native\
STM32F10x\DeviceCode\cmsis -Q --help=target
cc1.exe: warning: command line option '-Wno-invalid-offsetof' is valid for C++/ObjC++ but not for C [enabled by default]
cc1.exe: warning: command line option '-fcheck-new' is valid for C++/ObjC++ but not for C [enabled by default]
cc1.exe: warning: command line option '-fno-threadsafe-statics' is valid for C++/ObjC++ but not for C [enabled by default]
The following options are target specific:
-mabi= aapcs
-mabort-on-noreturn [disabled]
-mapcs [disabled]
-mapcs-float [disabled]
-mapcs-frame [disabled]
-mapcs-reentrant [disabled]
-mapcs-stack-check [disabled]
-march= armv7-m
-marm [disabled]
-mbig-endian [disabled]
-mcallee-super-interworking [disabled]
-mcaller-super-interworking [disabled]
-mcirrus-fix-invalid-insns [disabled]
-mcpu= cortex-m3
-mfix-cortex-m3-ldrd [enabled]
-mfloat-abi= soft
-mfp16-format= none
-mfp=2
-mfp=3
-mfpe [disabled]
-mfpe=2
-mfpe=3
-mfpu= fpa
-mhard-float
-mlittle-endian [enabled]
-mlong-calls [disabled]
-mpic-register=
-mpoke-function-name [disabled]
-msched-prolog [enabled]
-msingle-pic-base [disabled]
-msoft-float
-mstructure-size-boundary= 0x8
-mthumb [enabled]
-mthumb-interwork [enabled]
-mtls-dialect= gnu
-mtp= auto
-mtpcs-frame [disabled]
-mtpcs-leaf-frame [disabled]
-mtune= [default]
-munaligned-access [enabled]
-mvectorize-with-neon-double [disabled]
-mvectorize-with-neon-quad [enabled]
-mword-relocations [disabled]
-mwords-little-endian [disabled]
Known ARM ABIs (for use with the -mabi= option):
aapcs aapcs-linux apcs-gnu atpcs iwmmxt
Known ARM architectures (for use with the -march= option):
armv2 armv2a armv3 armv3m armv4 armv4t armv5 armv5e armv5t armv5te armv6
armv6-m armv6j armv6k armv6s-m armv6t2 armv6z armv6zk armv7 armv7-a armv7-m
armv7-r armv7e-m ep9312 iwmmxt iwmmxt2 native
Known __fp16 formats (for use with the -mfp16-format= option):
alternative ieee none
Known ARM FPUs (for use with the -mfpu= option):
fpa fpe2 fpe3 fpv4-sp-d16 maverick neon neon-fp16 neon-vfpv4 vfp vfp3 vfpv3
vfpv3-d16 vfpv3-d16-fp16 vfpv3-fp16 vfpv3xd vfpv3xd-fp16 vfpv4 vfpv4-d16
Valid arguments to -mtp=:
auto cp15 soft
Known floating-point ABIs (for use with the -mfloat-abi= option):
hard soft softfp
Known ARM CPUs (for use with the -mcpu= and -mtune= options):
arm1020e arm1020t arm1022e arm1026ej-s arm10e arm10tdmi arm1136j-s
arm1136jf-s arm1156t2-s arm1156t2f-s arm1176jz-s arm1176jzf-s arm2 arm250
arm3 arm6 arm60 arm600 arm610 arm620 arm7 arm70 arm700 arm700i arm710
arm7100 arm710c arm710t arm720 arm720t arm740t arm7500 arm7500fe arm7d
arm7di arm7dm arm7dmi arm7m arm7tdmi arm7tdmi-s arm8 arm810 arm9 arm920
arm920t arm922t arm926ej-s arm940t arm946e-s arm966e-s arm968e-s arm9e
arm9tdmi cortex-a15 cortex-a5 cortex-a7 cortex-a8 cortex-a9 cortex-m0
cortex-m1 cortex-m3 cortex-m4 cortex-r4 cortex-r4f cortex-r5 ep9312 fa526
fa606te fa626 fa626te fa726te fmp626 generic-armv7-a iwmmxt iwmmxt2 mpcore
mpcorenovfp native strongarm strongarm110 strongarm1100 strongarm1110 xscale
TLS dialect to use:
gnu gnu2
I performed an exhaustive trace using the "Hello World" C# app.
Result:
Specifically, we were hitting:
ldivmod
uidiv
uidivmod
Of these the first is the "real" offender, the others are acting as helper functions.
The MF architecture is to keep time as a global 64-bit variable. All of the 64-bit division operations I traced were due to computations related to time and occurred in the Time PAL driver. It turns out that all MF time operations will usually result in at least one, usually several 64-bit divisions. This happens in large part because MF time is normalized against a fictional 10 MHz clock and normalization involves 64-bit division. We hit this penalty even in a simple GetTime().
For example:
return GetMachineTime() / TIMEUNIT_TO_MILLISECONDS;
and
time = (time * m_Ticks_b + m_Ticks_c) / m_Ticks_a;
where time and GetMachineTime() are 64-bit.
As for how much this hurts:
Conclusions:
TODO: It probably does behoove us to look into this a little more and to consider if we can do anything to reduce the impact of the situation. I will do some thinking on this. In the meantime leaving this bug open for comment.
After some thought, I don't see a good solution that doesn't involve entirely re-writing the way MF handles time. Doing so would be very tricky since Time is not just an under-the-hood concept and any changes would have implications all the way up the foodchain to the .NET interface specs and C# user code. Although this problem is an annoyance, IMO it does not rise to the level of making such a re-write a reasonable recourse.
At the PAL layer and above this issue is for now unavoidable.
At the HAL layer, driver writers should only work with time in native ticks. This means driver writers should use HAL_Time* functions such as HAL_Time_CurrentTicks() and NOT use any of the PAL layer Time* fucnctions such as Time_GetMachineTime(). We sort of already knew this anyway and nobody is running afoul of this to my knowledge.
I am closing the bug. The issue has been investigated and I don't anticipate further movement.
The other day I noticed that even in our basic Hello World app, the interpreter in TinyCLR was trying to call floating point library functions. Would be good to get a better understanding of the situation. Again this is with the interpreter, not one of our drivers or other implementations.
My feeling is that FP should not be something that just happens...