Addon not at performance target in getDuplicateVerticies

tngreene commented 6 years ago

Currently a small plane like the BD5J microjet is taking about 4 minutes to export. We would like to get it down where even large plans only take about 15 seconds at most.

See new Optimize label for related bugs and feature requests

tngreene commented 6 years ago

Currently the biggest time sink appears to be the repetitive string concatenation in the write methods, particularly in xplane_mesh.write since there are so many of them.

der-On commented 6 years ago

@tngreene I guess this is because string concatenation allocates new memory each time. Maybe we could get around this using lists of strings that get joined into strings at the very last resort? This should prevent double allocation of strings as we just store references to them in the lists?

tngreene commented 6 years ago

So: By using an array object from the array module instead of a string, not attempting to trim trailing 0's and not attempting to int-fy int like numbers, we have been able to make great strides!

A 3 minute and 14 second export in 14 seconds!

The not removing the trailing 0's, however, did increase the file size at a certain point. From roughly 60MB to 76MB.

Some questions: How costly is striping 0s?

der-On commented 6 years ago

If possible you could try to strip the zeros after everything was joined into a string using a regexp. Regexp are known to be really fast even on huge chunks of text. However they must be carefully crafted to not break things.

If a regexp wont work in all situations you could try to write a smarter search replace that can handle the edge cases and removes the zeros by first splitting the text along duplicate or trailing zeros into an array and then investigating the loose ends and eventually removing useless zeros from one or the other end and then joining the thing together again after all loose ends have been investigated.

Actually creating a regexp that turns X.000 into X should be simple. And one that turns X.YZ00 into X.YZ also does not sound like rocket sience.

tngreene commented 6 years ago

True. I'll try compiled regex first. I'm worried the more branching if statements else statements there are the more slow it will get with branch prediction failures. Given the Python community's near militant "Zen Of Python" mindset I'm having trouble getting definitive answers on how to optimize python code that is recent.

tngreene commented 6 years ago

Status update: xplane_mesh's write method has been well optimized using arrays, list comprehensions, minimizing the amount of string concatenations. The current issues are:

By removing the call to floatToStr, we have trailing 0s that can increase the size of the OBJ by megabytes in size.
This was never tested with "Optimize" on, because obviously that is slower and we want to remove it anyway
collectXPlaneObject appears to be the next highest time sink to go after, spending roughly 40% of the execution in there. The big things in there are XPlaneMesh.faceToTrianglesWithUV and the many many many calls to XPlaneFace's init function.
Inside faceToTrianglesWithUV we make a lot of calls to appending dictionaries to a list of triangles.
So far, all we have is Dan's 100 sphere test and the BD5J microjet to play with. All of these tests have been preformed on those two.

tngreene commented 6 years ago

The calls to XPlaneFace is basically a struct that's only used in xplane_mesh's collectXPlaneObjects method.

See line 78,106 through 150

        faces = []
...
                    for f in tempfaces:
                        xplaneFace = XPlaneFace()
...
                            xplaneFace.vertices[i] = (vert[0], vert[1], vert[2])
                            xplaneFace.normals[i] = (vert[3], vert[4], vert[5])
                            xplaneFace.uvs[i] = (vert[6], vert[7])
                            xplaneFace.indices[i] = index
...
                        faces.append(xplaneFace)

We could replace XPlaneFace with a dictionary or even a named tuple or something like that and it might be faster than bringing all the baggage of a class with it.

tngreene commented 6 years ago

I tried the trick of

append = triangles.append to remove the . look up but I don't think it caused statistically significant or repeateable performance increases. Perhaps with enough dot lookups remove it would show some fractions of a second off on a consistent basis, but otherwise, I think its a micro-optimization too small to count on.

tngreene commented 6 years ago

This is a copy of basically all the methods I tried to speed this up with. In the end a medium amount of "Just the most stupid thing" was the right choice. Seconds represent total cumulative export time using performance_test_sphere_no_anim.test.blend

'''
# Create all data without stripping, use a tiny FSM to copy backwards only what we care about,
# reverse and return
# Did not wait for it too finish.
if debug:
    vt_fmt = "VT\t%.8f\t%.8f\t%.8f\t%.8f\t%.8f\t%.8f\t%.8f\t%.8f\t# %d\n"
else:
    vt_fmt = "VT\t%.8f\t%.8f\t%.8f\t%.8f\t%.8f\t%.8f\t%.8f\t%.8f\n"

b = bytearray()
if debug:
    b += b''.join([bytes(vt_fmt % (*vt_array[i:i+8],i//8),"utf-8") for i in range(0,len(vt_array),8)])
else:
    b += b''.join([bytes(vt_fmt % (*vt_array[i:i+8],),"utf-8") for i in range(0,len(vt_array),8)])
bf = bytearray()
take_chars = False
i = len(b)-1
while i >= 0:
    print(i)
    c = chr(b[i])
    #print("c: " + c + "(" + str(b[i]) + ")")

    if take_chars is False and (c != '0' and c != '.'):
        take_chars = True

    if take_chars:
        bf.append(b[i])
    else:
        pass
        #print("skipping: " + c)

    if c == '\t' or\
       c == '\n':
       #chr(b[i]) == '\n' or\
       #chr(b[i]) == '#'  or\
       #chr(b[i]) == 'T'  or\
       #chr(b[i]) == 'V':
       take_chars = False

    i -= 1

print("end XPlaneMesh.writeVertices " + str(time.perf_counter()-start))
#print(b.decode())
return bf.decode()[::-1]
'''
'''
60 seconds-ish
b = bytearray()
line = 1
for i,component in enumerate(vt_array):
    if i% 8 == 0:
        if i == 0:
            b += b"VT"
        else:
            line += 1
            if debug:
                b += bytes("\t# %d\n" % (line),"utf-8")
            else:
                b += b"\n"
    #if component.is_int():
    #    b += bytes("\t%d" % int(component)).encode()
    #else:
    if True:
        sb = bytes("\t%.8f" % component,"utf-8").rstrip(b'0')
        if sb[-1] == 46:# '.':
            b += sb+b'0'
        else:
            b += sb
'''

'''
#Use of map instead of second for loop
def format_comp(component):
    sb = bytes("\t%.8f" % component,"utf-8").rstrip(b'0')
    if sb[-1] == 46:# '.':
        return sb+b'0'
    else:
        return sb

#63 seconds
b = bytearray()
for line in range(len(vt_array)//8):
    b += b"VT"
    b += b''.join(map(format_comp,vt_array[line:line+8]))

    if debug:
        b += bytes("\t# %d\n" % line,"utf-8")
    else:
        b += b"\n"
'''
'''
#Loop through every line, format it's 8 components, use rstrip, if statement for 10.00000000->10.0
#56-64 seconds
b = bytearray()
for line in range(len(vt_array)//8):
    b += b"VT"
    for component in  vt_array[line:line+8]:
        sb = bytes("\t%.8f" % component,"utf-8").rstrip(b'0')
        if sb[-1] == 46:#'.':
            b += sb
            b += b'0'
        else:
            b += sb

    if debug:
        b += bytes("\t# %d\n" % line,"utf-8")
    else:
        b += b"\n"
'''

'''
# Use of regular expresion to extract all trailing 0's after the first place of the decimal
# so 10.00000000 becomes 10.0
# compiling regex was performed outside of writeVertices, but is not put here to keep the code together
# 72 seconds
r = re.compile("\t([-]?\d+\.(\d+?))(?:0*$)")
b = bytearray()
for line in range(len(vt_array)//8):
    b += (b"VT")
    for component in  vt_array[line:line+8]:
        #b += b'\t'
        s = "\t%.8f" % component
        b += bytearray(r.match(s).group(1),"utf-8")

    if debug:
        b += bytearray(("\t# %d\n" % line),"utf-8")
    else:
        b += (b"\n")
'''

'''
# List comprehension, no 0 stripping, using string for appending (for no other reason than I hadn't thought of
# it at this point). Using bytearray might take .5 to 1 second off.
# 34-40 seconds
if debug:
    o += ''.join([vt_fmt % (*vt_array[i:i+8],i//8) for i in range(0,len(vt_array),8)])
else:
    o += ''.join([vt_fmt % (*vt_array[i:i+8],) for i in range(0,len(vt_array),8)])

'''

tngreene commented 6 years ago

If we want to push this harder we will definitely need to start looking into cython, which is not that bad and works with Blender!

tngreene commented 6 years ago

Calling it done for now at 331d8304f8

tngreene commented 5 years ago

We should also look at using slots for XPlaneMesh and some of the other classes we make a billion of. It could provide an easy attribute access speed boost. See Usage Of Slots

tngreene commented 4 years ago

Another much less difficult thing than full blown cython is the use of ctypes to access those C native functions I wanted so badly. I didn't know it was an option.

tngreene commented 4 years ago

@kbrandwijk Speaking of computers and times, I have a spreadsheet of the time it takes to export the BD-5J microjet with optimize off (of course, now it doesn't matter), and what OS and CPU they have.

All completed using 2.79, x64 bit OS, print debug info on, optimize off (unless noted. Time is in seconds

.blend	Verts	ANIM_begin count	2.49 Time, PC	2.49 Time, Mac	2.79 Time, PC	2.79 Time Mac	Vert/Sec (PC)	Vert/Sec (Mac)	CPU, PC	CPU, Mac	Notes:
BD-5J Microjet (Ted)	76,983	237	N/A	N/A	25	?	3079.320	#VALUE!	i7-6700HQ @ 2.60GHz

Could you fill in those values as well?

tngreene commented 4 years ago

Also, I realize that this and #273 are almost entirely the same bug. Don't know how that happened.

kbrandwijk commented 4 years ago

Sure, I'll fill those in. Does it write those timing values anywhere? I added my own instrumentation to the relevant methods for debugging, but I would like to report the same numbers you're using.,

kbrandwijk commented 4 years ago

Also, I really like these performance challenges, so I'm more than willing to make that my first focus point.

tngreene commented 4 years ago

It isn't scientific. For my computer they come from PyVmMonitor. I asked other people use the "literal stopwatch" method of starting a stop watch when they click go and stopping it when the UI becomes responsive again - which from a users perspective is when it is really done.

With no way to compare CPUs, so few data points, it doesn't amount much more to trivia. Maybe one day it'll be more useful.

kbrandwijk commented 4 years ago

Using the 'stopwatch' method for now:

.blend	Verts	ANIM_begin count	2.49 Time, PC	2.49 Time, Mac	2.79 Time, PC	2.79 Time Mac	Vert/Sec (PC)	Vert/Sec (Mac)	CPU, PC	CPU, Mac	No
BD-5J Microjet (Ted)	76,983	237	N/A	N/A	10	?			i9-9900KF @ 5.00GHz

tngreene commented 4 years ago

Wow, and that was still optimize off? You certainly have a better computer than most people I've asked. Most people have an i7 between 2.30Ghz and 3.6GHz

kbrandwijk commented 4 years ago

Yes, with optimize off. And that's why it's important I can accurately measure relative performance of all parts of the (export) process, because the absolute values are, well, low on my machine.

tngreene commented 4 years ago

Related to a comment on ... somewhere, lost track ... about XPlaneKeyframe.init taking up too much time. I agree, but the API is quite involved.

XPlaneFace's init call should be much easier to remove since it has so few uses. Python is so bad at POD... and then I see that it is barely .4 seconds of time. Maybe we can justify that little journey as a memory saving technique not to spawn all those XPlaneFace objects when a named tuple would suffice.

Maybe adding slots to XPlaneKeyframe would help something, not speed unfortunately but memory foot print? We do make a lot of them. Its an easy fix.

kbrandwijk commented 4 years ago

Yeah that's the other ticket (#462). XPlaneFace init didn't bring me any performance gain. Not by introducing slots there, and not by getting rid of it altogether.

I'm not into the ms level yet, when I can still save seconds. I'll comment in the other ticket, to keep things slightly organized.

tngreene commented 4 years ago

https://github.com/X-Plane/XPlane2Blender/pull/464 Can be applied later when I have time to investigate if changing the rounding method changes author's work. It looks promising. I'll get to eventually.

This code has been included in Blender v4.0.0 for a bit now. Very happy with it.

tngreene commented 4 years ago

I don't know where else to put this note without opening a whole new bug over it: I wanted to try and speed up floatToStr because it gets called so much. I tried ctypes and it didn't give me the level of performance I thought it would, in fact, it was much slower than expected. See

https://stackoverflow.com/questions/61231308/python-ctypess-sprintf-formats-any-float-type-as-b0-000000-or-b5-25662e-315

X-Plane / XPlane2Blender

Addon not at performance target in getDuplicateVerticies #347