apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[uTVM][Runtime] Deprecate uTVM Standalone Runtime #5060

Open liangfu opened 4 years ago

liangfu commented 4 years ago

Since the MISRA-C runtime has been merged in PR #3934 and discussed in RFC #3159, I think it is now time to deprecate the uTVM standalone runtime (introduced in PR #3567).

Rationale

Actionable Items

Please leave your comment.

cc @areusch

tqchen commented 4 years ago

Cross-posting to here. I think it is worth thinking about the memory allocation strategy. Specifically, we should design an API that contains a simple allocator (arena-like: it allocates memory from a stack and releases everything once done), and use that allocator for all memory in the program (including data structures and tensors). This will completely eliminate the use of system calls and allow the program to run on bare metal.

Example API:

```c
// use a system call to get the memory, or directly point to memory segments on the microcontroller
UTVMAllocator* arena = UTVMCreateArena(10000);
// subsequent data structures are allocated from the allocator;
// the free calls recycle data into the allocator
// (the simplest strategy is not to recycle at all)
UTVMSetAllocator(arena);

// normal TVM API calls
```
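As a rough illustration of this idea, a bump-pointer arena along these lines might look like the following (a sketch only: the `UTVMArena` type and `utvm_arena_*` names are hypothetical, not existing TVM APIs):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump-pointer arena: all allocations come from one fixed
 * buffer and there is no per-allocation free, matching the "simplest
 * strategy is not to recycle at all" idea above. */
typedef struct {
    uint8_t *base;   /* start of the backing memory segment */
    size_t   size;   /* total capacity in bytes */
    size_t   used;   /* bump-pointer offset */
} UTVMArena;

static void utvm_arena_init(UTVMArena *a, void *mem, size_t size) {
    a->base = (uint8_t *)mem;
    a->size = size;
    a->used = 0;
}

/* Returns NULL on exhaustion instead of calling into the OS,
 * so the runtime stays bare-metal friendly. */
static void *utvm_arena_alloc(UTVMArena *a, size_t nbytes) {
    size_t aligned = (nbytes + 7u) & ~(size_t)7u;  /* 8-byte alignment */
    if (a->used + aligned > a->size) return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Releasing everything at once is a single pointer reset. */
static void utvm_arena_reset(UTVMArena *a) { a->used = 0; }
```

The backing buffer can itself be a static array or a fixed memory segment on the microcontroller, so no system call is ever needed.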
tmoreau89 commented 4 years ago

@liangfu regarding "superseding uTVM standalone runtime", will MISRA-C runtime support running on bare-metal systems?

tmoreau89 commented 4 years ago

@ajtulloch @weberlo @u99127 (this might be of interest to you)

liangfu commented 4 years ago

> @liangfu regarding "superseding uTVM standalone runtime", will MISRA-C runtime support running on bare-metal systems?

Yes, at least it is intended to. But how shall we provide a proper demo of this? Any ideas?

tmoreau89 commented 4 years ago

We can test it on the STM board that @weberlo implemented a demo on: https://github.com/apache/incubator-tvm/pull/4274

liangfu commented 4 years ago

Excellent idea. Perhaps we can also test the bare-metal demo in CI, with a simple RISC-V processor like picorv32.

KireinaHoro commented 4 years ago

> Cross-posting to here. I think it is worth thinking about the memory allocation strategy. Specifically, we should design an API that contains a simple allocator (arena-like: it allocates memory from a stack and releases everything once done), and use that allocator for all memory in the program (including data structures and tensors). This will completely eliminate the use of system calls and allow the program to run on bare metal.

@tqchen Removing all external allocator use and going with an embedded arena allocator sounds a little bit fishy. Bare-metal platforms do not necessarily lack a proper allocator; newlib, for example, provides a pretty usable dlmalloc implementation. Are there any other concerns?

liangfu commented 4 years ago

In PR #5124, we have a reference allocator that implements vmalloc, vrealloc, and vfree. When necessary, I think we can redirect these function calls to different implementations, e.g. dlmalloc in newlib, jemalloc, and many others.

I would agree with @KireinaHoro on using the implementations in newlib for bare-metal applications.

As for an arena-like allocator, my concern is how we would handle reusing large memory between conv layers if we don't release allocated workspaces in a timely manner.
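One possible way to make the allocator redirectable, as suggested above, is a function-pointer table that defaults to the C library allocator but can be repointed by a platform port. The `UTVMAllocatorOps` struct and `utvm_*` wrappers below are illustrative assumptions, not the actual PR #5124 interface:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical indirection layer: the runtime calls through these
 * pointers, and a platform port can aim them at newlib's
 * malloc/realloc/free, jemalloc, or a bare-metal arena. The names
 * follow the vmalloc/vrealloc/vfree functions mentioned above. */
typedef struct {
    void *(*vmalloc)(size_t nbytes);
    void *(*vrealloc)(void *ptr, size_t nbytes);
    void  (*vfree)(void *ptr);
} UTVMAllocatorOps;

/* Default to the C library allocator (e.g. newlib's dlmalloc). */
static UTVMAllocatorOps g_alloc = { malloc, realloc, free };

/* Swap in a different implementation at startup. */
void utvm_set_allocator(const UTVMAllocatorOps *ops) { g_alloc = *ops; }

void *utvm_malloc(size_t n)           { return g_alloc.vmalloc(n); }
void *utvm_realloc(void *p, size_t n) { return g_alloc.vrealloc(p, n); }
void  utvm_free(void *p)              { g_alloc.vfree(p); }
```

With this shape, the runtime code only ever calls `utvm_malloc`/`utvm_free`, and the choice between newlib, jemalloc, or an arena becomes a one-line configuration at initialization.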

tqchen commented 4 years ago

The workspace memory could use a different strategy: we create a separate arena for workspaces, along with a counter.

This works because all workspace memory is temporary. It also guarantees constant-time allocation.

As a generalization: if most memory allocation follows an RAII-style lifecycle, e.g. everything is deallocated once we exit a scope, then the counter-based (per-scope) strategy should work pretty well.

I am not fixated on the arena allocator, but I would like to challenge us to think about how much simpler we can make the allocation strategy, given what we know about the workload. Of course, we could certainly bring in sub-allocator strategies that are more complicated, or fall back to libraries when needed.
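A minimal sketch of the counter-based workspace arena described above (all names here are hypothetical): allocation is a constant-time bump, a free only decrements a live-allocation counter, and the whole arena is recycled once the counter returns to zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter-based workspace arena. Workspace memory is
 * temporary, so individual blocks never need to be reclaimed: when the
 * last outstanding allocation is freed, the entire arena is reset and
 * the next layer can reuse all of it. */
typedef struct {
    uint8_t *base;
    size_t   size;
    size_t   used;
    size_t   live;   /* number of outstanding workspace allocations */
} WorkspaceArena;

static void *workspace_alloc(WorkspaceArena *a, size_t nbytes) {
    size_t aligned = (nbytes + 7u) & ~(size_t)7u;  /* 8-byte alignment */
    if (a->used + aligned > a->size) return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    a->live += 1;
    return p;
}

static void workspace_free(WorkspaceArena *a, void *ptr) {
    (void)ptr;                 /* individual blocks are not reclaimed */
    if (a->live > 0 && --a->live == 0)
        a->used = 0;           /* all temporaries gone: reuse everything */
}
```

This addresses the conv-layer reuse concern above: as long as each layer frees its workspaces before the next layer allocates, the counter reaches zero between layers and the same buffer is reused, so the peak footprint is one layer's workspace rather than the sum over all layers.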

u99127 commented 4 years ago

Thanks for pointing this to me @tmoreau89 and thank you for this work @liangfu . Very interesting and good questions to ask.

From a design point of view for micro-controllers, I'd like to take this one step further and challenge folks to think about whether this can be achieved with static allocation rather than any form of dynamic allocation. The hypothesis is that at compile time one would know how much temporary space is needed between layers, rather than having to face a run-time failure.

Dynamic allocation on micro-controllers suffers from fragmentation issues; moreover, do we even want dynamic allocation in the runtime on micro-controllers? Further, the model being executed will be part of a larger application: how can we allow our users to specify the amount of heap available to, or consumed by, their model? It would be better to provide that through diagnostics at link time or compile time rather than at runtime. @mshawcroft might have more to add. And yes, in our opinion one of the challenges for micro-controllers is the availability and usage of temporary storage for the working-set calculations between layers.
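For illustration, a fully static scheme might look like the following, assuming the compiler can emit the peak inter-layer workspace requirement as a constant (the macros, symbol names, and budget check below are hypothetical, not an existing TVM interface):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant emitted by the model compiler: the peak
 * temporary space needed between layers, known at compile time. */
#define MODEL_WORKSPACE_BYTES 4096

/* The workspace lives in .bss as a link-time symbol, so the linker map
 * shows exactly how much RAM the model consumes and no heap is used. */
static uint8_t g_model_workspace[MODEL_WORKSPACE_BYTES]
    __attribute__((aligned(8)));

/* Hypothetical application-level RAM budget; exceeding it fails the
 * build instead of failing at runtime. */
#define APP_HEAP_BUDGET_BYTES 8192
_Static_assert(MODEL_WORKSPACE_BYTES <= APP_HEAP_BUDGET_BYTES,
               "model workspace exceeds the configured budget");

/* Generated code addresses temporaries by precomputed offsets. */
static void *workspace_at(size_t offset) {
    return offset < MODEL_WORKSPACE_BYTES ? g_model_workspace + offset
                                          : NULL;
}
```

Because the buffer size and all offsets are compile-time constants, there is no fragmentation and no run-time allocation failure path; an oversized model is caught by the `_Static_assert` (or by the linker) at build time.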

Two further design questions:

  1. In the micro-controller world, supporting every new device, with their differing memory maps and whatnot, will be painful. Beyond one simple reference implementation, I don't think we have an efficient route to deployment other than integrating with other platforms in the micro-controller space. How would this runtime integrate with platforms like Zephyr, mbedOS or FreeRTOS?

  2. I'd be interested in extending CI with QEMU or some such for Cortex-M as well, or indeed with the STM board that you are using, @tmoreau89.

Purely a nit, but from a rationale point of view I would say that the uTVM runtime not being tested in CI is technical debt :)

regards Ramana

tqchen commented 4 years ago

re: the fragmentation issue, I think choosing the allocation strategy carefully and adopting an arena-style allocator (counter-based, as above) can likely resolve fragmentation. As for the total memory cost, for simple graph programs we can indeed find out the cost at compile time.

liangfu commented 4 years ago

It's very interesting to see that TFLite is using an arena-like allocator for micro-controllers. See how Adafruit demonstrates its PyBadge board with TFLite here.

tqchen commented 4 years ago

@liangfu can you try an arena-based approach, given that it is simpler? We could adopt the counter-based approach to enable early freeing of sub-arenas (when the free counter in an arena decreases to zero, we can free that space).

liangfu commented 4 years ago

Sure, this is definitely the direction we should follow, so I can do that. Maybe we need a separate PR for the arena allocator feature.

Robeast commented 4 years ago

Hi @liangfu is there any update on your current implementation efforts? We are really looking forward to it!!

liangfu commented 4 years ago

Hi @Robeast, thanks for your interest. I only have a draft version of the new allocator for now; I intend to send a PR this week.

masahi commented 2 years ago

Can we close this?