cheney-lin / articals


QEMU #5

Open cheney-lin opened 6 years ago

cheney-lin commented 6 years ago

How QEMU manages the virtual machine's address space

http://blog.csdn.net/beckdon/article/details/50110265

http://www.9i9icenter.com/huanqiuNews/59a4cfb1f5dcd25745790c0e

cheney-lin commented 6 years ago

QEMU memory management: how the FlatView memory topology is generated

This post builds on the analysis in 六六哥's blog and corrects some errors in it (link to the original post).

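For background on MemoryRegion, see the document memory.txt in the QEMU source tree. MemoryRegion and FlatView are the two forms QEMU uses to organize the memory layout; for how they relate, see 《MemoryRegion模型原理,以及同FlatView模型的关系》 (The MemoryRegion model and its relationship to the FlatView model).

Whenever an mr changes, QEMU has to notify KVM so that KVM can update its GPA mapping. QEMU first converts the tree-shaped memory layout into a flat address space, which keeps the handling on the KVM side simple. The entry function for this flattening is: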

/*
* Flatten the memory managed by mr into a FlatView and return it
*/
static FlatView *generate_memory_topology(MemoryRegion *mr)
{
    FlatView *view;

    /* Allocate and initialize a new FlatView */
    view = g_new(FlatView, 1);
    flatview_init(view);

    if (mr) {
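        /*
         * Flatten the memory managed by mr into the FlatView, returned through view.
         * addrrange_make(int128_zero(), int128_2_64()) describes a guest address range that starts at GPA 0 and has size 2^64, i.e. the guest's entire physical address space.
         */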
        render_memory_region(view, mr, int128_zero(),
                             addrrange_make(int128_zero(), int128_2_64()), false);
    }

    /* Simplify the FlatView: merge every pair of FlatRanges in the view that can be merged */
    flatview_simplify(view);

    return view;
}
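The last step, flatview_simplify, is easier to picture with a small model. The following is a minimal sketch, not QEMU's code: `Range`, `can_merge` and `simplify` are hypothetical stand-ins that use plain `uint64_t` instead of Int128, and the real merge check compares more attributes than the ones shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for QEMU's FlatRange. */
typedef struct {
    int      mr_id;            /* which region backs this range */
    uint64_t start;            /* start GPA */
    uint64_t size;             /* length in bytes */
    uint64_t offset_in_region; /* offset into the backing region */
    bool     readonly;
} Range;

/* Two neighbours can merge when they are contiguous both in guest
 * address space and in the backing region, and share attributes. */
static bool can_merge(const Range *a, const Range *b)
{
    return a->start + a->size == b->start
        && a->mr_id == b->mr_id
        && a->offset_in_region + a->size == b->offset_in_region
        && a->readonly == b->readonly;
}

/* Merge mergeable neighbours in place; returns the new element count.
 * The input must already be sorted by start address. */
static unsigned simplify(Range *r, unsigned n)
{
    unsigned out = 0;

    assert(r != NULL || n == 0);
    for (unsigned i = 0; i < n; i++) {
        if (out > 0 && can_merge(&r[out - 1], &r[i])) {
            r[out - 1].size += r[i].size;   /* extend the previous range */
        } else {
            r[out++] = r[i];                /* keep as a separate range */
        }
    }
    return out;
}
```

Given two back-to-back ranges of the same region followed by a range of a different region, `simplify` leaves two entries: one covering both halves of the first region, and the unrelated one.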

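The core function is render_memory_region. It is hard to follow because none of its variables are commented, which leaves the reader guessing. There are a few blog posts online about QEMU memory management, but all of them contain errors to some degree. I think 六六哥's blog is the best of them; unfortunately some of its figures are gone. I will try to restore them so that it is easier to follow.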

First, here is my understanding of the variables this post refers to. Where I was not sure what a variable means, I left it uncommented.

struct MemoryRegion {                                                                                                                                                  
    Object parent_obj;                                                                                                                                                 

    /* All fields are private - violators will be prosecuted */                                                                                                        

    /* The following fields should fit in a cache line */                                                                                                              
    bool romd_mode;                                                                                                                                                    
    bool ram;                                                                                                                                                          
    bool subpage;                                                                                                                                                      
    bool readonly; /* For RAM regions */                                                                                                                               
    bool rom_device;                                                                                                                                                   
    bool flush_coalesced_mmio;                                                                                                                                         
    bool global_locking;                                                                                                                                               
    uint8_t dirty_log_mask;                                                                                                                                            
    bool is_iommu;                                                                                                                                                     
    RAMBlock *ram_block;      // points to the mr's RAMBlock if the mr has memory allocated, otherwise NULL
    Object *owner;                                                                                                                                                   

    const MemoryRegionOps *ops;        // callbacks for operations on the mr, e.g. read and write
    void *opaque;                                                                                                                                                      
    MemoryRegion *container;                                                                                                                                           
    Int128 size;         // size of the mr
    hwaddr addr;      // offset relative to the parent mr; start GPA = base + addr (search the source for where it is initialized to see why)
    void (*destructor)(MemoryRegion *mr);                                                                                                                              
    uint64_t align;                                                                                                                                                    
    bool terminates;                                                                                                                                                   
    bool ram_device;                                                                                                                                                   
    bool enabled;                                                                                                                                                      
    bool warning_printed; /* For reservations */                                                                                                                       
    uint8_t vga_logging_count;                                                                                                                                         
    MemoryRegion *alias;     // if this mr is an alias mr, points to the real mr; otherwise NULL
    hwaddr alias_offset;        // if this mr is an alias mr, the offset into the real mr
    int32_t priority;                                                                                                                                                  
    QTAILQ_HEAD(subregions, MemoryRegion) subregions;                                                                                                                  
    QTAILQ_ENTRY(MemoryRegion) subregions_link;                                                                                                                        
    QTAILQ_HEAD(coalesced_ranges, CoalescedMemoryRange) coalesced;                                                                                                     
    const char *name;                                                                                                                                                  
    unsigned ioeventfd_nb;                                                                                                                                             
    MemoryRegionIoeventfd *ioeventfds;                                                                                                                                 
};                                                                                                                                                                     
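The comment on addr (start GPA = base + addr) can be illustrated with a small model. This is a minimal sketch under stated assumptions: `Region` and `region_start_gpa` are hypothetical names, not QEMU APIs, and the struct keeps only the two fields that matter here. render_memory_region accumulates the same sum in base while it recurses down the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for MemoryRegion: only the two
 * fields needed to compute an absolute address. */
typedef struct Region {
    const char    *name;
    uint64_t       addr;       /* offset relative to the container */
    struct Region *container;  /* parent region, NULL for the root */
} Region;

/* Start GPA = sum of addr along the container chain up to the root. */
static uint64_t region_start_gpa(const Region *r)
{
    uint64_t gpa = 0;

    assert(r != NULL);
    for (; r != NULL; r = r->container) {
        gpa += r->addr;
    }
    return gpa;
}
```

As an example, the info mtree output later in this post lists msix-pba at 0xfebf1800 inside virtio-scsi-pci-msix at 0xfebf1000, inside pci and system which both start at 0. Assuming relative offsets of 0x800, 0xfebf1000, 0 and 0 (inferred from those absolute addresses, since mtree does not print the offsets themselves), `region_start_gpa` on msix-pba gives 0xfebf1800.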
/* Render a memory region into the global view.  Ranges in @view obscure
* ranges in @mr.
*/
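/*
* render (dictionary senses): to cause; to put forward; to carry out; to color; to give in return
*/
/*
* Within the address range given by clip, turn the memory managed by mr into FlatRanges one by one and add them all to the FlatView
*
* @view:           the view being built
* @mr:              the mr to flatten
* @clip:            the memory is rendered inside the range given by clip; on the first call clip covers the whole physical address space
* @base:          can be understood as the start GPA of the parent mr
* @readonly:     read/write attribute (true once any enclosing mr is read-only)
*/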
static void render_memory_region(FlatView *view,
                                 MemoryRegion *mr,
                                 Int128 base,
                                 AddrRange clip,
                                 bool readonly)
{
    MemoryRegion *subregion;
    unsigned i;
    hwaddr offset_in_region;     /* offset into the memory that backs the region */
    Int128 remain;      /* length still to be rendered */
    Int128 now;      /* length rendered or skipped in the current step */
    FlatRange fr;
    AddrRange tmp;

    if (!mr->enabled) {
        return;
    }

     /* Move base to the start address of this mr;
     addr is the offset of the subregion within its parent mr
     */
    int128_addto(&base, int128_make64(mr->addr));
    readonly |= mr->readonly;

     /* tmp is the physical address range covered by mr */
    tmp = addrrange_make(base, mr->size);

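     /* If mr does not overlap clip there is nothing to render; otherwise narrow clip to the intersection of mr's range and clip */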
    if (!addrrange_intersects(tmp, clip)) {
        return;
    }

    clip = addrrange_intersection(tmp, clip);

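     /* If this is an alias mr, render the mr it points to instead */
  // From these lines: alias start GPA = base passed to the recursive call + addr of the target mr + alias_offset of the alias mr
    if (mr->alias) {
          /* Rebase so that offset alias_offset of the target mr lands on the alias's start GPA (the recursive call adds mr->alias->addr back) */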
        int128_subfrom(&base, int128_make64(mr->alias->addr));
        int128_subfrom(&base, int128_make64(mr->alias_offset));
        render_memory_region(view, mr->alias, base, clip, readonly);
        return; // the alias itself is not rendered; see Note 1.
    }

    /* Render subregions in priority order. */
     /* Recursively render every subregion into the FlatView */
    QTAILQ_FOREACH(subregion, &mr->subregions, subregions_link) {
        render_memory_region(view, subregion, base, clip, readonly);
    }

    if (!mr->terminates) {
        return;
    }

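     /*
     * Reaching this point means all subregions of this MemoryRegion have been rendered and the mr itself terminates
     */
     /*
     * offset_in_region is the distance from the mr's start GPA (base) to clip.start. Rendering starts at clip.start, so this value becomes the offset_in_region of the FRs created below; it is used later to compute the HVA of the mr's backing memory for each FR
     */
    offset_in_region = int128_get64(int128_sub(clip.start, base));
     /*
     * Prepare to render the mr into FlatRanges; together the FlatRanges make up the FlatView
     * clip is the part of the mr to render
     * Set base to the start of clip and remain to the length still to render
     */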
    base = clip.start;
    remain = clip.size;

    fr.mr = mr;
    fr.dirty_log_mask = mr->dirty_log_mask;
    fr.romd_mode = mr->romd_mode;
    fr.readonly = readonly;

    /* Render the region itself into any gaps left by the current view. */
     /* Start rendering */
    for (i = 0; i < view->nr && int128_nz(remain); ++i) {
          /* Skip FRs in the FlatView that end at or before base */
        if (int128_ge(base, addrrange_end(view->ranges[i].addr))) {
            continue;
        }

          /*
          * base lies before the start of the current range, so there is a gap:
          * render into it
          */
        if (int128_lt(base, view->ranges[i].addr.start)) {
               /* Size of the gap to fill */
            now = int128_min(remain,
                             int128_sub(view->ranges[i].addr.start, base));
               /* Fill in the new FR */
            fr.offset_in_region = offset_in_region;
            fr.addr = addrrange_make(base, now);
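               /* Insert the new FR at position i of the FlatView; the FlatRanges from that position onward shift back by one */
            flatview_insert(view, i, &fr);
               /* ++i so that i again refers to the FlatRange that used to be at the insert position */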
            ++i;

            int128_addto(&base, now);
            offset_in_region += int128_get64(now);
            int128_subfrom(&remain, now);
        }

          /* Compute the length of the part that overlaps ranges[i]; now is the span already occupied by another FR */
        now = int128_sub(int128_min(int128_add(base, remain),
                                    addrrange_end(view->ranges[i].addr)),
                         base);
          /* Skip the overlapping part */
        int128_addto(&base, now);
        offset_in_region += int128_get64(now);
        int128_subfrom(&remain, now);
    }

     /* After walking all existing FlatRanges there is still memory left to render; handle it here */
    if (int128_nz(remain)) {
          /* Fill in the FR */
        fr.offset_in_region = offset_in_region;
        fr.addr = addrrange_make(base, remain);
          /* Insert the FR */
        flatview_insert(view, i, &fr);
    }
}
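The gap-filling loop above is the part that makes earlier (higher-priority) ranges obscure later ones. The following minimal sketch reproduces just that loop on a toy model. `Range`, `View`, `view_insert` and `render_leaf` are hypothetical names, plain `uint64_t` replaces Int128 (so a 2^64-sized range cannot be expressed), and the recursion, alias handling and attributes are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_RANGES 16

/* Hypothetical, simplified stand-ins for FlatRange and FlatView. */
typedef struct {
    int      mr_id;
    uint64_t start;
    uint64_t size;
    uint64_t offset_in_region;
} Range;

typedef struct {
    Range    ranges[MAX_RANGES];
    unsigned nr;
} View;

/* Insert r at position pos, shifting later entries back by one. */
static void view_insert(View *v, unsigned pos, const Range *r)
{
    assert(v->nr < MAX_RANGES && pos <= v->nr);
    memmove(&v->ranges[pos + 1], &v->ranges[pos],
            (v->nr - pos) * sizeof(Range));
    v->ranges[pos] = *r;
    v->nr++;
}

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Render region mr_id covering [start, start + size) into the gaps of v.
 * Ranges already in v are never overwritten. */
static void render_leaf(View *v, int mr_id, uint64_t start, uint64_t size)
{
    uint64_t base = start, remain = size, offset = 0, now;
    Range fr = { .mr_id = mr_id };
    unsigned i;

    for (i = 0; i < v->nr && remain; ++i) {
        uint64_t r_start = v->ranges[i].start;
        uint64_t r_end = r_start + v->ranges[i].size;

        if (base >= r_end) {
            continue;               /* existing range lies before base */
        }
        if (base < r_start) {       /* gap in front of ranges[i]: fill it */
            now = min_u64(remain, r_start - base);
            fr.start = base;
            fr.size = now;
            fr.offset_in_region = offset;
            view_insert(v, i, &fr);
            ++i;                    /* i refers to the old ranges[i] again */
            base += now;
            offset += now;
            remain -= now;
        }
        /* skip the part already covered by the existing range */
        now = min_u64(base + remain, r_end) - base;
        base += now;
        offset += now;
        remain -= now;
    }
    if (remain) {                   /* tail after the last existing range */
        fr.start = base;
        fr.size = remain;
        fr.offset_in_region = offset;
        view_insert(v, i, &fr);
    }
}
```

Rendering region 1 at [10, 20), then region 2 at [30, 40), then region 3 at [0, 50) leaves five ranges: region 3 only appears in [0, 10), [20, 30) and [40, 50), with offset_in_region 0, 20 and 40. Note that offset_in_region keeps advancing over the skipped parts, just as in the original loop.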

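Note 1. The original article says that an alias mr must also render itself, which is wrong. In fact, the "GPA" (base + addr) of an alias mr and that of the mr it points to are not necessarily the same; if both were rendered, the mapping would be duplicated. Below are the address spaces of a virtual machine and its memory regions (the mrs that alias mrs point to), as printed by info mtree.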
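To make the alias arithmetic concrete, here is a minimal sketch of the base adjustment on the alias path. `alias_target_base` and `offset_in_target` are hypothetical helpers, not QEMU functions; they use `uint64_t` where QEMU uses Int128, so an intermediate value may wrap around, which is harmless for unsigned arithmetic as long as the final result fits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the base arithmetic on the alias path:
 * the two int128_subfrom() calls in the caller, followed by the
 * int128_addto() at the top of the recursive call. */
static uint64_t alias_target_base(uint64_t alias_gpa,
                                  uint64_t target_addr,
                                  uint64_t alias_offset)
{
    uint64_t base = alias_gpa;   /* base already includes the alias's addr */

    base -= target_addr;         /* int128_subfrom(&base, mr->alias->addr) */
    base -= alias_offset;        /* int128_subfrom(&base, mr->alias_offset) */
    base += target_addr;         /* recursive call: base += mr->addr */
    return base;
}

/* offset_in_region of a FlatRange that starts at clip_start. */
static uint64_t offset_in_target(uint64_t clip_start, uint64_t target_base)
{
    assert(clip_start >= target_base);
    return clip_start - target_base;
}
```

Take the ram-above-4g entry from the info mtree output in this post: the alias starts at GPA 0x100000000 and points into pc.ram at offset 0xc0000000. Assuming pc.ram's own addr is 0 (which its memory-region entry suggests), the target's base becomes 0x40000000, and the FlatRange that starts at 0x100000000 gets offset_in_region 0xc0000000. That matches the alias line in the dump, and it shows why the target's "GPA" differs from the alias's.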

address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, RW): system
    0000000000000000-00000000bfffffff (prio 0, RW): alias ram-below-4g @pc.ram 0000000000000000-00000000bfffffff
    0000000000000000-ffffffffffffffff (prio -1, RW): pci
      00000000000a0000-00000000000bffff (prio 1, RW): cirrus-lowmem-container
        00000000000a0000-00000000000a7fff (prio 1, RW): alias vga.bank0 @vga.vram 0000000000000000-0000000000007fff
        00000000000a0000-00000000000bffff (prio 0, RW): cirrus-low-memory
        00000000000a8000-00000000000affff (prio 1, RW): alias vga.bank1 @vga.vram 0000000000008000-000000000000ffff
      00000000000c0000-00000000000dffff (prio 1, RW): pc.rom
      00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fc000000-00000000fdffffff (prio 1, RW): cirrus-pci-bar0
        00000000fc000000-00000000fc7fffff (prio 1, RW): vga.vram
        00000000fc000000-00000000fc7fffff (prio 0, RW): cirrus-linear-io
        00000000fd000000-00000000fd3fffff (prio 0, RW): cirrus-bitblt-mmio
      00000000febf0000-00000000febf0fff (prio 1, RW): cirrus-mmio
      00000000febf1000-00000000febf1fff (prio 1, RW): virtio-scsi-pci-msix
        00000000febf1000-00000000febf103f (prio 0, RW): msix-table
        00000000febf1800-00000000febf1807 (prio 0, RW): msix-pba
      00000000febf2000-00000000febf2fff (prio 1, RW): virtio-serial-pci-msix
        00000000febf2000-00000000febf201f (prio 0, RW): msix-table
        00000000febf2800-00000000febf2807 (prio 0, RW): msix-pba
      00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
    00000000000a0000-00000000000bffff (prio 1, RW): alias smram-region @pci 00000000000a0000-00000000000bffff
    00000000000c0000-00000000000c3fff (prio 1, RW): alias pam-ram @pc.ram 00000000000c0000-00000000000c3fff [disabled]
    00000000000c0000-00000000000c3fff (prio 1, RW): alias pam-pci @pc.ram 00000000000c0000-00000000000c3fff [disabled]
    00000000000c0000-00000000000c3fff (prio 1, R-): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
    00000000000c0000-00000000000c3fff (prio 1, RW): alias pam-pci @pci 00000000000c0000-00000000000c3fff [disabled]
    00000000000c4000-00000000000c7fff (prio 1, RW): alias pam-ram @pc.ram 00000000000c4000-00000000000c7fff [disabled]
    00000000000c4000-00000000000c7fff (prio 1, RW): alias pam-pci @pc.ram 00000000000c4000-00000000000c7fff [disabled]
    00000000000c4000-00000000000c7fff (prio 1, R-): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
    00000000000c4000-00000000000c7fff (prio 1, RW): alias pam-pci @pci 00000000000c4000-00000000000c7fff [disabled]
    00000000000c8000-00000000000cbfff (prio 1, RW): alias pam-ram @pc.ram 00000000000c8000-00000000000cbfff [disabled]
    00000000000c8000-00000000000cbfff (prio 1, RW): alias pam-pci @pc.ram 00000000000c8000-00000000000cbfff [disabled]
    00000000000c8000-00000000000cbfff (prio 1, R-): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
    00000000000c8000-00000000000cbfff (prio 1, RW): alias pam-pci @pci 00000000000c8000-00000000000cbfff [disabled]
    00000000000c9000-00000000000cbfff (prio 1000, RW): alias kvmvapic-rom @pc.ram 00000000000c9000-00000000000cbfff
    00000000000cc000-00000000000cffff (prio 1, RW): alias pam-ram @pc.ram 00000000000cc000-00000000000cffff [disabled]
    00000000000cc000-00000000000cffff (prio 1, RW): alias pam-pci @pc.ram 00000000000cc000-00000000000cffff [disabled]
    00000000000cc000-00000000000cffff (prio 1, R-): alias pam-rom @pc.ram 00000000000cc000-00000000000cffff
    00000000000cc000-00000000000cffff (prio 1, RW): alias pam-pci @pci 00000000000cc000-00000000000cffff [disabled]
    00000000000d0000-00000000000d3fff (prio 1, RW): alias pam-ram @pc.ram 00000000000d0000-00000000000d3fff [disabled]
    00000000000d0000-00000000000d3fff (prio 1, RW): alias pam-pci @pc.ram 00000000000d0000-00000000000d3fff [disabled]
    00000000000d0000-00000000000d3fff (prio 1, R-): alias pam-rom @pc.ram 00000000000d0000-00000000000d3fff
    00000000000d0000-00000000000d3fff (prio 1, RW): alias pam-pci @pci 00000000000d0000-00000000000d3fff [disabled]
    00000000000d4000-00000000000d7fff (prio 1, RW): alias pam-ram @pc.ram 00000000000d4000-00000000000d7fff [disabled]
    00000000000d4000-00000000000d7fff (prio 1, RW): alias pam-pci @pc.ram 00000000000d4000-00000000000d7fff [disabled]
    00000000000d4000-00000000000d7fff (prio 1, R-): alias pam-rom @pc.ram 00000000000d4000-00000000000d7fff
    00000000000d4000-00000000000d7fff (prio 1, RW): alias pam-pci @pci 00000000000d4000-00000000000d7fff [disabled]
    00000000000d8000-00000000000dbfff (prio 1, RW): alias pam-ram @pc.ram 00000000000d8000-00000000000dbfff [disabled]
    00000000000d8000-00000000000dbfff (prio 1, RW): alias pam-pci @pc.ram 00000000000d8000-00000000000dbfff [disabled]
    00000000000d8000-00000000000dbfff (prio 1, R-): alias pam-rom @pc.ram 00000000000d8000-00000000000dbfff
    00000000000d8000-00000000000dbfff (prio 1, RW): alias pam-pci @pci 00000000000d8000-00000000000dbfff [disabled]
    00000000000dc000-00000000000dffff (prio 1, RW): alias pam-ram @pc.ram 00000000000dc000-00000000000dffff [disabled]
    00000000000dc000-00000000000dffff (prio 1, RW): alias pam-pci @pc.ram 00000000000dc000-00000000000dffff [disabled]
    00000000000dc000-00000000000dffff (prio 1, R-): alias pam-rom @pc.ram 00000000000dc000-00000000000dffff
    00000000000dc000-00000000000dffff (prio 1, RW): alias pam-pci @pci 00000000000dc000-00000000000dffff [disabled]
    00000000000e0000-00000000000e3fff (prio 1, RW): alias pam-ram @pc.ram 00000000000e0000-00000000000e3fff [disabled]
    00000000000e0000-00000000000e3fff (prio 1, RW): alias pam-pci @pc.ram 00000000000e0000-00000000000e3fff [disabled]
    00000000000e0000-00000000000e3fff (prio 1, R-): alias pam-rom @pc.ram 00000000000e0000-00000000000e3fff
    00000000000e0000-00000000000e3fff (prio 1, RW): alias pam-pci @pci 00000000000e0000-00000000000e3fff [disabled]
    00000000000e4000-00000000000e7fff (prio 1, RW): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff [disabled]
    00000000000e4000-00000000000e7fff (prio 1, RW): alias pam-pci @pc.ram 00000000000e4000-00000000000e7fff [disabled]
    00000000000e4000-00000000000e7fff (prio 1, R-): alias pam-rom @pc.ram 00000000000e4000-00000000000e7fff
    00000000000e4000-00000000000e7fff (prio 1, RW): alias pam-pci @pci 00000000000e4000-00000000000e7fff [disabled]
    00000000000e8000-00000000000ebfff (prio 1, RW): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff [disabled]
    00000000000e8000-00000000000ebfff (prio 1, RW): alias pam-pci @pc.ram 00000000000e8000-00000000000ebfff [disabled]
    00000000000e8000-00000000000ebfff (prio 1, R-): alias pam-rom @pc.ram 00000000000e8000-00000000000ebfff
    00000000000e8000-00000000000ebfff (prio 1, RW): alias pam-pci @pci 00000000000e8000-00000000000ebfff [disabled]
    00000000000ec000-00000000000effff (prio 1, RW): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
    00000000000ec000-00000000000effff (prio 1, RW): alias pam-pci @pc.ram 00000000000ec000-00000000000effff [disabled]
    00000000000ec000-00000000000effff (prio 1, R-): alias pam-rom @pc.ram 00000000000ec000-00000000000effff [disabled]
    00000000000ec000-00000000000effff (prio 1, RW): alias pam-pci @pci 00000000000ec000-00000000000effff [disabled]
    00000000000f0000-00000000000fffff (prio 1, RW): alias pam-ram @pc.ram 00000000000f0000-00000000000fffff [disabled]
    00000000000f0000-00000000000fffff (prio 1, RW): alias pam-pci @pc.ram 00000000000f0000-00000000000fffff [disabled]
    00000000000f0000-00000000000fffff (prio 1, R-): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
    00000000000f0000-00000000000fffff (prio 1, RW): alias pam-pci @pci 00000000000f0000-00000000000fffff [disabled]
    00000000fec00000-00000000fec00fff (prio 0, RW): kvm-ioapic
    00000000fee00000-00000000feefffff (prio 4096, RW): kvm-apic-msi
    0000000100000000-000000013fffffff (prio 0, RW): alias ram-above-4g @pc.ram 00000000c0000000-00000000ffffffff

address-space: I/O
  0000000000000000-000000000000ffff (prio 0, RW): io
    0000000000000000-0000000000000007 (prio 0, RW): dma-chan
    0000000000000008-000000000000000f (prio 0, RW): dma-cont
    0000000000000020-0000000000000021 (prio 0, RW): kvm-pic
    0000000000000040-0000000000000043 (prio 0, RW): kvm-pit
    0000000000000060-0000000000000060 (prio 0, RW): i8042-data
    0000000000000061-0000000000000061 (prio 0, RW): pcspk
    0000000000000064-0000000000000064 (prio 0, RW): i8042-cmd
    0000000000000070-0000000000000071 (prio 0, RW): rtc
    000000000000007e-000000000000007f (prio 0, RW): kvmvapic
    0000000000000080-0000000000000080 (prio 0, RW): ioport80
    0000000000000081-0000000000000083 (prio 0, RW): dma-page
    0000000000000087-0000000000000087 (prio 0, RW): dma-page
    0000000000000089-000000000000008b (prio 0, RW): dma-page
    000000000000008f-000000000000008f (prio 0, RW): dma-page
    0000000000000092-0000000000000092 (prio 0, RW): port92
    00000000000000a0-00000000000000a1 (prio 0, RW): kvm-pic
    00000000000000b2-00000000000000b3 (prio 0, RW): apm-io
    00000000000000c0-00000000000000cf (prio 0, RW): dma-chan
    00000000000000d0-00000000000000df (prio 0, RW): dma-cont
    00000000000000f0-00000000000000f0 (prio 0, RW): ioportF0
    0000000000000170-0000000000000177 (prio 0, RW): ide
    00000000000001f0-00000000000001f7 (prio 0, RW): ide
    0000000000000376-0000000000000376 (prio 0, RW): ide
    00000000000003b0-00000000000003df (prio 0, RW): cirrus-io
    00000000000003f1-00000000000003f5 (prio 0, RW): fdc
    00000000000003f6-00000000000003f6 (prio 0, RW): ide
    00000000000003f7-00000000000003f7 (prio 0, RW): fdc
    00000000000003f8-00000000000003ff (prio 0, RW): serial
    00000000000004d0-00000000000004d0 (prio 0, RW): kvm-elcr
    00000000000004d1-00000000000004d1 (prio 0, RW): kvm-elcr
    0000000000000510-0000000000000511 (prio 0, RW): fwcfg
    0000000000000514-000000000000051b (prio 0, RW): fwcfg.dma
    0000000000000600-000000000000063f (prio 0, RW): piix4-pm
      0000000000000600-0000000000000603 (prio 0, RW): acpi-evt
      0000000000000604-0000000000000605 (prio 0, RW): acpi-cnt
      0000000000000608-000000000000060b (prio 0, RW): acpi-tmr
    0000000000000700-000000000000073f (prio 0, RW): pm-smbus
    0000000000000cf8-0000000000000cfb (prio 0, RW): pci-conf-idx
    0000000000000cf9-0000000000000cf9 (prio 1, RW): piix3-reset-control
    0000000000000cfc-0000000000000cff (prio 0, RW): pci-conf-data
    0000000000005658-0000000000005658 (prio 0, RW): vmport
    000000000000ae00-000000000000ae13 (prio 0, RW): acpi-pci-hotplug
    000000000000af00-000000000000af1f (prio 0, RW): acpi-cpu-hotplug
    000000000000afe0-000000000000afe3 (prio 0, RW): acpi-gpe0
    000000000000c000-000000000000c0ff (prio 1, RW): pv_channel
    000000000000c100-000000000000c13f (prio 1, RW): virtio-pci
    000000000000c140-000000000000c15f (prio 1, RW): uhci
    000000000000c160-000000000000c17f (prio 1, RW): virtio-pci
    000000000000c180-000000000000c19f (prio 1, RW): virtio-pci
    000000000000c1a0-000000000000c1af (prio 1, RW): piix-bmdma-container
      000000000000c1a0-000000000000c1a3 (prio 0, RW): piix-bmdma
      000000000000c1a4-000000000000c1a7 (prio 0, RW): bmdma
      000000000000c1a8-000000000000c1ab (prio 0, RW): piix-bmdma
      000000000000c1ac-000000000000c1af (prio 0, RW): bmdma

address-space: i440FX
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff [disabled]

address-space: PIIX3
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff [disabled]

address-space: piix3-ide
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: PIIX4_PM
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff [disabled]

address-space: piix3-usb-uhci
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: virtio-scsi-pci
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: virtio-pci-cfg-as
  0000000000000000-00000000007fffff (prio 0, RW): alias virtio-pci-cfg @virtio-pci 0000000000000000-00000000007fffff

address-space: virtio-serial-pci
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: virtio-pci-cfg-as
  0000000000000000-00000000007fffff (prio 0, RW): alias virtio-pci-cfg @virtio-pci 0000000000000000-00000000007fffff

address-space: cirrus-vga
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff [disabled]

address-space: virtio-balloon-pci
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff

address-space: virtio-pci-cfg-as
  0000000000000000-00000000007fffff (prio 0, RW): alias virtio-pci-cfg @virtio-pci 0000000000000000-00000000007fffff

address-space: pv_channel
  0000000000000000-ffffffffffffffff (prio 0, RW): alias bus master @system 0000000000000000-ffffffffffffffff [disabled]

address-space: KVM-SMRAM
  0000000000000000-ffffffffffffffff (prio 0, RW): mem-container-smram
    0000000000000000-00000000ffffffff (prio 10, RW): smram
      00000000000a0000-00000000000bffff (prio 0, RW): alias smram-low @pc.ram 00000000000a0000-00000000000bffff
    0000000000000000-ffffffffffffffff (prio 0, RW): alias mem-smram @system 0000000000000000-ffffffffffffffff

memory-region: pc.ram
  0000000000000000-00000000ffffffff (prio 0, RW): pc.ram

memory-region: vga.vram
  0000000000000000-00000000007fffff (prio 1, RW): vga.vram

memory-region: pc.bios
  00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios

memory-region: pci
  0000000000000000-ffffffffffffffff (prio -1, RW): pci
    00000000000a0000-00000000000bffff (prio 1, RW): cirrus-lowmem-container
      00000000000a0000-00000000000a7fff (prio 1, RW): alias vga.bank0 @vga.vram 0000000000000000-0000000000007fff
      00000000000a0000-00000000000bffff (prio 0, RW): cirrus-low-memory
      00000000000a8000-00000000000affff (prio 1, RW): alias vga.bank1 @vga.vram 0000000000008000-000000000000ffff
    00000000000c0000-00000000000dffff (prio 1, RW): pc.rom
    00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
    00000000fc000000-00000000fdffffff (prio 1, RW): cirrus-pci-bar0
      00000000fc000000-00000000fc7fffff (prio 1, RW): vga.vram
      00000000fc000000-00000000fc7fffff (prio 0, RW): cirrus-linear-io
      00000000fd000000-00000000fd3fffff (prio 0, RW): cirrus-bitblt-mmio
    00000000febf0000-00000000febf0fff (prio 1, RW): cirrus-mmio
    00000000febf1000-00000000febf1fff (prio 1, RW): virtio-scsi-pci-msix
      00000000febf1000-00000000febf103f (prio 0, RW): msix-table
      00000000febf1800-00000000febf1807 (prio 0, RW): msix-pba
    00000000febf2000-00000000febf2fff (prio 1, RW): virtio-serial-pci-msix
      00000000febf2000-00000000febf201f (prio 0, RW): msix-table
      00000000febf2800-00000000febf2807 (prio 0, RW): msix-pba
    00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios

memory-region: system
  0000000000000000-ffffffffffffffff (prio 0, RW): system
    0000000000000000-00000000bfffffff (prio 0, RW): alias ram-below-4g @pc.ram 0000000000000000-00000000bfffffff
    0000000000000000-ffffffffffffffff (prio -1, RW): pci
      00000000000a0000-00000000000bffff (prio 1, RW): cirrus-lowmem-container
        00000000000a0000-00000000000a7fff (prio 1, RW): alias vga.bank0 @vga.vram 0000000000000000-0000000000007fff
        00000000000a0000-00000000000bffff (prio 0, RW): cirrus-low-memory
        00000000000a8000-00000000000affff (prio 1, RW): alias vga.bank1 @vga.vram 0000000000008000-000000000000ffff
      00000000000c0000-00000000000dffff (prio 1, RW): pc.rom
      00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
      00000000fc000000-00000000fdffffff (prio 1, RW): cirrus-pci-bar0
        00000000fc000000-00000000fc7fffff (prio 1, RW): vga.vram
        00000000fc000000-00000000fc7fffff (prio 0, RW): cirrus-linear-io
        00000000fd000000-00000000fd3fffff (prio 0, RW): cirrus-bitblt-mmio
      00000000febf0000-00000000febf0fff (prio 1, RW): cirrus-mmio
      00000000febf1000-00000000febf1fff (prio 1, RW): virtio-scsi-pci-msix
        00000000febf1000-00000000febf103f (prio 0, RW): msix-table
        00000000febf1800-00000000febf1807 (prio 0, RW): msix-pba
      00000000febf2000-00000000febf2fff (prio 1, RW): virtio-serial-pci-msix
        00000000febf2000-00000000febf201f (prio 0, RW): msix-table
        00000000febf2800-00000000febf2807 (prio 0, RW): msix-pba
      00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
    00000000000a0000-00000000000bffff (prio 1, RW): alias smram-region @pci 00000000000a0000-00000000000bffff
    00000000000c0000-00000000000c3fff (prio 1, RW): alias pam-ram @pc.ram 00000000000c0000-00000000000c3fff [disabled]
    00000000000c0000-00000000000c3fff (prio 1, RW): alias pam-pci @pc.ram 00000000000c0000-00000000000c3fff [disabled]
    00000000000c0000-00000000000c3fff (prio 1, R-): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
    00000000000c0000-00000000000c3fff (prio 1, RW): alias pam-pci @pci 00000000000c0000-00000000000c3fff [disabled]
    00000000000c4000-00000000000c7fff (prio 1, RW): alias pam-ram @pc.ram 00000000000c4000-00000000000c7fff [disabled]
    00000000000c4000-00000000000c7fff (prio 1, RW): alias pam-pci @pc.ram 00000000000c4000-00000000000c7fff [disabled]
    00000000000c4000-00000000000c7fff (prio 1, R-): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
    00000000000c4000-00000000000c7fff (prio 1, RW): alias pam-pci @pci 00000000000c4000-00000000000c7fff [disabled]
    00000000000c8000-00000000000cbfff (prio 1, RW): alias pam-ram @pc.ram 00000000000c8000-00000000000cbfff [disabled]
    00000000000c8000-00000000000cbfff (prio 1, RW): alias pam-pci @pc.ram 00000000000c8000-00000000000cbfff [disabled]
    00000000000c8000-00000000000cbfff (prio 1, R-): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
    00000000000c8000-00000000000cbfff (prio 1, RW): alias pam-pci @pci 00000000000c8000-00000000000cbfff [disabled]
    00000000000c9000-00000000000cbfff (prio 1000, RW): alias kvmvapic-rom @pc.ram 00000000000c9000-00000000000cbfff
    00000000000cc000-00000000000cffff (prio 1, RW): alias pam-ram @pc.ram 00000000000cc000-00000000000cffff [disabled]
    00000000000cc000-00000000000cffff (prio 1, RW): alias pam-pci @pc.ram 00000000000cc000-00000000000cffff [disabled]
    00000000000cc000-00000000000cffff (prio 1, R-): alias pam-rom @pc.ram 00000000000cc000-00000000000cffff
    00000000000cc000-00000000000cffff (prio 1, RW): alias pam-pci @pci 00000000000cc000-00000000000cffff [disabled]
    00000000000d0000-00000000000d3fff (prio 1, RW): alias pam-ram @pc.ram 00000000000d0000-00000000000d3fff [disabled]
    00000000000d0000-00000000000d3fff (prio 1, RW): alias pam-pci @pc.ram 00000000000d0000-00000000000d3fff [disabled]
    00000000000d0000-00000000000d3fff (prio 1, R-): alias pam-rom @pc.ram 00000000000d0000-00000000000d3fff
    00000000000d0000-00000000000d3fff (prio 1, RW): alias pam-pci @pci 00000000000d0000-00000000000d3fff [disabled]
    00000000000d4000-00000000000d7fff (prio 1, RW): alias pam-ram @pc.ram 00000000000d4000-00000000000d7fff [disabled]
    00000000000d4000-00000000000d7fff (prio 1, RW): alias pam-pci @pc.ram 00000000000d4000-00000000000d7fff [disabled]
    00000000000d4000-00000000000d7fff (prio 1, R-): alias pam-rom @pc.ram 00000000000d4000-00000000000d7fff
    00000000000d4000-00000000000d7fff (prio 1, RW): alias pam-pci @pci 00000000000d4000-00000000000d7fff [disabled]
    00000000000d8000-00000000000dbfff (prio 1, RW): alias pam-ram @pc.ram 00000000000d8000-00000000000dbfff [disabled]
    00000000000d8000-00000000000dbfff (prio 1, RW): alias pam-pci @pc.ram 00000000000d8000-00000000000dbfff [disabled]
    00000000000d8000-00000000000dbfff (prio 1, R-): alias pam-rom @pc.ram 00000000000d8000-00000000000dbfff
    00000000000d8000-00000000000dbfff (prio 1, RW): alias pam-pci @pci 00000000000d8000-00000000000dbfff [disabled]
    00000000000dc000-00000000000dffff (prio 1, RW): alias pam-ram @pc.ram 00000000000dc000-00000000000dffff [disabled]
    00000000000dc000-00000000000dffff (prio 1, RW): alias pam-pci @pc.ram 00000000000dc000-00000000000dffff [disabled]
    00000000000dc000-00000000000dffff (prio 1, R-): alias pam-rom @pc.ram 00000000000dc000-00000000000dffff
    00000000000dc000-00000000000dffff (prio 1, RW): alias pam-pci @pci 00000000000dc000-00000000000dffff [disabled]
    00000000000e0000-00000000000e3fff (prio 1, RW): alias pam-ram @pc.ram 00000000000e0000-00000000000e3fff [disabled]
    00000000000e0000-00000000000e3fff (prio 1, RW): alias pam-pci @pc.ram 00000000000e0000-00000000000e3fff [disabled]
    00000000000e0000-00000000000e3fff (prio 1, R-): alias pam-rom @pc.ram 00000000000e0000-00000000000e3fff
    00000000000e0000-00000000000e3fff (prio 1, RW): alias pam-pci @pci 00000000000e0000-00000000000e3fff [disabled]
    00000000000e4000-00000000000e7fff (prio 1, RW): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff [disabled]
    00000000000e4000-00000000000e7fff (prio 1, RW): alias pam-pci @pc.ram 00000000000e4000-00000000000e7fff [disabled]
    00000000000e4000-00000000000e7fff (prio 1, R-): alias pam-rom @pc.ram 00000000000e4000-00000000000e7fff
    00000000000e4000-00000000000e7fff (prio 1, RW): alias pam-pci @pci 00000000000e4000-00000000000e7fff [disabled]
    00000000000e8000-00000000000ebfff (prio 1, RW): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff [disabled]
    00000000000e8000-00000000000ebfff (prio 1, RW): alias pam-pci @pc.ram 00000000000e8000-00000000000ebfff [disabled]
    00000000000e8000-00000000000ebfff (prio 1, R-): alias pam-rom @pc.ram 00000000000e8000-00000000000ebfff
    00000000000e8000-00000000000ebfff (prio 1, RW): alias pam-pci @pci 00000000000e8000-00000000000ebfff [disabled]
    00000000000ec000-00000000000effff (prio 1, RW): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
    00000000000ec000-00000000000effff (prio 1, RW): alias pam-pci @pc.ram 00000000000ec000-00000000000effff [disabled]
    00000000000ec000-00000000000effff (prio 1, R-): alias pam-rom @pc.ram 00000000000ec000-00000000000effff [disabled]
    00000000000ec000-00000000000effff (prio 1, RW): alias pam-pci @pci 00000000000ec000-00000000000effff [disabled]
    00000000000f0000-00000000000fffff (prio 1, RW): alias pam-ram @pc.ram 00000000000f0000-00000000000fffff [disabled]
    00000000000f0000-00000000000fffff (prio 1, RW): alias pam-pci @pc.ram 00000000000f0000-00000000000fffff [disabled]
    00000000000f0000-00000000000fffff (prio 1, R-): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
    00000000000f0000-00000000000fffff (prio 1, RW): alias pam-pci @pci 00000000000f0000-00000000000fffff [disabled]
    00000000fec00000-00000000fec00fff (prio 0, RW): kvm-ioapic
    00000000fee00000-00000000feefffff (prio 4096, RW): kvm-apic-msi
    0000000100000000-000000013fffffff (prio 0, RW): alias ram-above-4g @pc.ram 00000000c0000000-00000000ffffffff

memory-region: virtio-pci
  0000000000000000-00000000007fffff (prio 0, RW): virtio-pci

memory-region: virtio-pci
  0000000000000000-00000000007fffff (prio 0, RW): virtio-pci

memory-region: virtio-pci
  0000000000000000-00000000007fffff (prio 0, RW): virtio-pci
cheney-lin commented 6 years ago

Build parameters

./configure --prefix=/usr --libdir=/usr/lib64 --sysconfdir=/etc --interp-prefix=/usr/qemu-%M --libexecdir=/usr/libexec --with-confsuffix=/qemu-kvm --target-list=x86_64-softmmu --enable-rdma --enable-kvm --enable-numa --disable-sdl --disable-spice --disable-smartcard --enable-linux-aio --enable-debug --enable-hotpatch --enable-debug-info

cheney-lin commented 6 years ago

QEMU BIOS loading process

https://www.ibm.com/developerworks/cn/linux/1410_qiaoly_qemubios/

cheney-lin commented 6 years ago

https://www.seabios.org/Execution_and_code_flow

Seabios Execution and code flow

This page provides a high-level description of some of the major code phases that SeaBIOS transitions through and general information on overall code flow.

SeaBIOS code phases

The SeaBIOS code goes through a few distinct code phases during its execution lifecycle. Understanding these code phases can help when reading and enhancing the code.

POST phase

The Power On Self Test (POST) phase is the initialization phase of the BIOS. This phase is entered when SeaBIOS first starts execution. The goal of the phase is to initialize internal state, initialize external interfaces, detect and setup hardware, and to then start the boot phase.

On emulators, this phase starts when the CPU starts execution in 16bit mode at 0xFFFF0000:FFF0. The emulators map the SeaBIOS binary to this address, and SeaBIOS arranges for romlayout.S:reset_vector() to be present there. This code calls romlayout.S:entry_post() which then calls post.c:handle_post() in 32bit mode.

On coreboot, the build arranges for romlayout.S:entry_elf() to be called in 32bit mode. This then calls post.c:handle_post().

On CSM, the build arranges for romlayout.S:entry_csm() to be called (in 16bit mode). This then calls csm.c:handle_csm() in 32bit mode. Unlike on the emulators and coreboot, the SeaBIOS CSM POST phase is orchestrated with UEFI and there are several calls back and forth between SeaBIOS and UEFI via handle_csm() throughout the POST process.

The POST phase itself has several sub-phases.

- The "preinit" sub-phase: code run prior to code relocation.
- The "init" sub-phase: code to initialize internal variables and interfaces.
- The "setup" sub-phase: code to setup hardware and drivers.
- The "prepboot" sub-phase: code to finalize interfaces and prepare for the boot phase.

At completion of the POST phase, SeaBIOS invokes an "int 0x19" software interrupt in 16bit mode which begins the boot phase.

Boot phase

The goal of the boot phase is to load the first portion of the operating system's boot loader into memory and start execution of that boot loader. This phase starts when a software interrupt ("int 0x19" or "int 0x18") is invoked. The code flow starts in 16bit mode in romlayout.S:entry_19() or romlayout.S:entry_18() which then transition to 32bit mode and call boot.c:handle_19() or boot.c:handle_18().

The boot phase is technically also part of the "runtime" phase of SeaBIOS. It is typically invoked immediately after the POST phase, but it can also be invoked by an operating system or be invoked multiple times in an attempt to find a valid boot media. Although the boot phase C code runs in 32bit mode it does not have write access to the 0x0f0000-0x100000 memory region and can not call the various malloc_X() calls. See Memory Model for more information.

Main runtime phase

The main runtime phase occurs after the boot phase starts the operating system. Once in this phase, the SeaBIOS code may be invoked by the operating system using various 16bit and 32bit calls. The goal of this phase is to support these legacy calling interfaces and to provide compatibility with BIOS standards. There are multiple entry points for the BIOS - see the entry_XXX() assembler functions in romlayout.S.

Callers use most of these legacy entry points by setting up a particular CPU register state, invoking the BIOS, and then inspecting the returned CPU register state. To handle this, SeaBIOS will backup the current register state into a "struct bregs" (see romlayout.S, entryfuncs.S, and bregs.h) on call entry and then pass this struct to the C code. The C code can then inspect the register state and modify it. The assembler entry functions will then restore the (possibly modified) register state from the "struct bregs" on return to the caller.
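The register hand-off described above can be sketched in plain C. This is a simplified model rather than SeaBIOS code: `struct regs_snapshot`, `handle_service` and `call_bios` are hypothetical names, the service number is made up, and the real "struct bregs" has a different layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for SeaBIOS's "struct bregs". */
struct regs_snapshot {
    uint16_t ax, bx, cx, dx;
    uint16_t flags;
};

#define FLAG_CARRY 0x0001

/* C-level handler: inspects the saved registers and edits them in place. */
static void handle_service(struct regs_snapshot *regs)
{
    assert(regs);
    uint8_t function = regs->ax >> 8;      /* AH selects the function */
    if (function == 0x01) {
        regs->bx = regs->cx + regs->dx;    /* made-up service: BX = CX + DX */
        regs->flags &= ~FLAG_CARRY;        /* carry clear signals success */
    } else {
        regs->flags |= FLAG_CARRY;         /* carry set signals an error */
    }
}

/* Models the assembler entry stub: save, call into C, restore. */
static struct regs_snapshot call_bios(struct regs_snapshot caller)
{
    struct regs_snapshot saved = caller;   /* "backup" on entry */
    handle_service(&saved);
    return saved;                          /* "restore" on return */
}
```

The caller only ever sees the returned snapshot, which mirrors how the real entry stubs restore the possibly modified register state before returning.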

Resume and reboot

As noted above, on emulators SeaBIOS handles the 0xFFFF0000:FFF0 machine startup execution vector. This vector is also called on machine faults and on some machine "resume" events. It can also be called (as 0xF0000:FFF0) by software as a request to reboot the machine (on emulators, coreboot, and CSM).

The SeaBIOS "resume and reboot" code handles these calls and attempts to determine the desired action of the caller. Code flow starts in 16bit mode in romlayout.S:reset_vector() which calls romlayout.S:entry_post() which calls romlayout.S:entry_resume() which calls resume.c:handle_resume(). Depending on the request the handle_resume() code may transition to 32bit mode.

Technically this code is part of the "runtime" phase, so even though parts of it run in 32bit mode it still has the same limitations of the runtime phase.

Threads

Internally SeaBIOS implements a simple cooperative multi-tasking system. The system works by giving each "thread" its own stack, and the system round-robins between these stacks whenever a thread issues a yield() call. This "threading" system may be more appropriately described as coroutines. These "threads" do not run on multiple CPUs and are not preempted, so atomic memory accesses and complex locking is not required.

The goal of these threads is to reduce overall boot time by parallelizing hardware delays. (For example, by allowing the wait for an ATA hard drive to spin-up and respond to commands to occur in parallel with the wait for a PS/2 keyboard to respond to a setup command.) These hardware setup threads are only available during the "setup" sub-phase of the POST phase.

The code that implements threads is in stacks.c.
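To see why round-robin scheduling shortens boot time, here is a small model in C. It is only a sketch under strong assumptions: the real code in stacks.c switches stacks, whereas this model represents each "thread" as a step function that returns at every yield point. `struct hw_task`, `task_step` and `run_round_robin` are hypothetical names.

```c
#include <assert.h>

/* Hypothetical model: each "thread" waits on a device that needs some
 * number of timer ticks before it responds. */
struct hw_task {
    int ticks_needed;   /* remaining ticks until the device is ready */
    int done;
};

/* One scheduling slice: check the device, then "yield" by returning. */
static void task_step(struct hw_task *t)
{
    assert(t);
    if (t->ticks_needed == 0) {
        t->done = 1;
    }
}

/* Round-robin over all tasks until each has finished.
 * Returns the number of ticks that elapsed. */
static int run_round_robin(struct hw_task *tasks, int count)
{
    int elapsed = 0;
    for (;;) {
        int pending = 0;
        for (int i = 0; i < count; i++) {
            if (!tasks[i].done) {
                task_step(&tasks[i]);
                pending += !tasks[i].done;
            }
        }
        if (!pending) {
            return elapsed;
        }
        /* One tick passes; every device makes progress at the same time. */
        elapsed++;
        for (int i = 0; i < count; i++) {
            if (tasks[i].ticks_needed > 0) {
                tasks[i].ticks_needed--;
            }
        }
    }
}
```

With two devices that need 5 and 3 ticks, the model finishes after 5 ticks instead of the 8 a strictly sequential wait would take, which is the overlap of hardware delays the text describes.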

Hardware interrupts

The SeaBIOS C code always runs with hardware interrupts disabled. All of the C code entry points (see romlayout.S) are careful to explicitly disable hardware interrupts (via "cli"). Because running with interrupts disabled increases interrupt latency, any C code that could loop for a significant amount of time (more than about 1 ms) should periodically call yield(). The yield() call will briefly enable hardware interrupts to occur, then disable interrupts, and then resume execution of the C code.

There are two main reasons why SeaBIOS always runs C code with interrupts disabled. The first reason is that external software may override the default SeaBIOS handlers that are called on a hardware interrupt event. Indeed, it is common for DOS based applications to do this. These legacy third party interrupt handlers may have undocumented expectations (such as stack location and stack size) and may attempt to call back into the various SeaBIOS software services. Greater compatibility and more reproducible results can be achieved by only permitting hardware interrupts at specific points (via yield() calls). The second reason is that much of SeaBIOS runs in 32bit mode. Attempting to handle interrupts in both 16bit mode and 32bit mode and switching between modes to delegate those interrupts is an unneeded complexity. Although disabling interrupts can increase interrupt latency, this only impacts legacy systems where the small increase in interrupt latency is unlikely to be noticeable.

Extra 16bit stack

SeaBIOS implements 16bit real mode handlers for both hardware interrupts and software request "interrupts". In a traditional BIOS, these requests would use the caller's stack space. However, the minimum amount of space the caller must provide has not been standardized and very old DOS programs have been observed to allocate very small amounts of stack space (100 bytes or less).

By default, SeaBIOS now switches to its own stack on most 16bit real mode entry points. This extra stack space is allocated in "low memory". It ensures SeaBIOS uses a minimal amount of a callers stack (typically no more than 16 bytes) for these legacy calls. (More recently defined BIOS interfaces such as those that support 16bit protected and 32bit protected mode calls standardize a minimum stack size with adequate space, and SeaBIOS generally will not use its extra stack in these cases.)

The code to implement this stack "hopping" is in romlayout.S and in stacks.c.

cheney-lin commented 6 years ago

Using RCU (Read-Copy-Update) for synchronization

================================================

Read-copy update (RCU) is a synchronization mechanism that is used to protect read-mostly data structures. RCU is very efficient and scalable on the read side (it is wait-free), and thus can make the read paths extremely fast.

RCU supports concurrency between a single writer and multiple readers, thus it is not used alone. Typically, the write-side will use a lock to serialize multiple updates, but other approaches are possible (e.g., restricting updates to a single task). In QEMU, when a lock is used, this will often be the "iothread mutex", also known as the "big QEMU lock" (BQL). Also, restricting updates to a single task is done in QEMU using the "bottom half" API.

RCU is fundamentally a "wait-to-finish" mechanism. The read side marks sections of code with "critical sections", and the update side will wait for the execution of all currently running critical sections before proceeding, or before asynchronously executing a callback.

The key point here is that only the currently running critical sections are waited for; critical sections that are started after the beginning of the wait do not extend the wait, despite running concurrently with the updater. This is the reason why RCU is more scalable than, for example, reader-writer locks. It is so much more scalable that the system will have a single instance of the RCU mechanism; a single mechanism can be used for an arbitrary number of "things", without having to worry about things such as contention or deadlocks.

How is this possible? The basic idea is to split updates in two phases, "removal" and "reclamation". During removal, we ensure that subsequent readers will not be able to get a reference to the old data. After removal has completed, a critical section will not be able to access the old data. Therefore, critical sections that begin after removal do not matter; as soon as all previous critical sections have finished, there cannot be any readers who hold references to the data structure, and these can now be safely reclaimed (e.g., freed or unref'ed).

Here is a picture:

    thread 1                  thread 2                  thread 3
-------------------    ------------------------    -------------------
enter RCU crit.sec.
       |                finish removal phase
       |                begin wait
       |                      |                    enter RCU crit.sec.
exit RCU crit.sec             |                           |
                        complete wait                     |
                        begin reclamation phase           |
                                                   exit RCU crit.sec.

Note how thread 3 is still executing its critical section when thread 2 starts reclaiming data. This is possible, because the old version of the data structure was not accessible at the time thread 3 began executing that critical section.
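The picture can be replayed with a toy, single-threaded model in C. This is a sketch only: it tracks readers in a bitmask instead of the per-thread state a real RCU implementation uses, and every name in it (`struct toy_rcu`, `toy_read_lock`, `toy_begin_wait`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy, single-threaded model of an RCU grace period for up to 32 readers.
 * All names are hypothetical; real RCU keeps per-thread state. */
struct toy_rcu {
    uint32_t active;    /* bit i set: reader i is inside a critical section */
    uint32_t waited;    /* readers the updater is still waiting for */
};

static void toy_read_lock(struct toy_rcu *r, int id)
{
    assert(id >= 0 && id < 32);
    r->active |= UINT32_C(1) << id;
}

static void toy_read_unlock(struct toy_rcu *r, int id)
{
    assert(id >= 0 && id < 32);
    r->active &= ~(UINT32_C(1) << id);
    r->waited &= ~(UINT32_C(1) << id);
}

/* Snapshot only the readers that are running right now. */
static void toy_begin_wait(struct toy_rcu *r)
{
    r->waited = r->active;
}

/* The wait is over once every snapshotted reader has exited. */
static int toy_wait_complete(const struct toy_rcu *r)
{
    return r->waited == 0;
}
```

Because `toy_begin_wait` takes a snapshot, a reader that enters afterwards (thread 3 in the picture) never appears in `waited` and so cannot extend the wait.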

RCU API

The core RCU API is small:

 void rcu_read_lock(void);

    Used by a reader to inform the reclaimer that the reader is
    entering an RCU read-side critical section.

 void rcu_read_unlock(void);

    Used by a reader to inform the reclaimer that the reader is
    exiting an RCU read-side critical section.  Note that RCU
    read-side critical sections may be nested and/or overlapping.

 void synchronize_rcu(void);

    Blocks until all pre-existing RCU read-side critical sections
    on all threads have completed.  This marks the end of the removal
    phase and the beginning of reclamation phase.

    Note that it would be valid for another update to come while
    synchronize_rcu is running.  Because of this, it is better that
    the updater releases any locks it may hold before calling
    synchronize_rcu.  If this is not possible (for example, because
    the updater is protected by the BQL), you can use call_rcu.

 void call_rcu1(struct rcu_head * head,
                void (*func)(struct rcu_head *head));

    This function invokes func(head) after all pre-existing RCU
    read-side critical sections on all threads have completed.  This
    marks the end of the removal phase, with func taking care
    asynchronously of the reclamation phase.

    The foo struct needs to have an rcu_head structure added,
    perhaps as follows:

        struct foo {
            struct rcu_head rcu;
            int a;
            char b;
            long c;
        };

    so that the reclaimer function can fetch the struct foo address
    and free it:

        call_rcu1(&foo.rcu, foo_reclaim);

        void foo_reclaim(struct rcu_head *rp)
        {
            struct foo *fp = container_of(rp, struct foo, rcu);
            g_free(fp);
        }

    For the common case where the rcu_head member is the first of the
    struct, you can use the following macro.

 void call_rcu(T *p,
               void (*func)(T *p),
               field-name);
 void g_free_rcu(T *p,
                 field-name);

    call_rcu1 is typically used through these macros, in the common case
    where the "struct rcu_head" is the first field in the struct.  If
    the callback function is g_free, in particular, g_free_rcu can be
    used.  In the above case, one could have written simply:

        g_free_rcu(&foo, rcu);
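The deferred-callback idea behind call_rcu1 and container_of can be tried out with a self-contained toy. This is a sketch, not QEMU's implementation: `toy_rcu_head`, `toy_call_rcu1` and `toy_grace_period_elapsed` are hypothetical stand-ins, and the "grace period" is simply whenever the caller decides to drain the queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the real rcu_head and call_rcu1. */
struct toy_rcu_head {
    struct toy_rcu_head *next;
    void (*func)(struct toy_rcu_head *head);
};

#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct toy_rcu_head *pending;   /* callbacks waiting for a grace period */

/* Queue func(head) instead of running it right away. */
static void toy_call_rcu1(struct toy_rcu_head *head,
                          void (*func)(struct toy_rcu_head *head))
{
    assert(head && func);
    head->func = func;
    head->next = pending;
    pending = head;
}

/* Pretend a grace period has ended: run every queued callback. */
static int toy_grace_period_elapsed(void)
{
    int ran = 0;
    while (pending) {
        struct toy_rcu_head *head = pending;
        pending = head->next;
        head->func(head);
        ran++;
    }
    return ran;
}

struct foo {
    int a;
    struct toy_rcu_head rcu;   /* deliberately not the first member */
};

static int reclaimed_a;

static void foo_reclaim(struct toy_rcu_head *rp)
{
    struct foo *fp = toy_container_of(rp, struct foo, rcu);
    reclaimed_a = fp->a;       /* recorded so the effect is observable */
    free(fp);
}
```

Placing the head in the middle of `struct foo` shows why the callback needs container_of: the pointer it receives is the address of the embedded head, not of the enclosing object.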

 typeof(*p) atomic_rcu_read(p);

    atomic_rcu_read() is similar to atomic_mb_read(), but it makes
    some assumptions on the code that calls it.  This allows a more
    optimized implementation.

    atomic_rcu_read assumes that whenever a single RCU critical
    section reads multiple shared data, these reads are either
    data-dependent or need no ordering.  This is almost always the
    case when using RCU, because read-side critical sections typically
    navigate one or more pointers (the pointers that are changed on
    every update) until reaching a data structure of interest,
    and then read from there.

    RCU read-side critical sections must use atomic_rcu_read() to
    read data, unless concurrent writes are prevented by another
    synchronization mechanism.

    Furthermore, RCU read-side critical sections should traverse the
    data structure in a single direction, opposite to the direction
    in which the updater initializes it.

 void atomic_rcu_set(p, typeof(*p) v);

    atomic_rcu_set() is also similar to atomic_mb_set(), and it also
    makes assumptions on the code that calls it in order to allow a more
    optimized implementation.

    In particular, atomic_rcu_set() suffices for synchronization
    with readers, if the updater never mutates a field within a
    data item that is already accessible to readers.  This is the
    case when initializing a new copy of the RCU-protected data
    structure; just ensure that initialization of *p is carried out
    before atomic_rcu_set() makes the data item visible to readers.
    If this rule is observed, writes will happen in the opposite
    order as reads in the RCU read-side critical sections (or if
    there is just one update), and there will be no need for other
    synchronization mechanism to coordinate the accesses.
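The publish-then-read discipline can be illustrated with standard C11 atomics. This is a sketch under assumptions: it uses `<stdatomic.h>` rather than QEMU's macros, `memory_order_acquire` serves as a conservative stand-in for the ordering a dependency-ordered read needs, and `struct config`, `publish_config` and `read_area` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct config {
    int width;
    int height;
};

/* The RCU-protected pointer that readers follow. */
static _Atomic(struct config *) current_config;

/* Updater (assumed to run under a lock): fully initialize the new copy,
 * then publish it with release ordering so the initialization becomes
 * visible no later than the pointer itself. Returns the old version,
 * which the caller may reclaim only after a grace period. */
static struct config *publish_config(int width, int height)
{
    struct config *fresh = malloc(sizeof(*fresh));
    struct config *old;

    assert(fresh);
    fresh->width = width;
    fresh->height = height;
    old = atomic_load_explicit(&current_config, memory_order_relaxed);
    atomic_store_explicit(&current_config, fresh, memory_order_release);
    return old;
}

/* Reader: one ordered load, then only reads that go through the pointer. */
static int read_area(void)
{
    struct config *c = atomic_load_explicit(&current_config,
                                            memory_order_acquire);
    return c ? c->width * c->height : -1;
}
```

The updater never modifies a `struct config` after publishing it; each change creates a new copy, which is the rule the text gives for atomic_rcu_set() to be sufficient.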

The following APIs must be used before RCU is used in a thread:

 void rcu_register_thread(void);

    Mark a thread as taking part in the RCU mechanism.  Such a thread
    will have to report quiescent points regularly, either manually
    or through the QemuCond/QemuSemaphore/QemuEvent APIs.

 void rcu_unregister_thread(void);

    Mark a thread as not taking part anymore in the RCU mechanism.
    It is not a problem if such a thread reports quiescent points,
    either manually or by using the QemuCond/QemuSemaphore/QemuEvent
    APIs.

Note that these APIs are relatively heavyweight, and should not be nested.

DIFFERENCES WITH LINUX

RCU PATTERNS

Many patterns using reader-writer locks translate directly to RCU, with the advantages of higher scalability and deadlock immunity.

In general, RCU can be used whenever it is possible to create a new "version" of a data structure every time the updater runs. This may sound like a very strict restriction, however:

Here are some frequently-used RCU idioms that are worth noting.

RCU list processing

TBD (not yet used in QEMU)

RCU reference counting

Because grace periods are not allowed to complete while there is an RCU read-side critical section in progress, the RCU read-side primitives may be used as a restricted reference-counting mechanism. For example, consider the following code fragment:

rcu_read_lock();
p = atomic_rcu_read(&foo);
/* do something with p. */
rcu_read_unlock();

The RCU read-side critical section ensures that the value of "p" remains valid until after the rcu_read_unlock(). In some sense, it is acquiring a reference to p that is later released when the critical section ends. The write side looks simply like this (with appropriate locking):

qemu_mutex_lock(&foo_mutex);
old = foo;
atomic_rcu_set(&foo, new);
qemu_mutex_unlock(&foo_mutex);
synchronize_rcu();
free(old);

If the processing cannot be done purely within the critical section, it is possible to combine this idiom with a "real" reference count:

rcu_read_lock();
p = atomic_rcu_read(&foo);
foo_ref(p);
rcu_read_unlock();
/* do something with p. */
foo_unref(p);

The write side can be like this:

qemu_mutex_lock(&foo_mutex);
old = foo;
atomic_rcu_set(&foo, new);
qemu_mutex_unlock(&foo_mutex);
synchronize_rcu();
foo_unref(old);

or with call_rcu:

qemu_mutex_lock(&foo_mutex);
old = foo;
atomic_rcu_set(&foo, new);
qemu_mutex_unlock(&foo_mutex);
call_rcu(old, foo_unref, rcu);

In both cases, the write side only performs removal. Reclamation happens when the last reference to a "foo" object is dropped. Using synchronize_rcu() is undesirably expensive, because the last reference may be dropped on the read side. Hence you can use call_rcu() instead:

void foo_unref(struct foo *p) {
    if (atomic_fetch_dec(&p->refcount) == 1) {
        call_rcu(p, foo_destroy, rcu);
    }
}
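The "last reference destroys" test relies on the fetch-and-decrement returning the value before the decrement. A minimal sketch with C11 atomics follows, assuming `atomic_fetch_sub` as a stand-in for atomic_fetch_dec; `struct counted` and its helpers are hypothetical, and the destructor only sets a marker where real code would defer the free with call_rcu.

```c
#include <assert.h>
#include <stdatomic.h>

struct counted {
    atomic_int refcount;
    int destroyed;        /* set by the destructor so the demo can observe it */
};

static void counted_destroy(struct counted *p)
{
    p->destroyed = 1;     /* real code would defer the free via call_rcu */
}

static void counted_ref(struct counted *p)
{
    atomic_fetch_add(&p->refcount, 1);
}

/* atomic_fetch_sub returns the value before the decrement, so seeing 1
 * means this call dropped the last reference. */
static void counted_unref(struct counted *p)
{
    int previous = atomic_fetch_sub(&p->refcount, 1);
    assert(previous >= 1);
    if (previous == 1) {
        counted_destroy(p);
    }
}
```

Only one caller can ever observe the value 1, so the destructor runs exactly once even if several threads drop references concurrently.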

Note that the same idioms would be possible with reader/writer locks:

read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
p = foo;                        p = foo;
/* do something with p. */      foo = new;
read_unlock(&foo_rwlock);       write_mutex_unlock(&foo_rwlock);
                                free(p);

------------------------------------------------------------------

read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
p = foo;                        old = foo;
foo_ref(p);                     foo = new;
read_unlock(&foo_rwlock);       foo_unref(old);
/* do something with p. */      write_mutex_unlock(&foo_rwlock);
read_lock(&foo_rwlock);
foo_unref(p);
read_unlock(&foo_rwlock);

foo_unref could use a mechanism such as bottom halves to move deallocation out of the write-side critical section.

RCU resizable arrays

Resizable arrays can be used with RCU. The expensive RCU synchronization (or call_rcu) only needs to take place when the array is resized. The two items to take care of are:

The first problem is avoided simply by not using realloc. Instead, each resize will allocate a new array and copy the old data into it. The second problem would arise if the size and the data pointers were two members of a larger struct:

struct mystuff {
    ...
    int data_size;
    int data_alloc;
    T   *data;
    ...
};

Instead, we store the size of the array with the array itself:

struct arr {
    int size;
    int alloc;
    T   data[];
};
struct arr *global_array;

read side:
    rcu_read_lock();
    struct arr *array = atomic_rcu_read(&global_array);
    x = i < array->size ? array->data[i] : -1;
    rcu_read_unlock();
    return x;

write side (running under a lock):
    if (global_array->size == global_array->alloc) {
        /* Creating a new version.  */
        new_array = g_malloc(sizeof(struct arr) +
                             global_array->alloc * 2 * sizeof(T));
        new_array->size = global_array->size;
        new_array->alloc = global_array->alloc * 2;
        memcpy(new_array->data, global_array->data,
               global_array->alloc * sizeof(T));

        /* Removal phase.  */
        old_array = global_array;
        atomic_rcu_set(&global_array, new_array);
        synchronize_rcu();

        /* Reclamation phase.  */
        free(old_array);
    }
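The "size stored with the array" layout can be exercised without any RCU machinery. The sketch below is single-threaded and ignores memory ordering; it only shows the data-structure side, with the old version handed back to the caller where a real updater would wait for a grace period before freeing it. `struct int_arr` and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct int_arr {
    int size;     /* elements in use */
    int alloc;    /* capacity of data[] */
    int data[];   /* flexible array member: size travels with the data */
};

static struct int_arr *int_arr_new(int alloc)
{
    assert(alloc > 0);
    struct int_arr *a = malloc(sizeof(*a) + (size_t)alloc * sizeof(int));
    assert(a);
    a->size = 0;
    a->alloc = alloc;
    return a;
}

/* Reader: bounds check and element access use the same version. */
static int int_arr_get(const struct int_arr *a, int i)
{
    return (i >= 0 && i < a->size) ? a->data[i] : -1;
}

/* Writer: append in place when there is room, otherwise build a new,
 * larger version. *old_out receives the replaced version (or NULL); a
 * real updater would free it only after a grace period. */
static struct int_arr *int_arr_append(struct int_arr *a, int value,
                                      struct int_arr **old_out)
{
    *old_out = NULL;
    if (a->size == a->alloc) {
        struct int_arr *bigger = int_arr_new(a->alloc * 2);
        memcpy(bigger->data, a->data, (size_t)a->size * sizeof(int));
        bigger->size = a->size;
        *old_out = a;
        a = bigger;
    }
    a->data[a->size] = value;
    a->size++;
    return a;
}
```

A reader that still holds the old version keeps seeing a consistent size and data pair, because the two live in the same allocation and the old allocation is never resized in place.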


cheney-lin commented 6 years ago

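# Memory hotplug flow analysis

Libvirt memory hotplug flow

Whether a device is added through the Python bindings or through the virsh command, the request always goes through qemuDomainAttachDeviceFlags: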

static int qemuDomainAttachDeviceFlags(virDomainPtr dom, const char *xml,
                                       unsigned int flags)
{
    virQEMUDriverPtr driver = dom->conn->privateData;
    virDomainObjPtr vm = NULL;
    virDomainDefPtr vmdef = NULL;
    virDomainDeviceDefPtr dev = NULL, dev_copy = NULL;
    int ret = -1;
    unsigned int parse_flags = VIR_DOMAIN_DEF_PARSE_INACTIVE |
                               VIR_DOMAIN_DEF_PARSE_ABI_UPDATE;
    virQEMUCapsPtr qemuCaps = NULL;
    qemuDomainObjPrivatePtr priv;
    virQEMUDriverConfigPtr cfg = NULL;
    virCapsPtr caps = NULL;

    virCheckFlags(VIR_DOMAIN_AFFECT_LIVE |
                  VIR_DOMAIN_AFFECT_CONFIG, -1);

    virNWFilterReadLockFilterUpdates();

    cfg = virQEMUDriverGetConfig(driver);

    if (!(caps = virQEMUDriverGetCapabilities(driver, false)))
        goto cleanup;

    if (!(vm = qemuDomObjFromDomain(dom)))
        goto cleanup;

    priv = vm->privateData;

    if (virDomainAttachDeviceFlagsEnsureACL(dom->conn, vm->def, flags) < 0)
        goto cleanup;

    if (qemuDomainObjBeginJob(driver, vm, QEMU_JOB_MODIFY) < 0)
        goto cleanup;

    if (virDomainObjUpdateModificationImpact(vm, &flags) < 0)
        goto endjob;

    dev = dev_copy = virDomainDeviceDefParse(xml, vm->def,
                                             caps, driver->xmlopt,
                                             parse_flags);
    if (dev == NULL)
        goto endjob;

    if ((virDomainDeviceType) dev->type == VIR_DOMAIN_DEVICE_DISK) {
        if (dev->data.disk->iothread &&
            virDomainDefCheckDuplicateIOThreadID(vm->def, dev->data.disk)) {
            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
                           _("A disk is already using iothread_id '%u'"),
                           dev->data.disk->iothread);
            goto endjob;
        }
    }

    if (flags & VIR_DOMAIN_AFFECT_CONFIG &&
        flags & VIR_DOMAIN_AFFECT_LIVE) {
        /* If we are affecting both CONFIG and LIVE
         * create a deep copy of device as adding
         * to CONFIG takes one instance.
         */
        dev_copy = virDomainDeviceDefCopy(dev, vm->def, caps, driver->xmlopt);
        if (!dev_copy)
            goto endjob;
    }

    if ((dev->type == VIR_DOMAIN_DEVICE_NET) && (flags & VIR_DOMAIN_AFFECT_LIVE)) {
        if (virDomainDefCheckDuplicateMacAddress(vm->def, dev->data.net)) {
            char mac_address[VIR_MAC_STRING_BUFLEN];
            virMacAddrFormat(&dev->data.net->mac, mac_address);
            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
                           _("the mac address %s is used more than one time"), mac_address);
            goto endjob;
        }
    }

    if (priv->qemuCaps)
        qemuCaps = virObjectRef(priv->qemuCaps);
    else if (!(qemuCaps = virQEMUCapsCacheLookup(driver->qemuCapsCache, vm->def->emulator)))
        goto cleanup;

    if (flags & VIR_DOMAIN_AFFECT_CONFIG) {
        /* Make a copy for updated domain. */
        vmdef = virDomainObjCopyPersistentDef(vm, caps, driver->xmlopt);
        if (!vmdef)
            goto endjob;

        if (dev->type == VIR_DOMAIN_DEVICE_NET) {
            if (virDomainDefCheckDuplicateMacAddress(vmdef, dev->data.net)) {
                char mac_address[VIR_MAC_STRING_BUFLEN];
                virMacAddrFormat(&dev->data.net->mac, mac_address);
                virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
                               _("the mac address %s is used more than one time"), mac_address);
                goto endjob;
            }
        }

        if (virDomainDefCompatibleDevice(vmdef, dev,
                                         VIR_DOMAIN_DEVICE_ACTION_ATTACH) < 0)
            goto endjob;

        if ((ret = qemuDomainAttachDeviceConfig(qemuCaps, vmdef, dev,
                                                dom->conn)) < 0)
            goto endjob;
    }

    if (flags & VIR_DOMAIN_AFFECT_LIVE) {
        if (virDomainDefCompatibleDevice(vm->def, dev_copy,
                                         VIR_DOMAIN_DEVICE_ACTION_ATTACH) < 0)
            goto endjob;

        if ((ret = qemuDomainAttachDeviceLive(vm, dev_copy, dom)) < 0)
            goto endjob;
        /*
         * update domain status forcibly because the domain status may be
         * changed even if we failed to attach the device. For example,
         * a new controller may be created.
         */
        if (virDomainSaveStatus(driver->xmlopt, cfg->stateDir, vm, driver->caps) < 0) {
            ret = -1;
            goto endjob;
        }
    }

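The three parameters of this function were already covered in the interface description in section 4.4. Of these, flags is the critical one, because it determines the subsequent behavior. The discussion below covers the case where xml describes a memory device and flags=3 (VIR_DOMAIN_AFFECT_LIVE | VIR_DOMAIN_AFFECT_CONFIG). The function mainly does the following:

  1. Parse the xml and build the corresponding device object, which carries every attribute described in the xml;
  2. Update the current device xml;
  3. Hot-plug the memory, that is, send QMP messages to QEMU.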
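How the flags value steers the function can be condensed into a few lines of C. This is an illustrative sketch, not libvirt code: the enum is a local mirror of the two libvirt flag values, and `struct attach_plan` and `plan_attach` are hypothetical names.

```c
#include <assert.h>

/* Local mirror of the libvirt flag values used in the text. */
enum {
    AFFECT_LIVE   = 1 << 0,   /* VIR_DOMAIN_AFFECT_LIVE */
    AFFECT_CONFIG = 1 << 1    /* VIR_DOMAIN_AFFECT_CONFIG */
};

struct attach_plan {
    int copy_device;      /* a deep copy is needed when both targets are hit */
    int update_config;    /* the persistent definition gets the device */
    int hotplug_live;     /* the running guest gets the device */
};

static struct attach_plan plan_attach(unsigned int flags)
{
    struct attach_plan plan = {0, 0, 0};

    /* Reject anything other than the two supported bits. */
    assert((flags & ~(unsigned int)(AFFECT_LIVE | AFFECT_CONFIG)) == 0);
    plan.update_config = (flags & AFFECT_CONFIG) != 0;
    plan.hotplug_live = (flags & AFFECT_LIVE) != 0;
    plan.copy_device = plan.update_config && plan.hotplug_live;
    return plan;
}
```

With flags=3 all three fields are set, which corresponds to the path through the function that copies the device definition, updates the persistent configuration, and then hot-plugs into the running guest.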

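QEMU memory hotplug flow

As mentioned above, libvirt sends two QMP commands to QEMU. On the QEMU side they are handled by qmp_object_add and qmp_device_add: the former creates the QOM object, the latter creates the device and initializes it. The analysis below starts from these two functions.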

qmp_object_add

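This creates a new device object (the memory backend device); the dimm device itself is created in the qmp_device_add flow. When the backend type is ram, the call stack is as follows: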

qmp_object_add
  user_creatable_add_type // checks that the type implements the TYPE_USER_CREATABLE interface, that it is not abstract, etc.
    object_new
    object_property_set (sets properties such as size)
    object_property_add_child
    user_creatable_complete
      host_memory_backend_memory_complete(ucc->complete)
        ram_backend_memory_alloc(bc->alloc)
          object_get_canonical_path_component
          memory_region_init_ram
            qemu_ram_alloc
              ram_block_add
                last_ram_offset
                find_ram_offset (finds the physical address range the new memory module can be mapped to)
                qemu_anon_ram_alloc(phys_mem_alloc) (actually allocates the memory for the module)

Next, let's look at object_add, which mainly creates the Host Memory Backend:

void qmp_object_add(const char *type, const char *id,
                    bool has_props, QObject *props, Error **errp)
{
    QDict *pdict;
    Visitor *v;
    Object *obj;

    if (props) {
        pdict = qobject_to_qdict(props);
        if (!pdict) {
            error_setg(errp, QERR_INVALID_PARAMETER_TYPE, "props", "dict");
            return;
        }
        QINCREF(pdict);
    } else {
        pdict = qdict_new();
    }

    v = qobject_input_visitor_new(QOBJECT(pdict), true);
    obj = user_creatable_add_type(type, id, pdict, v, errp);
    visit_free(v);
    if (obj) {
        object_unref(obj);
    }
    QDECREF(pdict);
}

The format of the QMP command shows what these arguments are:

2018-02-27T17:23:13.817362+08:00|info|qemu[5063]|[5063]|do_qmp_dispatch[109]|: qmp_cmd_name: object-add,
arguments: {"qom-type": "memory-backend-ram", "props": {"size": 1073741824}, "id": "memdimm2"}

When huge pages are configured, the memory-backend type is different:

2018-02-14T16:59:05.542215+08:00|info|qemu[17053]|[17053]|do_qmp_dispatch[109]|: qmp_cmd_name: object-add, arguments: {"qom-type": "memory-backend-file", "props": {"share": true, "prealloc": true, "size": 67108864, "mem-path": "/dev/hugepages/libvirt/qemu/11-redhat_7.1"}, "id": "memdimm0"}

user_creatable_complete is what creates the "memory-backend-ram" device. The definitions for this device are in backends/hostmem-ram.c:

#define TYPE_MEMORY_BACKEND_RAM "memory-backend-ram"

static void
ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
{
    char *path;

    if (!backend->size) {
        error_setg(errp, "can't create backend with size 0");
        return;
    }

    path = object_get_canonical_path_component(OBJECT(backend));
    memory_region_init_ram(&backend->mr, OBJECT(backend), path,
                           backend->size, errp);
    g_free(path);
}

static void
ram_backend_class_init(ObjectClass *oc, void *data)
{
    HostMemoryBackendClass *bc = MEMORY_BACKEND_CLASS(oc);

    bc->alloc = ram_backend_memory_alloc;
}

static const TypeInfo ram_backend_info = {
    .name = TYPE_MEMORY_BACKEND_RAM,
    .parent = TYPE_MEMORY_BACKEND,
    .class_init = ram_backend_class_init,
};

static void register_types(void)
{
    type_register_static(&ram_backend_info);
}

type_init(register_types);
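The effect of class_init installing bc->alloc can be modeled with an ordinary function-pointer table. This is a sketch of the dispatch pattern only, not of QOM itself: `struct toy_backend_class`, `toy_ram_alloc` and `toy_backend_complete` are hypothetical names, and the allocation is reduced to setting a marker.

```c
#include <assert.h>
#include <stddef.h>

struct toy_backend;

/* Per-type "class" holding the overridable hook, like bc->alloc. */
struct toy_backend_class {
    const char *name;
    int (*alloc)(struct toy_backend *backend);
};

struct toy_backend {
    const struct toy_backend_class *klass;
    size_t size;
    int allocated;
};

/* RAM flavour of the hook: refuses size 0, as in the code above. */
static int toy_ram_alloc(struct toy_backend *backend)
{
    if (backend->size == 0) {
        return -1;
    }
    backend->allocated = 1;
    return 0;
}

static const struct toy_backend_class toy_ram_class = {
    .name = "memory-backend-ram",
    .alloc = toy_ram_alloc,
};

/* Generic "complete" step: it only knows the hook, not the subtype. */
static int toy_backend_complete(struct toy_backend *backend)
{
    assert(backend && backend->klass);
    return backend->klass->alloc ? backend->klass->alloc(backend) : 0;
}
```

The generic completion code stays the same for every backend type; a file-backed variant would only supply a different `alloc` function in its own class table.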

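First comes object_new. As covered earlier, this is the object's "constructor"; for "memory-backend-ram" it does very little beyond registering the functions of the object's class. Next, object_property_set sets the relevant properties, such as size. The "memory-backend-ram" object is then added as a child to the property hash table of the container object, which establishes the parent-child relationship between the two. (For the relationships between Objects, see https://wiki.qemu.org/Features/QOM)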

void object_property_add_child(Object *obj, const char *name,
                               Object *child, Error **errp)
{
    Error *local_err = NULL;
    gchar *type;
    ObjectProperty *op;

    if (child->parent != NULL) {
        error_setg(errp, "child object is already parented");
        return;
    }

    type = g_strdup_printf("child<%s>", object_get_typename(OBJECT(child)));

    op = object_property_add(obj, name, type, object_get_child_property, NULL,
                             object_finalize_child_property, child, &local_err);
    if (local_err) {
        error_propagate(errp, local_err);
        goto out;
    }

    op->resolve = object_resolve_child_property;
    object_ref(child);
    child->parent = obj;

out:
    g_free(type);
}
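The bookkeeping done by object_property_add_child can be reproduced in a small standalone model. This is a sketch with a fixed-size array in place of the property hash table; `struct toy_object` and `toy_add_child` are hypothetical names and none of this is QOM API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHILDREN 8

struct toy_object {
    const char *type_name;
    struct toy_object *parent;
    int refcount;
    int child_count;
    struct {
        const char *name;
        char type[64];            /* e.g. "child<memory-backend-ram>" */
        struct toy_object *child;
    } children[MAX_CHILDREN];
};

/* Returns 0 on success, -1 if the child already has a parent or the
 * parent has no free slot. */
static int toy_add_child(struct toy_object *obj, const char *name,
                         struct toy_object *child)
{
    assert(obj && name && child);
    if (child->parent != NULL || obj->child_count == MAX_CHILDREN) {
        return -1;
    }
    int slot = obj->child_count++;
    obj->children[slot].name = name;
    snprintf(obj->children[slot].type, sizeof(obj->children[slot].type),
             "child<%s>", child->type_name);
    obj->children[slot].child = child;
    child->refcount++;            /* the property holds a reference */
    child->parent = obj;
    return 0;
}
```

As in the real function, an object can have only one parent, so a second attempt to attach the same child is rejected and leaves the parent unchanged.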

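user_creatable_complete is called next; what actually runs is host_memory_backend_memory_complete. This function performs the real memory allocation for the module and binds the virtual memory allocated by QEMU to the host NUMA nodes. It initializes the mr that corresponds to the backend memory and completes the allocation.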

static void
ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
{
    char *path;

    if (!backend->size) {
        error_setg(errp, "can't create backend with size 0");
        return;
    }

    path = object_get_canonical_path_component(OBJECT(backend));
    memory_region_init_ram(&backend->mr, OBJECT(backend), path,
                           backend->size, errp);
    g_free(path);
}

void memory_region_init_ram(MemoryRegion *mr,
                            Object *owner,
                            const char *name,
                            uint64_t size,
                            Error **errp)
{
    memory_region_init(mr, owner, name, size);
    mr->ram = true;
    mr->terminates = true;
    mr->destructor = memory_region_destructor_ram;
    mr->ram_block = qemu_ram_alloc(size, mr, errp);
    mr->dirty_log_mask = tcg_enabled() ? (1 << DIRTY_MEMORY_CODE) : 0;
}

Allocating the memory:

static
RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
                                  void (*resized)(const char*,
                                                  uint64_t length,
                                                  void *host),
                                  void *host, bool resizeable,
                                  MemoryRegion *mr, Error **errp)
{
    RAMBlock *new_block;
    Error *local_err = NULL;

    size = HOST_PAGE_ALIGN(size);
    max_size = HOST_PAGE_ALIGN(max_size);
    new_block = g_malloc0(sizeof(*new_block));
    new_block->mr = mr;
    new_block->resized = resized;
    new_block->used_length = size;
    new_block->max_length = max_size;
    assert(max_size >= size);
    new_block->fd = -1;
    new_block->page_size = getpagesize();
    new_block->host = host;
    if (host) {
        new_block->flags |= RAM_PREALLOC;
    }
    if (resizeable) {
        new_block->flags |= RAM_RESIZEABLE;
    }
    ram_block_add(new_block, &local_err);
    if (local_err) {
        g_free(new_block);
        error_propagate(errp, local_err);
        return NULL;
    }
    return new_block;
}

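ram_block_add then does the real work: last_ram_offset returns the size of the physical address space currently occupied by RAM, find_ram_offset looks for an address at which the new backend memory can be placed, and qemu_anon_ram_alloc_noreserve performs the actual allocation. If the new memory's address range extends beyond the previous end of that address space, two bitmaps (related to live migration) are updated. This completes the analysis of the object_add flow.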
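
To make the placement step concrete, here is a simplified sketch of the idea behind last_ram_offset and find_ram_offset: round the size up to a page boundary, then choose the smallest gap between existing blocks that is still large enough. ToyBlock and the toy_* functions are hypothetical names for illustration only. The real QEMU functions walk the global RAM block list and differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a RAMBlock. */
typedef struct {
    uint64_t offset;      /* start of the block in the RAM offset space */
    uint64_t max_length;  /* reserved length of the block */
} ToyBlock;

/* Round size up to a multiple of page (page must be a power of two). */
static uint64_t toy_page_align(uint64_t size, uint64_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (size + page - 1) & ~(page - 1);
}

/* End of the highest block, i.e. the size of the space used so far. */
static uint64_t toy_last_offset(const ToyBlock *blocks, size_t n)
{
    uint64_t last = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t end = blocks[i].offset + blocks[i].max_length;
        if (end > last) {
            last = end;
        }
    }
    return last;
}

/*
 * Best fit: for every block, measure the gap between its end and the start
 * of the nearest following block, and return the start of the smallest gap
 * that can still hold `size` bytes.
 */
static uint64_t toy_find_offset(const ToyBlock *blocks, size_t n, uint64_t size)
{
    uint64_t best = UINT64_MAX, best_gap = UINT64_MAX;

    if (n == 0) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        uint64_t end = blocks[i].offset + blocks[i].max_length;
        uint64_t next = UINT64_MAX;

        for (size_t j = 0; j < n; j++) {
            if (blocks[j].offset >= end && blocks[j].offset < next) {
                next = blocks[j].offset;
            }
        }
        if (next - end >= size && next - end < best_gap) {
            best = end;
            best_gap = next - end;
        }
    }
    return best;
}
```

With blocks at [0x0000, 0x1000) and [0x3000, 0x4000), a 0x1000-byte request lands in the hole at 0x1000, while a 0x3000-byte request does not fit there and is placed at 0x4000, which grows the used space and corresponds to the case in which the bitmaps need to be extended.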

qmp_device_add

This command inserts the newly created DIMM device into the virtual machine. The call stack is as follows:

qmp_device_add
  qdev_device_add
    DEVICE(object_new(driver))    create the DIMM device object
      pc_dimm_init
    qemu_opt_foreach(opts, set_property, dev, &err)    set properties, including addr, slot, size, etc.
    object_property_set_bool(OBJECT(dev), true, "realized", &err)
      device_set_realized
        pc_dimm_realize(dc->realize)
        pc_machine_device_plug_cb(hotplug_handler_plug)
          pc_dimm_plug
            pc_dimm_memory_plug
              memory_region_add_subregion
                memory_region_add_subregion_common
                  memory_region_update_container_subregions
                    memory_region_transaction_commit
            piix4_device_plug_cb(hhc->plug)
              acpi_memory_plug_cb
                acpi_send_event
                  piix4_send_gpe
                    acpi_send_gpe_event
                      acpi_update_sci

Before discussing how the DIMM device is created, we first need to understand its device model. The relevant code is in hw/mem/pc-dimm.c

static TypeInfo pc_dimm_info = {
    .name = TYPE_PC_DIMM,
    .parent = TYPE_DEVICE,
    .instance_size = sizeof(PCDIMMDevice),
    .instance_init = pc_dimm_init,
    .class_init = pc_dimm_class_init,
    .class_size = sizeof(PCDIMMDeviceClass),
};

Class initialization:

static void pc_dimm_class_init(ObjectClass *oc, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(oc);
    PCDIMMDeviceClass *ddc = PC_DIMM_CLASS(oc);

    dc->realize = pc_dimm_realize;
    dc->unrealize = pc_dimm_unrealize;
    dc->props = pc_dimm_properties;
    dc->desc = "DIMM memory module";

    ddc->get_memory_region = pc_dimm_get_memory_region;
    ddc->get_vmstate_memory_region = pc_dimm_get_vmstate_memory_region;
}

The properties of the dimm class include addr (physical address), slot (memory slot number), and node (NUMA node number):

static Property pc_dimm_properties[] = {
    DEFINE_PROP_UINT64(PC_DIMM_ADDR_PROP, PCDIMMDevice, addr, 0),
    DEFINE_PROP_UINT32(PC_DIMM_NODE_PROP, PCDIMMDevice, node, 0),
    DEFINE_PROP_INT32(PC_DIMM_SLOT_PROP, PCDIMMDevice, slot,
                      PC_DIMM_UNASSIGNED_SLOT),
    DEFINE_PROP_END_OF_LIST(),
};

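When the object is initialized, the size and memory_backend properties are added. size is the memory size, and memory_backend can be either RAM or a file. In other words, once the DIMM device has been initialized it has 5 properties.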

static void pc_dimm_init(Object *obj)
{
    PCDIMMDevice *dimm = PC_DIMM(obj);

    object_property_add(obj, PC_DIMM_SIZE_PROP, "int", pc_dimm_get_size,
                        NULL, NULL, NULL, &error_abort);
    object_property_add_link(obj, PC_DIMM_MEMDEV_PROP, TYPE_MEMORY_BACKEND,
                             (Object **)&dimm->hostmem,
                             pc_dimm_check_memdev_is_busy,
                             OBJ_PROP_LINK_UNREF_ON_RELEASE,
                             &error_abort);
}

The QMP command arguments for device_add:

2018-02-27T17:23:13.819014+08:00|info|qemu[5063]|[5063]|do_qmp_dispatch[109]|: qmp_cmd_name: device_add, arguments: {"memdev": "memdimm2", "driver": "pc-dimm", "slot": "2", "node": "0", "id": "dimm2"}

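qdev_device_add is the entry point for adding a device. It is called both when devices are created at VM startup and when a device is hot-plugged. Its main job is to create a new device and attach it to the corresponding bus. The following walks through it for the DIMM device.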

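  1. First, parse the command arguments to obtain driver ("pc-dimm"), bus (NULL), and so on;
  2. object_new creates the device;
  3. object_property_add_child adds it to the QOM tree rooted at /peripheral;
  4. qemu_opt_foreach(opts, set_property, dev, &err) sets the parsed arguments as properties of the DIMM device;
  5. object_property_set_bool sets the realized property to activate the device; this is where the memory hotplug actually happens.

Following the callbacks leads to device_set_realized, which mainly calls dc->realize and hotplug_handler_plug. The corresponding callbacks are pc_dimm_realize and pc_machine_device_plug_cb (pc_dimm_plug). The former only performs some error checks; the latter is the important one.
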
static void pc_dimm_plug(HotplugHandler *hotplug_dev,
                         DeviceState *dev, Error **errp)
{
    HotplugHandlerClass *hhc;
    Error *local_err = NULL;
    PCMachineState *pcms = PC_MACHINE(hotplug_dev);
    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
    PCDIMMDevice *dimm = PC_DIMM(dev);
    PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
    MemoryRegion *mr = ddc->get_memory_region(dimm);
    uint64_t align = TARGET_PAGE_SIZE;

    if (memory_region_get_alignment(mr) && pcmc->enforce_aligned_dimm) {
        align = memory_region_get_alignment(mr);
    }

    if (!pcms->acpi_dev) {
        error_setg(&local_err,
                   "memory hotplug is not enabled: missing acpi device");
        goto out;
    }

    pc_dimm_memory_plug(dev, &pcms->hotplug_memory, mr, align, &local_err);
    if (local_err) {
        goto out;
    }

    if (object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM)) {
        nvdimm_plug(&pcms->acpi_nvdimm_state);
    }

    hpms = &pcms->hotplug_memory;
    hhc = HOTPLUG_HANDLER_GET_CLASS(pcms->acpi_dev);
    hhc->plug(HOTPLUG_HANDLER(pcms->acpi_dev), dev, &error_abort);
out:
    error_propagate(errp, local_err);
}
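
  1. memory_region_get_alignment obtains the alignment of the memory region (the default is the target page size);
  2. pc_dimm_memory_plug sets up the DIMM device, updates the guest physical address space topology, interacts with KVM, and so on;
  3. piix4_device_plug_cb (hhc->plug): the ACPI side takes over and raises the hotplug event.

The part we care about most is of course pc_dimm_memory_plug, which does the following:

  1. Computes the DIMM device's physical address and slot number from the current state of the VM;
  2. Adds the new mr, updates the memory view and topology, and pushes the address space change down to KVM through ioctl(KVM_SET_USER_MEMORY_REGION);
  3. Associates the new mr with the guest NUMA node.

memory_region_add_subregion updates the mr topology. It involves a great deal of QEMU-side memory virtualization and is very complex, so it is not covered here; see http://bobao.360.cn/learning/detail/4092.html.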
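
The first job of pc_dimm_memory_plug, choosing an address and a slot, can be sketched as follows. This is a toy model under stated assumptions, not QEMU's implementation: ToyDimm and the toy_* functions are hypothetical, the hotplug region is described by a plain base and size, and the list of already plugged DIMMs is assumed to be sorted by address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an already plugged DIMM. */
typedef struct {
    uint64_t addr;
    uint64_t size;
} ToyDimm;

/* Round v up to a multiple of align (align must be a power of two). */
static uint64_t toy_align_up(uint64_t v, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (v + align - 1) & ~(align - 1);
}

/*
 * Walk the DIMMs (sorted by addr) and push the candidate address past every
 * DIMM it would overlap. Returns 0 and stores the address in *out on
 * success, or -1 if the new DIMM does not fit in the hotplug region.
 */
static int toy_pick_dimm_addr(uint64_t base, uint64_t region_size,
                              const ToyDimm *dimms, size_t n,
                              uint64_t align, uint64_t size, uint64_t *out)
{
    uint64_t addr = toy_align_up(base, align);

    for (size_t i = 0; i < n; i++) {
        uint64_t end = dimms[i].addr + dimms[i].size;
        if (addr < end && dimms[i].addr < addr + size) {
            addr = toy_align_up(end, align);
        }
    }
    if (addr + size > base + region_size) {
        return -1;
    }
    *out = addr;
    return 0;
}

/* Lowest free slot in a bitmask of used slots, or -1 if all are taken. */
static int toy_pick_slot(uint32_t used_mask, int nslots)
{
    assert(nslots >= 0 && nslots <= 32);
    for (int i = 0; i < nslots; i++) {
        if (!((used_mask >> i) & 1u)) {
            return i;
        }
    }
    return -1;
}
```

For example, with slots 0 and 1 in use the next free slot is 2, which matches the "slot": "2" argument in the device_add log shown earlier.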
cheney-lin commented 6 years ago
0x00000 +----------- 0x00000~0x9FFFF 10x64K=640K; conventional memory
       |
       | 1K interrupt vector table, 4 bytes per entry, 256 entries
0x003FF|----------- 
0x00400|----------- 
       | 256-byte BIOS data area
0x004FF|----------
0x00500|---------- 
       | 
       | free memory, except that 0x07C00-0x07DFF (512 bytes) is where the boot loader is loaded
       | 
0x9FFFF|-----------
0xA0000|-----------     
       |2x64K=128K; used as video memory
       |
       |0xa0000-0xaffff EGA/VGA/XGA/XVGA graphics video buffer
       |0xb0000-0xb7fff Mono text video buffer
       |0xb8000-0xbffff CGA/EGA+ chroma text video buffer
       |
0xBFFFF|-----------
0xC0000|----------- 0xC0000~0xFFFFF 4x64K=256K; used by the BIOS, which decides for itself how the addresses are used
       |
       |
       |32K used by the video card BIOS
0xC7FFF|-----------
0xC8000|-----------
       |
       |16K used by the IDE controller BIOS
0xCBFFF|----------
       |
       |
0xF0000|----------
       |
       |
       |64K used by the system BIOS
       |
       |
       |
0xFFFFF|-----------
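
As a small illustration of the map above, the following sketch classifies an address below 1 MiB into the regions drawn in the diagram and computes where an interrupt vector's entry lives (4 bytes per entry). The enum and function names are hypothetical and only encode the boundaries listed in the diagram. The range 0xCC000-0xEFFFF, which the diagram leaves unlabeled, is reported as "other BIOS-managed".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the regions drawn in the diagram. */
typedef enum {
    TOY_IVT,            /* 0x00000-0x003FF interrupt vector table */
    TOY_BDA,            /* 0x00400-0x004FF BIOS data area */
    TOY_FREE,           /* 0x00500-0x9FFFF free conventional memory */
    TOY_VIDEO_RAM,      /* 0xA0000-0xBFFFF video memory */
    TOY_VIDEO_BIOS,     /* 0xC0000-0xC7FFF video card BIOS */
    TOY_IDE_BIOS,       /* 0xC8000-0xCBFFF IDE controller BIOS */
    TOY_BIOS_OTHER,     /* 0xCC000-0xEFFFF not labeled in the diagram */
    TOY_SYSTEM_BIOS,    /* 0xF0000-0xFFFFF system BIOS */
    TOY_ABOVE_1M
} ToyLegacyRegion;

/* Map an address to the region of the legacy map that contains it. */
static ToyLegacyRegion toy_classify(uint32_t addr)
{
    if (addr <= 0x003FF) return TOY_IVT;
    if (addr <= 0x004FF) return TOY_BDA;
    if (addr <= 0x9FFFF) return TOY_FREE;
    if (addr <= 0xBFFFF) return TOY_VIDEO_RAM;
    if (addr <= 0xC7FFF) return TOY_VIDEO_BIOS;
    if (addr <= 0xCBFFF) return TOY_IDE_BIOS;
    if (addr <= 0xEFFFF) return TOY_BIOS_OTHER;
    if (addr <= 0xFFFFF) return TOY_SYSTEM_BIOS;
    return TOY_ABOVE_1M;
}

/* Address of an interrupt vector's entry: 256 entries of 4 bytes each. */
static uint32_t toy_ivt_entry_addr(unsigned vector)
{
    assert(vector < 256);
    return (uint32_t)vector * 4;
}
```

The boot loader area at 0x07C00 falls into the free conventional memory region, and the last vector (255) sits at 0x3FC, just below the end of the 1K table.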

  0000000000000000-000000000009ffff (prio 0, ram): pc.ram
  00000000000a0000-00000000000affff (prio 1, ram): vga.vram
  00000000000b0000-00000000000bffff (prio 0, i/o): cirrus-low-memory @0000000000010000
  00000000000c0000-00000000000c9fff (prio 0, rom): pc.ram @00000000000c0000
  00000000000ca000-00000000000ccfff (prio 0, ram): pc.ram @00000000000ca000
  00000000000cd000-00000000000e7fff (prio 0, rom): pc.ram @00000000000cd000
  00000000000e8000-00000000000effff (prio 0, ram): pc.ram @00000000000e8000
  00000000000f0000-00000000000fffff (prio 0, rom): pc.ram @00000000000f0000
  0000000000100000-000000001fffffff (prio 0, ram): pc.ram @0000000000100000
  00000000fc000000-00000000fc3fffff (prio 1, ram): vga.vram
  00000000fd000000-00000000fd3fffff (prio 0, i/o): cirrus-bitblt-mmio
  00000000fe000000-00000000fe000fff (prio 0, i/o): virtio-pci-common
  00000000fe001000-00000000fe001fff (prio 0, i/o): virtio-pci-isr
  00000000fe002000-00000000fe002fff (prio 0, i/o): virtio-pci-device
  00000000fe003000-00000000fe003fff (prio 0, i/o): virtio-pci-notify
  00000000febd0000-00000000febd0fff (prio 1, i/o): cirrus-mmio
  00000000fec00000-00000000fec00fff (prio 0, i/o): ioapic
  00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
  00000000fee00000-00000000feefffff (prio 4096, i/o): apic-msi
  00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
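
Each line of the dump above has the form "start-end (prio N, kind): name [@offset]", with an inclusive end address. A minimal sketch for reading such a line and computing the size of the range follows. ToyFlatLine and the toy_* functions are hypothetical helpers written for this note, and the format string is inferred only from the lines shown here.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical holder for one parsed line of the dump. */
typedef struct {
    uint64_t start, end;   /* inclusive guest-physical range */
    int prio;
    char kind[16];         /* "ram", "rom" or "i/o" */
    char name[64];
    int has_offset;        /* 1 if an "@offset" suffix was present */
    uint64_t offset;       /* offset into the region, if printed */
} ToyFlatLine;

/* Parse one line; returns 0 on success, -1 if the line does not match. */
static int toy_parse_flat_line(const char *line, ToyFlatLine *out)
{
    int n;

    assert(line != NULL && out != NULL);
    out->offset = 0;
    n = sscanf(line,
               " %" SCNx64 "-%" SCNx64 " (prio %d, %15[^)]): %63s @%" SCNx64,
               &out->start, &out->end, &out->prio,
               out->kind, out->name, &out->offset);
    if (n < 5) {
        return -1;
    }
    out->has_offset = (n == 6);
    return 0;
}

/* Size in bytes of the inclusive range. */
static uint64_t toy_flat_size(const ToyFlatLine *l)
{
    return l->end - l->start + 1;
}
```

Applied to the pc.ram line that starts at 0x100000, the size comes out as 0x1ff00000 bytes (511 MiB), and the @offset equals the start address because that slice of pc.ram is mapped at its own offset.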