aws / aws-fpga

Official repository of the AWS EC2 FPGA Hardware and Software Development Kit

How to issue a PCIe FLR to CL #652

Open ns-intusurg opened 1 month ago

ns-intusurg commented 1 month ago

Hi,

What is the runtime procedure to issue a PCIe Function Level Reset (sh_cl_flr_assert) to the CL?

I found the HDK's tb.issue_flr() command used for simulation, but I couldn't find any SDK runtime equivalent C function in the repo or in the documentation.

Thanks

AWSjoeluc commented 1 month ago

Hello! Thanks for reaching out with your question. I assume you've found mention of FLR in the documentation here: https://github.com/HFTrader/aws-fpga/blob/master/hdk/docs/AWS_Shell_Interface_Specification.md#function-level-reset-flr

Linux exposes a per-function reset through /sys/bus/pci/devices/$BDF/reset, where $BDF is the bus:device.function address of the targeted function. To trigger an FLR, you can try either of the following commands (the first must be run from a root shell):

echo 1 > /sys/bus/pci/devices/$BDF/reset

    OR

echo 1 | sudo tee -a /sys/bus/pci/devices/$BDF/reset
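
For anyone who wants to do the same write from a program rather than a shell, here is a minimal sketch in Python. It is not an SDK function: `issue_pci_reset` and its `sysfs_root` parameter are hypothetical names chosen for illustration, and the only thing it does is write `1` to the same sysfs file as the commands above. It assumes the process has root privileges and that the kernel exposes a `reset` attribute for the function; which reset mechanism the kernel then uses is the kernel's choice, so this is not guaranteed to be an FLR.

```python
import os

SYSFS_PCI_DEVICES = "/sys/bus/pci/devices"


def issue_pci_reset(bdf: str, sysfs_root: str = SYSFS_PCI_DEVICES) -> None:
    """Ask the kernel to reset one PCI function via its sysfs 'reset' file.

    bdf is the full address, e.g. "0000:00:1d.0". The sysfs_root parameter
    is a hypothetical hook so the function can be pointed at a test tree.
    """
    reset_path = os.path.join(sysfs_root, bdf, "reset")
    if not os.path.exists(reset_path):
        # The kernel only creates this file when it has a usable reset
        # method for the function.
        raise FileNotFoundError(f"no reset attribute for {bdf}: {reset_path}")
    with open(reset_path, "w") as f:
        f.write("1")
```

The same open-and-write sequence can be done from C with `open()` and `write()` on that path; the sketch raises `FileNotFoundError` when the attribute is missing so the caller can tell "reset not available" apart from a permission problem.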

ns-intusurg commented 1 month ago

I don't see "reset" listed under the PCI device directory, and I get a "No such file or directory" error when I try to write to it.

I'm targeting the following device path which is used during the test: /sys/devices/pci0000:00/0000:00:1d.0

Here are the results of "ls -la" under that path:

.. uevent . vendor subsystem -> ../../../bus/pci xdma driver -> ../../../bus/pci/drivers/xdma subsystem_vendor subsystem_device device resource4_wc resource4 resource2_wc resource2 resource1 resource0 revision resource rescan remove power_state power numa_node msi_irqs msi_bus modalias max_link_width max_link_speed local_cpus local_cpulist link irq firmware_node -> ../../LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:ea enable driver_override dma_mask_bits d3cold_allowed current_link_width current_link_speed consistent_dma_mask_bits config class broken_parity_status ari_enabled

AWSjoeluc commented 1 month ago

That's unexpected. Can you share what instance size and AMI you're using? What's the result of lspci -d 1d0f: -vv?

ns-intusurg commented 1 month ago

I'll have to ask IT about the instance size and AMI, as they set everything up and I don't have access to the Amazon admin account.

sudo lspci -d 1d0f: -vv

00:03.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
    Physical Slot: 3
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
    Latency: 0
    Region 0: Memory at 85610000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed unknown, Width x0, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown (ok), Width x0 (ok)
            TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
            10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
            AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
            EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Kernel driver in use: ena
    Kernel modules: ena

00:1c.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express])
    Physical Slot: 28
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 35
    NUMA node: 0
    Region 0: Memory at 85614000 (64-bit, non-prefetchable) [size=16K]
    Region 2: Memory at 85620000 (64-bit, prefetchable) [size=8K]
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed unknown, Width x0, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown (ok), Width x0 (ok)
            TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
            10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
            AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
            EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [b0] MSI-X: Enable+ Count=32 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Kernel driver in use: nvme
    Kernel modules: nvme

00:1d.0 Memory controller: Amazon.com, Inc. Device f000
    Subsystem: Device fedd:1d51
    Physical Slot: 29
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
    Latency: 0
    Region 0: Memory at 82000000 (32-bit, non-prefetchable) [size=32M]
    Region 1: Memory at 85400000 (32-bit, non-prefetchable) [size=2M]
    Region 2: Memory at 85600000 (64-bit, prefetchable) [size=64K]
    Region 4: Memory at 2000000000 (64-bit, prefetchable) [size=128G]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
        Vector table: BAR=2 offset=00008000
        PBA: BAR=2 offset=00008fe0
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x16 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
            10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
            AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
            EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Kernel driver in use: xdma
    Kernel modules: xdma

00:1e.0 Memory controller: Amazon.com, Inc. Device 1041
    Subsystem: Xilinx Corporation Device 0007
    Physical Slot: 30
    Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
    Region 0: Memory at 85618000 (64-bit, prefetchable) [size=16K]
    Region 2: Memory at 8561c000 (64-bit, prefetchable) [size=16K]
    Region 4: Memory at 85000000 (64-bit, prefetchable) [size=4M]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x16 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
            10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
            AtomicOpsCtl: ReqEn-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
            EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
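
Every function in the output above reports `FLReset-` in its DevCap field, which means the function does not advertise FLR support in its PCIe capability. That may be related to the missing `reset` file, although the kernel can also use other reset methods, so it is not conclusive on its own. As a rough sketch of how the same flag can be read programmatically, the Python below walks the capability list in a raw config-space dump and tests bit 28 of the Device Capabilities register. The function name `flr_capable` is hypothetical; the register offsets and the bit position come from the PCI and PCIe specifications.

```python
import struct
from typing import Optional

PCI_STATUS = 0x06              # Status register offset
PCI_STATUS_CAP_LIST = 0x10     # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34     # Offset of the first capability pointer
PCI_CAP_ID_EXP = 0x10          # Capability ID of the PCI Express capability
PCI_EXP_DEVCAP = 0x04          # Device Capabilities, relative to the capability
PCI_EXP_DEVCAP_FLR = 1 << 28   # Function Level Reset Capability bit


def flr_capable(config: bytes) -> Optional[bool]:
    """Return True/False for the FLR capability bit, or None if the PCIe
    capability cannot be found in the bytes provided."""
    if len(config) < 0x40:
        return None
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    ptr = config[PCI_CAPABILITY_LIST] & 0xFC
    for _ in range(48):  # bound the walk in case the list is malformed
        if ptr < 0x40 or ptr + PCI_EXP_DEVCAP + 4 > len(config):
            return None
        cap_id, next_ptr = config[ptr], config[ptr + 1]
        if cap_id == PCI_CAP_ID_EXP:
            (devcap,) = struct.unpack_from("<I", config, ptr + PCI_EXP_DEVCAP)
            return bool(devcap & PCI_EXP_DEVCAP_FLR)
        ptr = next_ptr & 0xFC
    return None
```

On a live system the input would be the contents of `/sys/bus/pci/devices/$BDF/config` opened in binary mode. Without root, sysfs returns only the first 64 bytes of config space, in which case the function returns `None` rather than guessing.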

AWSjoeluc commented 1 month ago

Great, thank you. A uname -a would also help in place of the full AMI ID (unless the kernel data contains sensitive information).

ns-intusurg commented 1 month ago

Waiting for the reply from IT. Here's the command minus the network node hostname:

uname -a
Linux ######## 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Mar 14 14:20:09 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux

AWSjoeluc commented 1 month ago

I'm currently investigating this behavior internally, and I hope to have a response by the end of this week. Thank you for your patience!

ns-intusurg commented 1 month ago

Just in case you still needed this info about our setup:

instance size - f1.2xlarge
ami - RHEL-8.7.0_HVM-20230330-x86_64-56-Hourly2-GP2