rh-codebase opened this issue 4 years ago
I've verified that the CUDA software packages downloaded from NVIDIA's website are all that is needed to get CUDA/GPU running on the host and in LXD containers. The root issue before was that the NVIDIA drivers and support daemons were not running on the host (or being loaded on boot). When installing the CUDA software, the installer output mentions rebooting the machine in order to get the drivers loaded. The other way (discovered accidentally) is to run nvidia-smi on the host after a fresh install. @VR-DSA reported that on lxd110master (with older CUDA software) the kernel modules are not loaded on reboot. It doesn't look like the CUDA software on lxd110master was installed via a package:
```
dsa@lxd110master:~$ apt show cuda
N: Unable to locate package cuda
N: Unable to locate package cuda
E: No packages found
```
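For reference, a quick way to check whether the driver stack is actually up on a host (standard commands; the persistence daemon may or may not be installed on a given machine):

```
lsmod | grep nvidia                    # are the kernel modules loaded?
nvidia-smi                             # talks to the driver; also triggers module load after a fresh install
systemctl status nvidia-persistenced   # support daemon, if installed
```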
After the CUDA software on the host is installed, launching an LXD container with a profile set up to add the GPU device and install CUDA works as well.
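A minimal sketch of such a profile, using LXD's `gpu` device type (the profile and container names here are illustrative, not the actual ones used):

```
lxc profile create gpu-cuda                               # hypothetical profile name
lxc profile device add gpu-cuda gpu0 gpu                  # pass the host GPUs through
lxc launch ubuntu:18.04 corr-test -p default -p gpu-cuda  # container then installs CUDA itself
```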
Rebooting the main host does load the NVIDIA drivers and necessary daemon processes, so all is working as it should now.
And MAAS is now able to provision all of this automagically, but I will be testing with more servers soon.
CUDA software being used on the host:

```
ubuntu@lxd110h0:~$ apt show cuda
Package: cuda
Version: 10.2.89-1
```
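For the record, once NVIDIA's apt repo for Ubuntu 18.04 has been added (per their install instructions), the host install reduces to something like the following sketch:

```
sudo apt update
sudo apt install cuda   # meta-package: pulls in driver + toolkit
sudo reboot             # or run nvidia-smi once to load the modules
```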
Here is a preliminary list of packages and their sources that need to be installed on the correlator container. There are also some configuration steps that need to be taken. Some work needs to be done to make the below a reality!
*** Configuration steps:
Change the MTU to 9000 bytes, following https://djanotes.blogspot.com/2018/01/netplan-setting-mtu-for-bridge-devices.html . This requires work on the netplan of the physical (18.04) machine, not the container.
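A minimal netplan sketch of what that looks like, per the link above (the interface and bridge names are assumptions; adjust to the host's actual devices):

```
# /etc/netplan/50-bridge.yaml -- sketch only
network:
  version: 2
  ethernets:
    enp1s0:               # assumed physical NIC name
      mtu: 9000
  bridges:
    br0:                  # assumed bridge name
      interfaces: [enp1s0]
      mtu: 9000
```

followed by `sudo netplan apply`.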
Place the following in /etc/sysctl.conf on the physical machine as well as in the container to enable faster packet capture:

```
net.ipv4.conf.all.rp_filter = 0
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.core.netdev_max_backlog = 250000
net.core.wmem_max = 536870912
net.core.rmem_max = 536870912
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_low_latency = 1
```
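These can be applied without a reboot and spot-checked afterwards:

```
sudo sysctl -p /etc/sysctl.conf                   # load the new settings
sysctl net.core.rmem_max net.ipv4.tcp_low_latency # verify a couple of values
```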
(may also need to adjust /etc/security/limits.conf - TBD)
*** Installs of debs from standard repositories:
*** Python
```
conda env create -f environment.yml
```

using the environment.yml file in the dsa110-calib repo

*** Installs from specific sources
*** Installs of code repos that we've modified (and so should be on our github - TBD)
An update to this issue: I've put some code in branch v0.9 of dsa110-xengine, which can be used to test compilation on the correlator container. We also need to find mbheimdall.
One thing to check is that the xgpu and psrdada versions match those on the dsa110master.ovro.pvt container.
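For reference, the compilation test on the container amounts to something like this sketch (the GitHub org path is an assumption):

```
git clone https://github.com/dsa110/dsa110-xengine.git
cd dsa110-xengine
git checkout v0.9
cd src
make
```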
Compilation failed, since xgpu and psrdada are not yet in our repos and thus not installed in the container when provisioned. The TBD in the above comment sounds like they should be. Taking a look at dsa110master.
xGPU on dsa110master has a small change to the Makefile. Creating dsa110-xGPU to track local changes, moving the remote origin to upstream (as is typical convention), and creating a new origin pointing to dsa110 on GitHub. To pull in future changes from upstream: git fetch upstream, then rebase our changes on top of upstream's latest. Then see if this all builds in the container. @VR-DSA Do you want this built on the container's host as well?
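The remote shuffle described above is roughly the following (the repo path is an assumption):

```
git remote rename origin upstream
git remote add origin git@github.com:dsa110/dsa110-xGPU.git  # assumed repo path
git push -u origin master

# later, to pick up upstream changes:
git fetch upstream
git rebase upstream/master
```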
Thanks Rick! xGPU only needs to be on the container, not the host.
Compile error in container:

```
ubuntu@corr00:~/proj/dsa110-shell/dsa110-xGPU/src$ make
nvcc fatal : Value 'sm_30' is not defined for option 'gpu-architecture'
Makefile:199: recipe for target 'cuda_correlator.o' failed
```
The container sees the following GPUs:

```
ubuntu@corr00:~/proj/dsa110-shell/dsa110-xGPU/src$ nvidia-debugdump --list
Found 2 NVIDIA devices
	Device ID:       0
	Device name:     GeForce RTX 2080 Ti
	GPU internal ID: GPU-cb4ab609-d0de-fab1-b368-f4b448a91763

	Device ID:       1
	Device name:     GeForce RTX 2080 Ti
	GPU internal ID: GPU-c71d7823-e96c-ce09-6bfd-ff52f6443ab8
```
These are the same cards as on lxd110master (the host for dsa110master). Hmmm. This link is useful: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ . Changed CUDA_ARCH in the Makefile to sm_75 as per the link above (the RTX 2080 Ti is a Turing card, compute capability 7.5). Compile succeeds. I don't understand how this compiled on dsa110master unless a change wasn't committed? Ideas?
Easy :). "make" won't cut it.
```
make CUDA_ARCH=sm_75 NPOL=2 NSTATION=64 NFREQUENCY=768 NTIME_PIPE=128 NTIME=2048
```
After this succeeds, the output of the command "xgpuinfo" (in the src dir) should be exactly:

```
xGPU library version: 2.0.0+dirty
Number of polarizations: 2
Number of stations: 64
Number of baselines: 2080
Number of frequencies: 768
Number of time samples per GPU integration: 2048
Number of time samples per transfer to GPU: 128
Type of ComplexInput components: 8 bit integers
Type of computation: FP32 multiply, FP32 accumulate
Number of ComplexInput elements in GPU input vector: 201326592
Number of ComplexInput elements per transfer to GPU: 12582912
Number of Complex elements in GPU output vector: 6389760
Number of Complex elements in reordered output vector: 6389760
Output matrix order: triangular
Shared atomic transfer size: 4
Complex block size: 1
```
Indeed it does.

```
ubuntu@corr00:~/proj/dsa110-shell/dsa110-xGPU/src$ ./xgpuinfo
xGPU library version: 2.0.0+1@gf582271-dirty
Number of polarizations: 2
Number of stations: 64
Number of baselines: 2080
Number of frequencies: 768
Number of time samples per GPU integration: 2048
Number of time samples per transfer to GPU: 128
Type of ComplexInput components: 8 bit integers
Type of computation: FP32 multiply, FP32 accumulate
Number of ComplexInput elements in GPU input vector: 201326592
Number of ComplexInput elements per transfer to GPU: 12582912
Number of Complex elements in GPU output vector: 6389760
Number of Complex elements in reordered output vector: 6389760
Output matrix order: triangular
Shared atomic transfer size: 4
Complex block size: 1
```
Since we are modifying the Makefile for DSA110, is there a reason not to include these variables so that make just works? Or do you play with these numbers often?
Well, they will be modified at some point, but not for the next several months. Up to you!
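For what it's worth, one way to make a bare `make` build the DSA-110 configuration while keeping the command-line overrides working is `?=` defaults in the Makefile (a sketch; variable names taken from the build command above):

```
# Overridable defaults for the DSA-110 build
CUDA_ARCH  ?= sm_75
NPOL       ?= 2
NSTATION   ?= 64
NFREQUENCY ?= 768
NTIME_PIPE ?= 128
NTIME      ?= 2048
```

Command-line assignments such as `make CUDA_ARCH=sm_80` still override these when the numbers do eventually change.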
Makes more sense to me to track these in the dsa110-xGPU repo instead of the maas repo, so I'll update the repo. dsa110-xGPU is now successfully building and installing (/usr/local) in the containers:
```
root@corr00:/home/ubuntu/proj/dsa110-shell/dsa110-xGPU/src# xgpuinfo
xGPU library version: 2.0.0+1@gf582271
Number of polarizations: 2
Number of stations: 64
Number of baselines: 2080
Number of frequencies: 768
Number of time samples per GPU integration: 2048
Number of time samples per transfer to GPU: 128
Type of ComplexInput components: 8 bit integers
Type of computation: FP32 multiply, FP32 accumulate
Number of ComplexInput elements in GPU input vector: 201326592
Number of ComplexInput elements per transfer to GPU: 12582912
Number of Complex elements in GPU output vector: 6389760
Number of Complex elements in reordered output vector: 6389760
Output matrix order: triangular
Shared atomic transfer size: 4
Complex block size: 1
```
PSRDADA is next to be integrated into our repo and built. Most likely there will be changes to the Makefile in dsa110-xengine to look for include directories in these newly installed areas.
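The Makefile change will presumably be along these lines (a sketch; the flag and library variable names are assumptions based on where xGPU and PSRDADA install under /usr/local):

```
# Point the xengine build at the /usr/local installs
CFLAGS  += -I/usr/local/include
LDFLAGS += -L/usr/local/lib
LIBS    += -lxgpu -lpsrdada
```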
dsa110-psrdada compiles in the container, but dsa110-xengine is failing to link: it can't find sigproc. On dsa110master, /usr/local/sigproc is a git repo with local modifications, but psrdada also contains a sigproc dir. Which one to use? If /usr/local/sigproc, then I'll create a dsa110-sigproc and go from there. Also, CUDA is installed via the apt package manager. On dsa110master version 10.2 was used; in the container I believe 11.2 is used. Will this matter?
On dsa110master, in the /usr/local/sigproc/src repo, there's an untracked file fake_frb.c which looks to be a copy of fake.c but with 2 line changes. Should this file be tracked? There's no corresponding .o.
Interesting! The sigproc that should be used is /usr/local/sigproc on dsa110master. The version of cuda is an interesting one. I don’t think it should matter, but there’s a good chance it will. But in any case it’s best to use 11.2 rather than 10.2, so if something breaks let me know and I can try to fix it.
fake_frb.c should probably be tracked as well in sigproc.
Thanks for sorting through all this!
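Tracking the file should just be a commit in the existing /usr/local/sigproc repo (a sketch; the commit message is illustrative):

```
cd /usr/local/sigproc
git add src/fake_frb.c
git commit -m "Track fake_frb.c (fake.c with FRB-specific tweaks)"
```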
I have a working container install of the dsa110-xengine repo.

```
root@corr00:/home/ubuntu/proj/dsa110-shell/dsa110-xengine/src# ls -l | grep rwx
-rwxr-xr-x 1 ubuntu ubuntu  289336 Jul 24 20:13 dsaX_beamformer
-rwxr-xr-x 1 ubuntu ubuntu  297872 Jul 24 20:14 dsaX_capture
-rwxr-xr-x 1 ubuntu ubuntu  225160 Jul 24 20:14 dsaX_dbnic
-rwxr-xr-x 1 ubuntu ubuntu  218424 Jul 24 20:14 dsaX_fake
-rwxr-xr-x 1 ubuntu ubuntu  228888 Jul 24 20:14 dsaX_nicdb
-rwxr-xr-x 1 ubuntu ubuntu  247208 Jul 24 20:14 dsaX_reorder_raw
-rwxr-xr-x 1 ubuntu ubuntu  225048 Jul 24 20:13 dsaX_split
-rwxr-xr-x 1 ubuntu ubuntu  235816 Jul 24 20:14 dsaX_writeFil
-rwxr-xr-x 1 ubuntu ubuntu 4959488 Jul 24 20:14 dsaX_writevis
-rwxr-xr-x 1 ubuntu ubuntu  243208 Jul 24 20:13 dsaX_xgpu
```
The Makefile at v1.0.0 depends on how the support code in the other dsa110 repos is installed. There's a shell script that MAAS uses to install these repos, so using that script for local dev is one way forward, but not the only way. If changes are made to these dsa110-specific repos and the version bumped, then the MAAS config needs to be kept in lock-step with those changes.
@rh-codebase this is excellent! You'll have to train us up on editing the MAAS config at some point. One thing - did you end up incorporating the network configuration stuff (e.g., MTU, sysctl.conf)?
Yep, did the MTU/sysctl stuff about a week ago. I'll look to see if MAAS reports these settings. I'm having an issue with one of the servers, so I'm trying to solve that one first; I hope to start migrating containers over soon. A still-big unknown is integrating the LabJacks. It should be straightforward, but I can't commit to it being a 1-hour endeavor. I'm actually not sure we even need names for them in MAAS, since I believe James has a scheme where they know who they are and report their monitor data correctly. MAAS reports the servers' copper and fiber at 10G. I thought the copper Arista switch reported 1G, so I'll double-check that. I haven't done any benchmarking yet on bandwidth.
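A quick way to confirm on a host or container that the MTU and sysctl settings took effect (the bridge name here is an assumption):

```
ip link show br0 | grep -o 'mtu [0-9]*'            # assumed bridge name; expect mtu 9000
sysctl net.core.rmem_max net.ipv4.tcp_low_latency  # expect the values from sysctl.conf
```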
Nice! Hope the LabJacks don't take too much time. Also, don't worry about that benchmarking of the 10G - we can deal with that down the track when it becomes essential. Something that's more urgent is getting the SNAPs to receive DHCP from MAAS. I'll send an e-mail to you and Mark about this.
The LabJacks, if they DHCP properly, should take no time. James confirmed he does not need DNS names for them, so as long as MAAS gives them an IP and they accept it, there should be no issues. Mark helped me assign the first 7 SNAPs to DNS names, so we have snap00 - snap06. One of the SNAPs fails to ask for a DHCP address; I believe we saw this in the bldg 7 lab environment as well, and Mark has set it aside for you to re-program. I finally got lxd110h01 to boot up like the others. I discovered I need to tweak some of the install scripts for the other nodes: h00 is special, and there are parts of it I need to transfer over to the more generic curtin config files for the rest of the nodes. Then copy and build the support containers (etcd, influx, grafana, wx). Then the antenna/beb LJs can be hooked up.
Software and config needed for the correlator container.