dsa110 / dsa110-issues

Issue tracker for all DSA-110 work

Correlator container software and config. #5

Open rh-codebase opened 4 years ago

rh-codebase commented 4 years ago

Software and config. needed for the correlator container.

rh-codebase commented 4 years ago

I've verified that the CUDA software packages downloaded from Nvidia's website are all that is needed to get CUDA/GPU support running on the host and in LXD containers. The root issue before was that the nvidia drivers and support daemons were not running on the host (or being loaded on boot). When installing the CUDA software, the output mentions rebooting the machine in order to get the drivers loaded. The other way (discovered accidentally) is to run nvidia-smi on the host after a fresh install. @VR-DSA reported that on lxd110master (with older CUDA software) the kernel modules are not loaded on reboot. It doesn't look like the CUDA software on lxd110master was installed via a package:
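If the modules still fail to load at boot, one conventional fix (an assumption on my part, not something the CUDA installer necessarily sets up) is to list them in /etc/modules-load.d so systemd loads them early:

```
# /etc/modules-load.d/nvidia.conf  (hypothetical file; module names per the standard NVIDIA driver)
nvidia
nvidia_uvm
nvidia_drm
nvidia_modeset
```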

```
dsa@lxd110master:~$ apt show cuda
N: Unable to locate package cuda
N: Unable to locate package cuda
E: No packages found
```

After the CUDA software on the host is installed, launching an LXD container with a profile set up to add the GPU device and install CUDA works as well.
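For reference, a minimal sketch of such a profile (the profile and device names are my own; `nvidia.runtime` asks LXD to expose the host's NVIDIA driver inside the container):

```yaml
# lxc profile show gpu-cuda  (hypothetical profile name)
name: gpu-cuda
config:
  nvidia.runtime: "true"   # pass the host NVIDIA driver userspace into the container
devices:
  gpu0:
    type: gpu              # pass through the host GPU(s)
```

Applied with something like `lxc launch ubuntu:18.04 corr-test -p default -p gpu-cuda`.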

Rebooting the main host does load the nvidia drivers and necessary daemon processes, so all is working as it should now.

MAAS is now able to provision all of this automatically, but I'll be testing with more servers soon.

CUDA software being used on the host:

```
ubuntu@lxd110h0:~$ apt show cuda
Package: cuda
Version: 10.2.89-1
```

VR-DSA commented 4 years ago

Here is a preliminary list of packages and their sources that need to be installed on the correlator container. There are also some configuration steps that need to be taken. Some work needs to be done to make the below a reality!

*** Configuration steps:

*** Installs from standard repositories of debs:

*** Python

*** Installs from specific sources

*** Installs of code repos that we've modified (and so should be on our github - TBD)

VR-DSA commented 4 years ago

An update to this issue: I've put some code in branch v0.9 of dsa110-xengine, which can be used to test compilation on the correlator container. We also need to find mbheimdall.

One thing to check is that the xgpu and psrdada versions match those on the dsa110master.ovro.pvt container.

rh-codebase commented 4 years ago

Compilation failed since xgpu and psrdada are not yet in our repos, and thus not installed in the container when it is provisioned. The TBD in the above comment suggests they should be. Taking a look at dsa110master.

rh-codebase commented 4 years ago

xGPU on dsa110master has a small change to the Makefile. Creating dsa110-xGPU to track local changes, moving the remote origin to upstream (as is typical convention), and creating a new origin pointing to dsa110 on GitHub. To pull in future changes from upstream: git fetch upstream, then rebase our changes on top of upstream's latest work. Then see if this all builds in the container. @VR-DSA Do you want this built on the container's host as well?
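The remote shuffle described above might look like this (the repo URL and branch name are assumptions on my part):

```shell
# Inside the cloned xGPU working copy:
git remote rename origin upstream                      # original xGPU remote becomes "upstream"
git remote add origin git@github.com:dsa110/dsa110-xGPU.git
git push -u origin master                              # publish our fork with the Makefile change

# Later, to pick up upstream changes and replay ours on top:
git fetch upstream
git rebase upstream/master
```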

VR-DSA commented 4 years ago

Thanks Rick! xGPU only needs to be on the container, not the host.


rh-codebase commented 4 years ago

Compile error in the container:

```
ubuntu@corr00:~/proj/dsa110-shell/dsa110-xGPU/src$ make
nvcc fatal   : Value 'sm_30' is not defined for option 'gpu-architecture'
Makefile:199: recipe for target 'cuda_correlator.o' failed
```

The container sees the following GPUs:

```
ubuntu@corr00:~/proj/dsa110-shell/dsa110-xGPU/src$ nvidia-debugdump --list
Found 2 NVIDIA devices
Device ID:              0
Device name:            GeForce RTX 2080 Ti
GPU internal ID:        GPU-cb4ab609-d0de-fab1-b368-f4b448a91763

Device ID:              1
Device name:            GeForce RTX 2080 Ti
GPU internal ID:        GPU-c71d7823-e96c-ce09-6bfd-ff52f6443ab8
```

These are the same cards as on lxd110master (the host for dsa110master). Hmmm. This link is useful: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/. Changed CUDA_ARCH in the Makefile to sm_75 as per the link above, and the compile succeeds. I don't understand how this compiled on dsa110master unless a change wasn't committed? Ideas?

VR-DSA commented 4 years ago

Easy :). "make" won't cut it.

```
make CUDA_ARCH=sm_75 NPOL=2 NSTATION=64 NFREQUENCY=768 NTIME_PIPE=128 NTIME=2048
```

After this succeeds, the output of the command "xgpuinfo" (in the src dir) should be exactly:

```
xGPU library version: 2.0.0+dirty
Number of polarizations: 2
Number of stations: 64
Number of baselines: 2080
Number of frequencies: 768
Number of time samples per GPU integration: 2048
Number of time samples per transfer to GPU: 128
Type of ComplexInput components: 8 bit integers
Type of computation: FP32 multiply, FP32 accumulate
Number of ComplexInput elements in GPU input vector: 201326592
Number of ComplexInput elements per transfer to GPU: 12582912
Number of Complex elements in GPU output vector: 6389760
Number of Complex elements in reordered output vector: 6389760
Output matrix order: triangular
Shared atomic transfer size: 4
Complex block size: 1
```
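As a sanity check, the size-related numbers in that output follow directly from the build flags; shell arithmetic reproduces them (the formulas are my reading of the xgpuinfo labels, not the library's definitions):

```shell
NPOL=2 NSTATION=64 NFREQUENCY=768 NTIME=2048 NTIME_PIPE=128
echo "baselines:          $(( NSTATION * (NSTATION + 1) / 2 ))"                               # 2080
echo "input elements:     $(( NSTATION * NPOL * NFREQUENCY * NTIME ))"                        # 201326592
echo "elems per transfer: $(( NSTATION * NPOL * NFREQUENCY * NTIME_PIPE ))"                   # 12582912
echo "output elements:    $(( (NSTATION * (NSTATION + 1) / 2) * NPOL * NPOL * NFREQUENCY ))"  # 6389760
```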

rh-codebase commented 4 years ago

Indeed it does.

```
ubuntu@corr00:~/proj/dsa110-shell/dsa110-xGPU/src$ ./xgpuinfo
xGPU library version: 2.0.0+1@gf582271-dirty
Number of polarizations: 2
Number of stations: 64
Number of baselines: 2080
Number of frequencies: 768
Number of time samples per GPU integration: 2048
Number of time samples per transfer to GPU: 128
Type of ComplexInput components: 8 bit integers
Type of computation: FP32 multiply, FP32 accumulate
Number of ComplexInput elements in GPU input vector: 201326592
Number of ComplexInput elements per transfer to GPU: 12582912
Number of Complex elements in GPU output vector: 6389760
Number of Complex elements in reordered output vector: 6389760
Output matrix order: triangular
Shared atomic transfer size: 4
Complex block size: 1
```

Since we are modifying the Makefile for DSA110, is there a reason not to include these variables so that make just works? Or do you play with these numbers often?

VR-DSA commented 4 years ago

Well, they will be modified at some point, but not for the next several months. Up to you!
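If we do bake them in, one low-risk option (assuming the upstream Makefile doesn't already force these values) is overridable `?=` defaults, so a bare `make` works but explicit command-line values still win:

```make
# Hypothetical defaults near the top of the dsa110-xGPU Makefile
CUDA_ARCH  ?= sm_75
NPOL       ?= 2
NSTATION   ?= 64
NFREQUENCY ?= 768
NTIME_PIPE ?= 128
NTIME      ?= 2048
```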


rh-codebase commented 4 years ago

Makes more sense to me to track this in the dsa110-xGPU repo instead of the maas repo; I'll update the repo. dsa110-xGPU is now successfully building and installing (to /usr/local) in the containers:

```
root@corr00:/home/ubuntu/proj/dsa110-shell/dsa110-xGPU/src# xgpuinfo
xGPU library version: 2.0.0+1@gf582271
Number of polarizations: 2
Number of stations: 64
Number of baselines: 2080
Number of frequencies: 768
Number of time samples per GPU integration: 2048
Number of time samples per transfer to GPU: 128
Type of ComplexInput components: 8 bit integers
Type of computation: FP32 multiply, FP32 accumulate
Number of ComplexInput elements in GPU input vector: 201326592
Number of ComplexInput elements per transfer to GPU: 12582912
Number of Complex elements in GPU output vector: 6389760
Number of Complex elements in reordered output vector: 6389760
Output matrix order: triangular
Shared atomic transfer size: 4
Complex block size: 1
```

PSRDADA is next to be integrated into our repo and built. Most likely there will be changes to the Makefile in dsa110-xengine to look for include directories in these newly installed locations.
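The Makefile tweaks might be as small as pointing at the /usr/local install locations (the flag and library names here are my guesses, not verified against dsa110-xengine):

```make
# Hypothetical additions to dsa110-xengine/src/Makefile
CFLAGS  += -I/usr/local/include          # xgpu.h and psrdada headers
LDFLAGS += -L/usr/local/lib -lxgpu -lpsrdada
```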

rh-codebase commented 4 years ago

dsa110-psrdada compiles in the container. dsa110-xengine fails to link: it can't find sigproc. On dsa110master, /usr/local/sigproc is a git repo with local modifications, but psrdada also contains a sigproc dir. Which one should be used? If /usr/local/sigproc, then I'll create a dsa110-sigproc and go from there. Also, cuda is installed via the apt package manager. On dsa110master, version 10.2 was used; in the container, I believe 11.2 is used. Will this matter?

rh-codebase commented 4 years ago

On dsa110master, in the /usr/local/sigproc/src repo, there's an untracked file fake_frb.c which looks to be a copy of fake.c but with 2 line changes. Should this file be tracked? There's no corresponding .o.

VR-DSA commented 4 years ago

Interesting! The sigproc that should be used is /usr/local/sigproc on dsa110master. The version of cuda is an interesting one. I don’t think it should matter, but there’s a good chance it will. But in any case it’s best to use 11.2 rather than 10.2, so if something breaks let me know and I can try to fix it.

fake_frb.c should probably be tracked as well in sigproc.
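Bringing it under version control would then just be (the commit message is illustrative):

```shell
cd /usr/local/sigproc/src
git add fake_frb.c
git commit -m "Track fake_frb.c (copy of fake.c with 2-line FRB changes)"
```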

Thanks for sorting through all this!


rh-codebase commented 4 years ago

I have a working container install of the dsa110-xengine repo.

```
root@corr00:/home/ubuntu/proj/dsa110-shell/dsa110-xengine/src# ls -l | grep rwx
-rwxr-xr-x 1 ubuntu ubuntu  289336 Jul 24 20:13 dsaX_beamformer
-rwxr-xr-x 1 ubuntu ubuntu  297872 Jul 24 20:14 dsaX_capture
-rwxr-xr-x 1 ubuntu ubuntu  225160 Jul 24 20:14 dsaX_dbnic
-rwxr-xr-x 1 ubuntu ubuntu  218424 Jul 24 20:14 dsaX_fake
-rwxr-xr-x 1 ubuntu ubuntu  228888 Jul 24 20:14 dsaX_nicdb
-rwxr-xr-x 1 ubuntu ubuntu  247208 Jul 24 20:14 dsaX_reorder_raw
-rwxr-xr-x 1 ubuntu ubuntu  225048 Jul 24 20:13 dsaX_split
-rwxr-xr-x 1 ubuntu ubuntu  235816 Jul 24 20:14 dsaX_writeFil
-rwxr-xr-x 1 ubuntu ubuntu 4959488 Jul 24 20:14 dsaX_writevis
-rwxr-xr-x 1 ubuntu ubuntu  243208 Jul 24 20:13 dsaX_xgpu
```

The Makefile at v1.0.0 depends on how the support code in the other dsa110 repos is installed. There's a shell script that MAAS uses to install these repos, so using this script for local dev is one way forward, but not the only way. If changes are made to these dsa110-specific repos and the version is bumped, then the MAAS config needs to be kept in lock-step with those changes.

VR-DSA commented 4 years ago

@rh-codebase this is excellent! You'll have to train us up on editing the MAAS config at some point. One thing - did you end up incorporating the network configuration stuff (e.g., MTU, sysctl.conf)?

rh-codebase commented 4 years ago

Yep, did the MTU/sysctl stuff about a week ago; I'll look to see if MAAS reports these settings. I'm having an issue with one of the servers, so I'm trying to solve that first. Hope to start migrating containers over soon. A big remaining unknown is integrating the Labjacks. It should be straightforward, but I can't commit to it being a 1-hour endeavor. I'm actually not sure we even need names for them in MAAS, since I believe James has a scheme where they know who they are and report their monitor data correctly. MAAS reports the servers' copper and fiber at 10G. I thought the copper Arista switch reported 1G, so I'll double-check that. I haven't done any bandwidth benchmarking yet.
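For the record, the sysctl side of that tuning typically looks something like this for high-rate UDP capture (the keys are standard Linux sysctls, but these particular values are illustrative, not necessarily what's deployed; the MTU itself is set on the NIC, e.g. jumbo frames via netplan):

```
# /etc/sysctl.conf (illustrative values)
net.core.rmem_max = 536870912          # max socket receive buffer
net.core.rmem_default = 33554432
net.core.netdev_max_backlog = 250000   # packets queued per NIC before drops
```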

VR-DSA commented 4 years ago

Nice! Hope the LabJacks don't take too much time. Also, don't worry about that benchmarking of the 10G - we can deal with that down the track when it becomes essential. Something that's more urgent is getting the SNAPs to receive DHCP from MAAS. I'll send an e-mail to you and Mark about this.

rh-codebase commented 4 years ago

The Labjacks, if they DHCP properly, should take no time. James confirmed he does not need DNS names for them, so as long as MAAS gives them an IP and they accept it, there should be no issues. Mark helped me assign the first 7 SNAPs to DNS names, so we have snap00 - snap06. One of the SNAPs fails to ask for a DHCP address; I believe we saw this in the bldg 7 lab environment as well. Mark has set it aside for you to re-program. I finally got lxd110h01 to boot up like the others, and discovered I need to tweak some of the install scripts for the other nodes. h00 is special, and there are parts of it I need to transfer over to the more generic curtin config files for the rest of the nodes. Then copy and build the support containers (etcd, influx, grafana, wx). Then the antenna/BEB LJs can be hooked up.