Open rtobar opened 4 years ago
After reading some more code, I think I understand better what is going on, and how to go about it. The problem was indeed that bbcp tries very hard to use each node's hostname as the main bit of information for establishing the data channels. I just tried some changes to bbcp locally and I could get it to use the interface I wanted (I tried on my laptop, contacting NGAS through the network interface, which started bbcp and had it exchange data through the same interface). I will now expose those changes through a new option (so the old behavior is left unchanged), and will try again. Stay tuned...
I've implemented a new -j command-line option in bbcp that should make it prefer using the hostnames/IPs given in the file specifications in the command line instead of the hostnames of the nodes involved in the data exchange. Again, I tried this locally on my laptop by forcing bbcp to use my ethernet interface for the data exchange instead of the loopback interface, and it seems to work.
@gsleap could you give this a try when you have some time? Make sure you have the latest master version of bbcp from https://github.com/ICRAR/bbcp on both machines. Then go with the following on mwacache10 (has -j, but doesn't have -z):
bbcp -j -f -V -n -S "ssh -x -a -oBatchMode=yes -oGSSAPIAuthentication=no -oFallBackToRsh=no %4 %I -l %U %H bbcp" -e -E c32c -s 12 -P 2 mwa@192.168.120.204:/data/20191210/rawdump_1260043216.raw 192.168.120.110:/home/mwa/NGAS/volume2/staging/NGAMS_TMP_FILE___3airu2emrawdump_1260043216.raw.fits
Hopefully this will take us in the right direction. I experienced some slowness while the connection was actually being established between the SRC and SNK copies of bbcp, and while I didn't stop to find out what was causing it, I'm hoping it has more to do with my setup and environment, and the fact that both copies are on the same computer in my case, than with the actual changes I made to the software.
Hi Rod,
Awesome, thanks for that. That did the trick: the bbcp transfer was successful and, without any real effort, achieved a sustained speed of ~6 Gbps!
The only issue (well, not an issue, more of a nitpick) is that the final stats output shows:
Target 127.0.1.1 using a final recv window of 3137568
Source 127.0.1.1 using a final send window of 6256640
(using 127.0.1.1, which is Ubuntu's own internal DNS IP, rather than the IPs involved in the transfer)
No big deal though.
Thanks again!
Greg
From: rtobar notifications@github.com Sent: Wednesday, 12 February 2020 1:53 PM To: ICRAR/ngas ngas@noreply.github.com Cc: Greg Sleap greg.sleap@curtin.edu.au; Mention mention@noreply.github.com Subject: Re: [ICRAR/ngas] Investigate how/if bbcp allows to select network paths for data transfers (#21)
That's great news! I'll then put some effort into getting these changes upstreamed to the original bbcp maintainer, and into making the corresponding changes in NGAS to ensure we use the -j option when available.
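As a rough illustration of the NGAS side of that change, here is a minimal Python sketch that assembles a bbcp command line and adds -j only when the caller says the local bbcp supports it. The function name `build_bbcp_command` and the `prefer_spec_hosts` parameter are hypothetical and not taken from NGAS. How availability of -j would be detected is left out, and the -S ssh template from the command above is omitted for brevity. The flags used are the ones that appear in that command.

```python
from typing import List


def build_bbcp_command(source: str, target: str,
                       prefer_spec_hosts: bool,
                       streams: int = 12) -> List[str]:
    """Assemble a bbcp argument list (hypothetical helper, not NGAS code)."""
    cmd = ["bbcp"]
    if prefer_spec_hosts:
        # -j exists only in the ICRAR fork discussed in this thread, so it is
        # added only when the caller knows the installed bbcp accepts it.
        cmd.append("-j")
    # Remaining flags are copied from the command tested in this thread.
    cmd += ["-f", "-V", "-n", "-e", "-E", "c32c", "-s", str(streams), "-P", "2"]
    cmd += [source, target]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_bbcp_command(
        "2.2.2.2:/path/to/source/file",
        "2.2.2.3:/path/to/ngas/staging/file",
        prefer_spec_hosts=True)))
```

Keeping the decision in a single boolean means the old behavior is preserved for any bbcp build that does not know about -j.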
During the work done in #19, @gsleap found that no matter the combination of parameters given to bbcp, it apparently always resorted to using hostnames (and fully qualified domain names) as the sole basis for establishing the connection between the source and the target nodes. This is not enough for certain scenarios, where two machines have multiple, independent network paths that can be taken depending on the interface being addressed.
Consider the following scenario, which is similar to the deployment used in the tests described in #19:
The NGAS server running in B is listening on eth1, the 10 Gb interface. When the BBCPARC command comes in from A we generate and execute this command in B:
bbcp .... 2.2.2.2:/path/to/source/file 2.2.2.3:/path/to/ngas/staging/file
By using the specific IPs in the source and target specifications we expect bbcp to explicitly use the 10 Gb interface for the data transfer.
The command starts the SRC and SINK copies of bbcp in A and B respectively. bbcp however seems to use exclusively the hosts' names as the main bit of information to establish the connection between SRC and SINK, and because A and B resolve to the 1.1.1.X addresses, the 1 Gb link is used for the bbcp data transfers. This behavior seems to be the same regardless of the direction of the connection establishment (i.e., the -z option) and whether name resolution (i.e., the -n option) is used, but this should be tested thoroughly.
This problem was initially investigated in #19, but then it was decoupled into a new issue to separate it from the original problem reported in #19, which has been fixed.
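To make the distinction concrete, the following Python sketch pulls the host part out of a file specification and checks whether it is an IP literal, which is the piece of information the issue expects bbcp to honor. It is not bbcp's own parsing logic. The names `FileSpec`, `parse_spec` and `is_ip_literal` are hypothetical, and the sketch assumes specifications of the form [user@]host:path with IPv4 addresses or plain hostnames, as seen in the commands in this thread.

```python
import ipaddress
from typing import NamedTuple, Optional


class FileSpec(NamedTuple):
    user: Optional[str]
    host: Optional[str]
    path: str


def parse_spec(spec: str) -> FileSpec:
    """Split '[user@]host:path' into its parts (illustrative only)."""
    head, sep, path = spec.partition(":")
    if not sep or "/" in head:
        # No host part: treat the whole string as a local path.
        return FileSpec(None, None, spec)
    user, at, host = head.rpartition("@")
    return FileSpec(user if at else None, host, path)


def is_ip_literal(host: str) -> bool:
    """True if host is an IP address rather than a name needing resolution."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    spec = parse_spec("2.2.2.2:/path/to/source/file")
    print(spec, is_ip_literal(spec.host))
```

When the host is an IP literal, as with 2.2.2.2 and 2.2.2.3 in the scenario, the address already identifies the interface, so falling back to the node's hostname discards that choice.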