DUNE-DAQ / snbmodules

Implementations of file transfer service for SNB data files

Torrent based multi-file transfer unfinished #7

Open roland-sipos opened 9 months ago

roland-sipos commented 9 months ago

During integration testing, the bookkeeper reported unfinished transfers whenever a transfer consisted of multiple files.

***** Bookkeeper localhostsnbbookkeeper localhost:5000 informations display *****
Connected clients :
    * Session localhostsnbclient0_sestransfer0 is active
         - output_localhosteth0_0.out   159386400 bytes from 10.73.136.79:5001  UPLOADING   105%    -Bi/s   06-10-2023 09:59:24 2935ms  N/A     
         - output_localhosteth0_1.out   159386400 bytes from 10.73.136.79:5001  UPLOADING   105%    -Bi/s   06-10-2023 09:59:27 606ms   N/A     
         - output_localhosteth0_2.out   159386400 bytes from 10.73.136.79:5001  UPLOADING   105%    -Bi/s   06-10-2023 09:59:24 3540ms  N/A     
    * Session localhostsnbclient1_sestransfer0 is active
         - output_localhosteth0_0.out   159386400 bytes from 10.73.136.79:5001  FINISHED    100%    -Bi/s   N/A 0ms 06-10-2023 09:59:25     
         - output_localhosteth0_1.out   159386400 bytes from 10.73.136.79:5001  DOWNLOADING 0%  -Bi/s   06-10-2023 09:59:25 1822ms  N/A     
         - output_localhosteth0_2.out   159386400 bytes from 10.73.136.79:5001  DOWNLOADING 0%  -Bi/s   06-10-2023 09:59:25 1720ms  N/A

On the other hand, the seed client correctly reports that every file was uploaded successfully once. The data files are also present on the destination client, but only the first file has a resume torrent file:

total 458M
   0 drwxr-xr-x 3 rsipos np-comp   23 Oct  6 09:59 ..
153M -rw-r--r-- 1 rsipos np-comp 153M Oct  6 09:59 output_localhosteth0_0.out
348K -rw-r--r-- 1 rsipos np-comp 347K Oct  6 09:59 .resume_file_output_localhosteth0_0.out
928K -rw-r--r-- 1 rsipos np-comp 925K Oct  6 09:59 bittorrent.log
153M -rw-r--r-- 1 rsipos np-comp 153M Oct  6 09:59 output_localhosteth0_1.out
   0 drwxr-xr-x 2 rsipos np-comp  177 Oct  6 09:59 .
153M -rw-r--r-- 1 rsipos np-comp 153M Oct  6 09:59 output_localhosteth0_2.out

This seems to indicate a problem in how the downloader (leech) client finishes and reports transfers when a transfer consists of multiple files.

roland-sipos commented 9 months ago

Indeed, the problem comes from the bittorrent implementation of the transfer interface and is related to how a transfer finishes. To be investigated:

Problem(s) found in logfile /tmp/rsipos/pytest-of-rsipos/pytest-0/run0/log_snbclient_4338.txt:
2023-Oct-06 09:59:25,304 WARNING [void dunedaq::snbmodules::TransferInterfaceBittorrent::do_work(std::atomic<bool>&) at /nfs/sw/rsipos/DUNE/Sept/snb-NFD23-10-03/sourcecode/snbmodules/src/common/transfer_interface_bittorrent.cpp:214] BittorrentPeerDisconnectedError: Peer disconnected output_localhosteth0_1.out peer [ 10.73.136.79:52191 client: libtorrent 2.0.9 ] disconnecting (TCP) [sock_read] [asio.misc]: End of file (reason: 0)

2023-Oct-06 09:59:26,316 WARNING [void dunedaq::snbmodules::TransferInterfaceBittorrent::do_work(std::atomic<bool>&) at /nfs/sw/rsipos/DUNE/Sept/snb-NFD23-10-03/sourcecode/snbmodules/src/common/transfer_interface_bittorrent.cpp:214] BittorrentPeerDisconnectedError: Peer disconnected output_localhosteth0_2.out peer [ 10.73.136.79:53395 client: libtorrent 2.0.9 ] disconnecting (TCP) [sock_read] [asio.misc]: End of file (reason: 0)

LJoyL commented 6 months ago

This may be linked to the m_done flag in the do_work method. This flag stops the bittorrent client, but it is not handled properly. It looks like the first file finishes before the others are added to the client, which makes the client stop early.

Possible fix: pass the number of files to transfer up front and use it to trigger the stop of the client, instead of counting the files as they are added. See fork branch with possible fix here
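
To illustrate the suspected race and the proposed fix, here is a minimal sketch. It is not the code from the fork branch or from transfer_interface_bittorrent.cpp: the CompletionTracker class and all of its member names are hypothetical and only model the two stop conditions discussed above. It is also not thread-safe as written; a real do_work loop would need atomics or a lock around the counters.

```cpp
#include <cstddef>

// Hypothetical helper (not part of snbmodules): models when the
// bittorrent client would consider the whole transfer done.
class CompletionTracker {
public:
    // expected_files is assumed to be known from the transfer metadata
    // before any file is added to the client.
    explicit CompletionTracker(std::size_t expected_files)
        : m_expected(expected_files) {}

    // Called when a file is handed to the client.
    void on_file_added() { ++m_added; }

    // Called when the client reports a file as finished.
    void on_file_finished() { ++m_finished; }

    // Suspected current behaviour: compare against the files added so far.
    // If the first file finishes before the others are added, this is
    // already true and the client stops too early.
    bool done_by_added_count() const {
        return m_added > 0 && m_finished == m_added;
    }

    // Proposed behaviour: compare against the number of files announced
    // up front, so a fast first file cannot trigger the stop.
    bool done_by_expected_count() const {
        return m_finished >= m_expected;
    }

private:
    std::size_t m_expected;
    std::size_t m_added = 0;
    std::size_t m_finished = 0;
};
```

With three expected files, adding and finishing only the first one makes done_by_added_count() return true while done_by_expected_count() stays false until all three files have finished. This would match the bookkeeper output above, where one file is FINISHED and the other two remain at 0%.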