EUDAT-B2STAGE / B2STAGE-GridFTP

B2STAGE service core code for EUDAT project: iRODS-DSI
14 stars 15 forks source link

Corrupted read object #9

Closed muccix closed 9 years ago

muccix commented 9 years ago

I Vlad,

today by chance a colleague of mine has noticed that the file downloaded from iRODS through the GridFTP-DSI are corrupted: actually the dimension of the file is correct, but the file is empty.

That is really strange, because I am quite sure I did some test with Globus Online with the MD5 integrity check and I didn't notice any error.

Could you please check if you have the same problem?

Thanks a lot Roberto

muccix commented 9 years ago

Hi Vlad,

I think I managed to correct the problem ( I am a bit puzzled actually since I don't think anyone has touched that part and I'm quite sure it used to work...).

Before committing the code I want to spend some more time to clear my mind but, anyway, what I changed is:

typedef struct globus_l_iRODS_read_ahead_s
{
    struct globus_l_gfs_iRODS_handle_s *  iRODS_handle;
    globus_off_t                        offset;
    globus_size_t                       length;
    //globus_byte_t                       buffer[1]; <-- line commented
    globus_byte_t *                       buffer;  <-- line added
} globus_l_iRODS_read_ahead_t;

and in "globus_l_gfs_iRODS_read_ahead_next":

//bytesBuf_t DataObjReadOutBBuf;
    //dataObjReadOutBBuf.buf = rh->buffer; <-- line commented
    //dataObjReadOutBBuf.len = read_length;//rh->length; <-- line commented

    rh->length = rcDataObjRead (iRODS_handle->conn, &dataObjReadInp, &dataObjReadOutBBuf);
    if(rh->length <= 0)
    {  
        result = GLOBUS_SUCCESS; /* this may just be eof */
        goto attempt_error;
    }

    rh->buffer =  (globus_byte_t *)dataObjReadOutBBuf.buf; <-- line added
    dataObjReadOutBBuf.len = read_length;//rh->length; <-- line added

I'm looking forward to having your opinion on this!

Cheers Roberto

vladimir-mencl-eresearch commented 9 years ago

Hi Roberto,

I haven't run into this issue yet, but I haven't done any more intense testing yet.

I've started looking into the code, but today has been extremely busy.

I'll get back to you by the end of the week.

Cheers, Vlad

vladimir-mencl-eresearch commented 9 years ago

Hi Roberto,

I think I found it.

I have not been able to reproduce the corruption, but I found a place in the iRODS source code where it could happen and that is consistent with how you fixed it.

The first part of your change (to the globus_l_iRODS_read_ahead_t structure) is just syntactic - it should not really matter whether the buffer is allocated as part of the structure or whether it's referenced by a pointer.

The interesting part is how you interface with the rcDataObjRead function - whether you provide it an buffer as input parameter, or whether you take the buffer returned by the function.

I was looking into the API documentation, but it did not specifically say which of these ways of using it should be used - and apparently, it accepts both.

Digging through the iRODS (3.3.1) source code, diving deep in the call hierarchy (rcDataObjRead -> procApiRequest -> readAndProcApiReply -> readMsgBody), I eventually found that readMsgBody:

  1. would accept a buffer if it is provided.
  2. would allocate a new buffer if it is not provided
  3. would re-allocate the buffer (i.e., deallocate and allocate a new one) if it is provided but is not large enough.

I think the last one is what triggers the inconsistency. But I have not been able to trigger the bug - I don't know under which circumstances would the rcDataObjRead call return more data than how much was requested.

But, as the buffer pointer may change, the only safe way is to take the pointer only after the call to rcDataObjRead returns.

So please go ahead with your changes - I think my write up above gives a plausible explanation as to how the read corruption was happening and why this change fixed it.

(And, while making the change, please remember to change the line allocating the globus_l_iRODS_read_ahead_t * rh structure not to include the buffer (read_length) in the memory calculation).

Hope this helps!

Cheers, Vlad

muccix commented 9 years ago

Hi Vlad,

thanks for your comment; I have pushed the modification and created a new release. It would be nice if you could test it.

So, if I understand right, you are not experiencing the problem? This is really strange.

Are you using iRODS 4.0.3 or iRODS 4.1? We are currently using the 4.0.3 version...

Cheers, Roberto

vladimir-mencl-eresearch commented 9 years ago

Hi Roberto,

I wasn't getting the problem with 3.3.1.

I could reproduce this with 4.0.3 - I got a file filled just with zeros.

I've also looked into the source code and in 4.0.3, there is a slightly different code path, always allocating a new buffer - and thus triggering a bug. 4.1.x reverts to the originally behavior - reuse the buffer if it exists and is large enough.

I have confirmed your fix fixes it on 4.0.3.

Having looked into the source, I'm happy with explaining how it happened and I'm confident you have implemented the proper fix.

I think it's OK to close this issue now.

Cheers, Vlad

vladimir-mencl-eresearch commented 9 years ago

PS: And just for completeness, I've just checked also with 3.3.1 and it works as well (i.e., did not break anything :-)

Cheers, Vlad

muccix commented 9 years ago

Hi Vlad,

here I am again!

I'm still using iRODS 4.0.3 and, despite my fix, I found a situation in which the object is corrupted during ad upload.

It seems to happen using globus onlne in the case there is a restart of the gridftp server which then tries to resume the upload from the past point; here are some server logs:

ALLO 15936061440^M
[14871] Wed Jul  1 15:49:55 2015 :: smtp.globusonline.org:39340: [SERVER]: 200 ALLO command successful.^M
[14871] Wed Jul  1 15:49:55 2015 :: smtp.globusonline.org:39340: [CLIENT]: REST 0-10397941760^M
[14871] Wed Jul  1 15:49:55 2015 :: smtp.globusonline.org:39340: [SERVER]: 350 Restart Marker OK. Send STORE or RETR to initiate transfer.^M
[14871] Wed Jul  1 15:49:55 2015 :: smtp.globusonline.org:39340: [CLIENT]: STOR /CINECA01/home/cin_staff/rmucci00/DSI_Test/181000.tar

When this happen, the upload seems to work fine but the file is corrupted (it becomes corrupted as soon as the transfer restart).

I would like to try with iRODS 4.1.3 but I have problems with the GSI authentication: which irods_auth_plugin_gsi are you using? With the 1.1 I receive a segmentation fault, while building from source I receive the following error:

$ ils
DEBUG: Client side: GSS-API error initializing context: GSS Major Status: Communications Error
DEBUG: Client side: GSS-API error initializing context: GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSS token
globus_gsi_gssapi: Error with GSS token: The input token has an invalid length of: 0

ERROR: [-]      iRODS/lib/core/src/clientLogin.cpp:321:clientLogin :  status [GSI_ERROR_INIT_SECURITY_CONTEXT]  errno [] -- message []
        [-]     libgsi.cpp:593:gsi_auth_establish_context :  status [GSI_ERROR_INIT_SECURITY_CONTEXT]  errno [] -- message [Failed initializing GSI context. Major status: 0    Minor status: 1955654848]

ERROR: [-]      iRODS/lib/core/src/rcConnect.cpp:272:rcDisconnect :  status [SYS_HEADER_WRITE_LEN_ERR]  errno [Broken pipe] -- message []
        [-]     iRODS/lib/core/src/sockComm.cpp:1346:sendRodsMsg :  status [SYS_HEADER_WRITE_LEN_ERR]  errno [Broken pipe] -- message [failed to call 'write body']
                [-]     libtcp.cpp:420:tcp_send_rods_msg :  status [SYS_HEADER_WRITE_LEN_ERR]  errno [Broken pipe] -- message [writeMsgHeader failed]
                        [-]     iRODS/lib/core/src/sockComm.cpp:467:writeMsgHeader :  status [SYS_HEADER_WRITE_LEN_ERR]  errno [Broken pipe] -- message []
                                [-]     libtcp.cpp:341:tcp_write_msg_header :  status [SYS_HEADER_WRITE_LEN_ERR]  errno [Broken pipe] -- message [wrote 0 expected -1946157056]
vladimir-mencl-eresearch commented 9 years ago

Hi Roberto,

I think iRODS does not support multiple writers writing into the same data object (open with separate rcDataObjOpen calls). I'm guessing here, but my guess is this will not work neither for multiple parallel streams, nor for restarts.

I don't know whether there is a way to declare that this module does not support parallel transfers and restarts - either in how the module registers with the GridFTP server, or in `/etc/gridftp.conf settings recommended in the documentation. I tried having a quick look, but could not find anything....

For the file corruption you get after a restart, my guess is that the file is missing the data received in the initial transfer (and I'd expect it to have just zeroes there). Would you be able to confirm this?

Cheers, Vlad

PS: Regarding GSI: I'm now setting up an iRODS 4.1.3 server and GSI built from the git master HEAD is aborting when initializing server certificate. My workaround is to use username+password authentication as rods in the irods_environment.json file (and use the connectAsAdmin feature).

muccix commented 9 years ago

Hi Vlad,

I think the problem is caused by this line:

dataObjInp.openFlags = O_TRUNC | O_WRONLY;

It must be modified like this:

dataObjInp.openFlags = flags;

so when the file is opened again it is not truncated if not required.

I'm doing some more tests to see if it really solves the problem.

Cheers, Roberto