PelicanPlatform / pelican

The Pelican Platform for creating data federations
https://pelicanplatform.org/
Apache License 2.0

`writes` not shown in origin records #1630

Open CannonLock opened 1 month ago

CannonLock commented 1 month ago

None of the currently running origins are reporting writes for xrootd_transfer_bytes. This is visible both in the director metrics reported through Prometheus and in the Elasticsearch logs linked below.
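One way to check a single origin directly is to sum xrootd_transfer_bytes per transfer type from a saved metrics scrape. The snippet below is a minimal sketch, not Pelican code: it assumes the metric carries a `type` label with values such as `read` and `write`, and the sample text is made up for illustration.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text exposition format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def bytes_by_type(exposition_text, metric="xrootd_transfer_bytes", label="type"):
    """Sum the samples of `metric`, grouped by the value of `label`.

    ASSUMPTION: the label is called "type"; adjust if the origin differs.
    """
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get(label, "<unlabeled>")] += float(match.group("value"))
    return dict(totals)


# Made-up scrape for illustration; not output from a real origin.
SAMPLE = """\
# TYPE xrootd_transfer_bytes counter
xrootd_transfer_bytes{path="/demo",type="read"} 1024
xrootd_transfer_bytes{path="/demo",type="readv"} 512
xrootd_transfer_bytes{path="/other",type="read"} 2048
xrootd_transfer_operations_count{path="/demo",type="write"} 3
"""

totals = bytes_by_type(SAMPLE)
print(totals)
print("write series present:", "write" in totals)
```

If `write` never appears as a key for an origin that is known to accept uploads, the data is already missing before it reaches the director or Elasticsearch.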

In the reports below, none of the Origins I am familiar with appear to be reporting writes, and many are missing from the reports entirely.

https://github.com/CannonLock/ES_queries/blob/master/osdf.ipynb

I have not ruled out the aggregation layer as the source of the problem. I also don't have access to the origins that I know are handling writes, so I can't check whether they record those writes locally.

Assigning to Brian as it is not clear to me where this problem lies.

bbockelm commented 1 month ago

Adding this to Justin's plate for 7.12.

A few notes:

  1. XRootD creates a packet for each transfer; the format of the packet is here.
  2. Sometime after the transfer completes (we usually have this set to O(10s)), the packet is sent over UDP to the Pelican process.
  3. Pelican parses the packet and adds it to a Prometheus counter (see the metrics subdirectory).

So, if we are missing write data, it's likely breaking in either step (1) or step (3). I'd suggest trying to bisect the issue - is it an xrootd problem or a Pelican problem? - by adding a few judicious logging statements to the packet processor and uploading data to a development origin.
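As a rough illustration of where such logging statements would sit, here is a small stand-alone sketch of steps (2) and (3): receive a UDP packet, log its decoded contents, then add it to per-type totals. This is not Pelican's packet processor. The three-integer packet layout, the port number, and the names `handle_packet` and `serve` are all hypothetical stand-ins, and a plain `Counter` takes the place of the Prometheus counter.

```python
import logging
import socket
import struct
from collections import Counter

log = logging.getLogger("packet_probe")

# HYPOTHETICAL layout, NOT the real XRootD monitoring format:
# three big-endian unsigned 64-bit byte counts (read, readv, write).
RECORD = struct.Struct("!QQQ")
FIELDS = ("read", "readv", "write")


def handle_packet(payload, totals):
    """Decode one stand-in packet, log it, and add it to the totals."""
    if len(payload) != RECORD.size:
        log.warning("dropping packet with unexpected size %d", len(payload))
        return False
    values = RECORD.unpack(payload)
    # Log what arrived before updating the counters: if a write shows up
    # here but not in the totals, the bug is on the processing side; if it
    # never shows up here, the sender never reported it.
    log.info("packet: %s", dict(zip(FIELDS, values)))
    for name, value in zip(FIELDS, values):
        totals[name] += value
    return True


def serve(host="127.0.0.1", port=9930, max_packets=None):
    """Receive packets over UDP and return the accumulated totals.

    The port is an arbitrary choice for this sketch.
    """
    totals = Counter()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        seen = 0
        while max_packets is None or seen < max_packets:
            payload, peer = sock.recvfrom(65535)
            log.debug("received %d bytes from %s", len(payload), peer)
            handle_packet(payload, totals)
            seen += 1
    return totals
```

Keeping the decode-and-log step in a function separate from the socket loop means it can be exercised with hand-built payloads, without needing a running XRootD instance.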

patrickbrophy commented 3 weeks ago

My findings are the following:

So now I am going to investigate why exactly we are missing the writes but not the reads.

patrickbrophy commented 2 weeks ago

I think I may have found the bug. I used xrdcp to copy a file to the Pelican XRootD server. This correctly triggered a write, which was recorded in both xrootd_transfer_operations_count and xrootd_transfer_bytes. I also tested reads, and those worked correctly as well. This leads me to believe that there may be an issue in the XrdHttp protocol implementation.
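To make this kind of comparison repeatable, the per-type counter values can be snapshotted before and after each upload and then diffed. The helper below is only a sketch; `counter_delta` is a hypothetical name, and the numbers are invented to show the shape of the result, not measurements from an origin.

```python
def counter_delta(before, after):
    """Return the per-key change between two counter snapshots.

    Keys whose value did not change are omitted, so an empty result
    means the upload left no trace in the counters.
    """
    keys = sorted(set(before) | set(after))
    return {
        key: after.get(key, 0) - before.get(key, 0)
        for key in keys
        if after.get(key, 0) != before.get(key, 0)
    }


# Invented snapshots of bytes per transfer type, for illustration only.
before = {"read": 3072}
after_upload_a = {"read": 3072, "write": 1048576}  # upload that was counted
after_upload_b = {"read": 3072}                    # upload that was not counted

print(counter_delta(before, after_upload_a))
print(counter_delta(before, after_upload_b))
```

Running the same upload once per protocol and comparing the two deltas would show directly which path fails to record writes.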