dmwm / DDM

Dynamic Data Management - Cache release and auto-replication of hot data

Possibly corrupted file not showing up in corruptedFiles #113

Closed aminnj closed 7 years ago

aminnj commented 7 years ago

Hi,

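I've been submitting jobs to run on file [1] about 200 times over the past month, and each time, xrootd fails on that file with [2]. To check, the command [3] hangs locally. However, looking at https://cmsweb.cern.ch/popdb/popularity/corruptedFiles, I don't see this file (or any files, actually). Is this expected? The description on that page states:

  • List of files (per site) that ALWAYS failed in job accesses in the last 15 days.
  • Only CMSSW errors 8020 and 8021 are accounted for the job failures.
  • The table provides the number of access failures and the distinct days in which these failures did occur.
  • Only files with at least 5 failures or 3 distinct days of failures are summarized in the table.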

Given that I've been having this problem for a month now, I'd expect point 1 to be satisfied. And I guess from point 2, this API gathers data from CRAB. I'm not using CRAB myself, but I'd wager that other people are definitely running on this file as it is 2017 data.
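For illustration, here is a rough sketch of how the selection rules quoted from the corruptedFiles page (always failed within the last 15 days, only exit codes 8020 and 8021 counted, at least 5 failures or 3 distinct days) might be applied to a list of job accesses. The record layout `(site, lfn, day, exit_code)`, the function name, and the reading of "ALWAYS failed" as "no access with any other outcome" are all assumptions made for this example; this is not the actual popularity service code.

```python
from collections import defaultdict
from datetime import date, timedelta

# Exit codes the page says it counts as failures.
COUNTED_EXIT_CODES = {8020, 8021}


def corrupted_candidates(accesses, today, window_days=15,
                         min_failures=5, min_days=3):
    """Return {(site, lfn): (n_failures, n_distinct_days)}.

    accesses: iterable of (site, lfn, day, exit_code) tuples
    (hypothetical layout; exit_code 0 means the job succeeded).
    """
    cutoff = today - timedelta(days=window_days)
    stats = defaultdict(lambda: {"failures": 0, "days": set(), "other": 0})

    for site, lfn, day, exit_code in accesses:
        if day < cutoff:
            continue  # outside the 15-day window
        entry = stats[(site, lfn)]
        if exit_code in COUNTED_EXIT_CODES:
            entry["failures"] += 1
            entry["days"].add(day)
        else:
            # Assumption: any other outcome means the file did not
            # "always" fail, so it is excluded below.
            entry["other"] += 1

    return {
        key: (e["failures"], len(e["days"]))
        for key, e in stats.items()
        if e["other"] == 0
        and (e["failures"] >= min_failures or len(e["days"]) >= min_days)
    }


if __name__ == "__main__":
    # Hypothetical site name and file names, for demonstration only.
    sample = [
        ("T2_EXAMPLE", "/store/a.root", date(2017, 9, 1), 8020),
        ("T2_EXAMPLE", "/store/a.root", date(2017, 8, 30), 8021),
        ("T2_EXAMPLE", "/store/a.root", date(2017, 8, 28), 8020),
        ("T2_EXAMPLE", "/store/b.root", date(2017, 9, 1), 8020),
        ("T2_EXAMPLE", "/store/b.root", date(2017, 9, 2), 0),
    ]
    print(corrupted_candidates(sample, today=date(2017, 9, 3)))
```

Under these rules a file can only be listed if its accesses were recorded in the first place, so failures from jobs that are not tracked by the service would never appear.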

Hope I'm not misunderstanding this API.

Thanks, Nick

[1] /store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root

[2]

    Fallback Input file root://cmsxrootd.fnal.gov//store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root also could not be opened.
    Original exception info is above; fallback exception info is below.
          [c] XrdCl::File::Open(name='root://cmsxrootd.fnal.gov//store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
    ' (errno=3011, code=400). No additional data servers were found.
          [d] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root?tried=+1213cmsxrootd2.fnal.gov1213xrootd.unl.edu,
          [e] Problematic data server: cms-xrd-global.cern.ch:1094
          [f] Disabled source: cms-xrd-global.cern.ch:1094

[3] xrdfs root://cms-xrd-global.cern.ch ls /store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root

cvuosalo commented 7 years ago

Hi Nick,

The CRAB Data Popularity corrupted files list should not be relied upon. I don't think it is populated.

Before you run jobs, you need to verify that the files your jobs will use exist on Tier-1 or Tier-2 disks. DAS is the tool to use. For your file, DAS reports that it doesn't exist anywhere:

https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=site+file%3D%2Fstore%2Fdata%2FRun2017C%2FMET%2FMINIAOD%2FPromptReco-v2%2F000%2F300%2F122%2F00000%2F6461A163-4877-E711-9937-02163E01441A.root

This file is not corrupted; it looks like it is non-existent.
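The error in [2] points the same way. As a small sketch, the errno can be pulled out of the message and mapped to a name. The mapping below assumes the XRootD protocol's kXR_* error numbering, in which 3011 is kXR_NotFound; only two codes are listed, and the function name is made up for this example.

```python
import re

# Partial mapping, assumed from the XRootD protocol's kXR_* error codes.
XROOTD_ERRORS = {
    3010: "kXR_NotAuthorized",
    3011: "kXR_NotFound",
}


def classify_xrootd_error(message):
    """Return (errno, code, name) parsed from an XrdCl error message,
    or None if the '(errno=..., code=...)' part is not present."""
    match = re.search(r"\(errno=(\d+), code=(\d+)\)", message)
    if match is None:
        return None
    errno, code = int(match.group(1)), int(match.group(2))
    return errno, code, XROOTD_ERRORS.get(errno, "unknown")


if __name__ == "__main__":
    # Excerpt of the message quoted in [2].
    log = ("error '[ERROR] Server responded with an error: [3011] "
           "No servers are available to read the file.\n"
           "' (errno=3011, code=400). No additional data servers were found.")
    print(classify_xrootd_error(log))
```

A not-found response from the global redirector is what one would expect for a file that no site serves, rather than for a file that exists but is damaged.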

The owning dataset, /MET/Run2017C-PromptReco-v2/MINIAOD, is not complete anywhere:

https://cmsweb.cern.ch/das/request?input=site%20dataset%3D/MET/Run2017C-PromptReco-v2/MINIAOD&instance=prod/global&idx=0&limit=10
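If you want to generate these DAS links for a list of files, something like the following might do. It is only a sketch: the URL parameters are copied from the first link above, the helper names are made up, and the dataset name is derived by assuming the LFN layout `/store/data/<era>/<primary dataset>/<tier>/<processing version>/...`, which matches this file but may not hold for every path.

```python
from urllib.parse import urlencode

DAS_URL = "https://cmsweb.cern.ch/das/request"


def das_site_url(lfn, instance="prod/global", limit=50):
    """Build a DAS web link for the query 'site file=<lfn>'."""
    params = {
        "view": "list",
        "limit": limit,
        "instance": instance,
        "input": f"site file={lfn}",
    }
    return f"{DAS_URL}?{urlencode(params)}"


def dataset_from_lfn(lfn):
    """Guess the owning dataset from an LFN (assumed layout, see above)."""
    parts = lfn.strip("/").split("/")
    if len(parts) < 6 or parts[0] != "store":
        raise ValueError(f"unexpected LFN layout: {lfn}")
    era, primary, tier, processing = parts[2:6]
    return f"/{primary}/{era}-{processing}/{tier}"


if __name__ == "__main__":
    lfn = ("/store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/"
           "00000/6461A163-4877-E711-9937-02163E01441A.root")
    print(das_site_url(lfn))
    print(dataset_from_lfn(lfn))
```

An empty result for the `site file=...` query means DAS knows of no site hosting the file, so a job reading it over xrootd has nothing to open.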

---Carl

On 09/03/2017 11:37 AM, Nick Amin wrote:

Hi,

I've been submitting jobs to run on file [1] about 200 times over the past month, and each time, xrootd fails on that file with [2]. To check, the command [3] hangs locally. However, looking at https://cmsweb.cern.ch/popdb/popularity/corruptedFiles, I don't see this file (or any files, actually). Is this expected? The description on that page states

  • List of files (per site) that ALWAYS failed in job accesses in the last 15 days.
  • Only CMSSW errors 8020 and 8021 are accounted for the job failures.
  • The table provides the number of access failures and the distinct days in which these failures did occur.
  • Only files with at least 5 failures or 3 distinct days of failures are summarized in the table.

Given that I've been having this problem for a month now, I'd expect point 1 to be satisfied. And I guess from point 2, this API gathers data from CRAB. I'm not using CRAB myself, but I'd wager that other people are definitely running on this file as it is 2017 data.

Hope I'm not misunderstanding this API.

Thanks, Nick

[1] /store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root

[2]

    Fallback Input file root://cmsxrootd.fnal.gov//store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root also could not be opened.
    Original exception info is above; fallback exception info is below.
          [c] XrdCl::File::Open(name='root://cmsxrootd.fnal.gov//store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
    ' (errno=3011, code=400). No additional data servers were found.
          [d] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root?tried=+1213cmsxrootd2.fnal.gov1213xrootd.unl.edu,
          [e] Problematic data server: cms-xrd-global.cern.ch:1094
          [f] Disabled source: cms-xrd-global.cern.ch:1094

[3] xrdfs root://cms-xrd-global.cern.ch ls /store/data/Run2017C/MET/MINIAOD/PromptReco-v2/000/300/122/00000/6461A163-4877-E711-9937-02163E01441A.root


aminnj commented 7 years ago

Hi Carl,

Thanks for clarifying.

I finally found out the full story: as you say, this file doesn't exist. But, a few weeks ago, it did show up on DAS/DBS (when using validFilesOnly=1). This is no longer the case.

And looking back at my other jobs, I see one other file with the same story -- used to be in DBS, but not anymore.

Nick