haozturk commented 1 year ago

Impact of the bug Workflow announcement

Describe the bug We see lots of successful workflows w/ missing output on DBS. I'm not sure whether this is due to delay or failure in DBS injection or a problem in job failure accounting.

How to reproduce it Here are some affected workflows:

If you need a complete list, please let me know. I can provide it.

Expected behavior If there is no job failure, we expect to see 100% output on DBS.

Additional context and error message None

amaltaro commented 1 year ago

@vkuznet Valentin, can you please investigate this issue?

A short summary of steps to debug it is: 1) Hasan provided the workflow PrepId, so we need to open one of those links to find out the precise workflow name. If we do so for the first link, here is the workflow name: pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960 2) then open WMStats and filter the output for the workflow above. Click in the "all requests" button to bring you a table of workflow(s) 3) next, click on the first column D (grey/red box). It gives you a short summary of the workflow, and the information we are looking for is to know which agents worked on this workflow. For the workflow above, this is what I see agent url: cmsgwms-submit6.fnal.gov i) this same summary information provides us with a list of the output datasets. In this case, here it is: /LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM 4) now that we know the agent that processed this workflow and which output(s) is expected, you can ssh to submit6 and debug it. 5) I suggest to start looking at the DBS3Upload ComponentLog and see if we can find blocks failing to be inserted 6) another check is to verify the state of such blocks in the local database, mainly looking at the dbsbuffer_block table. 7) of course, make sure DBS3Upload is up and running.

I can help you once you get to this, but this is a brief set of instructions to get started with this and it might be enough to find the problem.

vkuznet commented 1 year ago

I can start looking into this issue next week as I'm busy today and early next week with doctor appointments. Meanwhile, it would be useful if @amaltaro will provide which exact URL calls DBS3Upload do and does it log these calls to DBS. From what I read I see two possibilities here:

we either have failed call to DBS from DBS3Upload, or
block already exists and something internally goes wrong in DBS3Upload

The problem will be much easier to debug if for given workflow we'll identify the URL call DBS3upload makes to DBS and see its parameters, i.e. if we can know file/block which is supplied via HTTP call we can easily check if it is an issue with DBS or not.

amaltaro commented 1 year ago

@vkuznet DBS3Upload uses only 1 API, which is insertBulkBlock, see: https://github.com/dmwm/WMCore/blob/2.1.2.patch2/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L92

you can also see what is logged around this code and check the logs of the relevant agent (once you figure which agent it is, please see instructions above).

As you know, the block information can be extremely large, so we definitely do not log and/or dump it anywhere. There is a way to dump it though, but I'd suggest to take the first debugging steps without the whole information that is passed to the server (other than the block name).

Given that you decided to be more verbose and explicit on the exception hit in the server, it might be that an analysis of the error message is enough to figure what the problem is.

todor-ivanov commented 1 year ago

Hi @vkuznet @amaltaro I'll take a look at this as well, as we discussed during the meeting yesterday.

vkuznet commented 1 year ago

From submit6 agent ComponentLog:

2022-09-20 20:32:57,233:139722709800704:ERROR:DBSUploadPoller:Error trying to process block /NMSSM_XToYHTo2G2WTo2G2L2Nu_MX-2600_MY-60_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM#ae6aeb11-6f20-42e6-9e7f-5ac952f3f118 through DBS. Error: DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set
Traceback (most recent call last):
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.2.patch1/lib/python3.8/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 92, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.10/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 630, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.10/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 464, in __callServer
    self.__parseForException(http_error)
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.10/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 508, in __parseForException
    raise http_error
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.10/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 461, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp2/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/data/srv/wmagent/v2.1.2.patch1/sw/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp2/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
    raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: Bad Request

What it actually means is that provided JSON has insufficient information to insert a specific info. The insertBulkBlock API expect a JSON. This JSON contains dataset/block/file config information, release information, etc. The error we see Error: sql: no rows in result set suggests that some aux info is not found in DBS database, e.g. dataset/block configuration and without it present in DB insert bulk block will fail.

In order to answer what is missing we need an actual JSON of a specific failed block. My question is how to get this JSON?

The recent error I see from ComponentLog is dated by

2022-11-08 08:54:24,174:139796326827776:ERROR:DBSUploadPoller:Error found in multiprocess during process of block /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2

but I do not see block with ID 397d4e07-e918-4c6d-ab5b-b074f1b83ee2 anywhere in dbs2go-global-w logs on that date.

SO, there are several questions here:

how to get JSON of the failed block
which DBS instances DBS3Upload is using that we can try to see the request in DBS logs

vkuznet commented 1 year ago

From the DBS logs (from cmsweb-prod of today), I see a few datasets which failed to be injected via bulkblocks API and all of them points to the same issue in a code:

/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM
/NMSSM_XToYHTo2B2WTo2B2L2Nu_MX-300_MY-60_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM
/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM

The codeline which fails to insert them is here: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L917 and it tells me that the supplied JSON points contains the parent which is not found in DBS database, please see entire block logic: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L911-L926

Therefore, if it is the same issue I bet that whoever who construct the JSON input did not check that parents of these datasets/blocks are present in DBS. Again, without actual JSON it is hard to pin-point further which parents are not present. But I think we have a clear clue of what is going on with injections.

I still feel weak and can't spend more time on this, but I hope @todor-ivanov can dig more deeply into this issue.

For the records DBS logs can be found on vocms0750 and today's logs are

/cephfs/product/dbs-logs/dbs2go-global-w-684697bc8b-4crnd.log-20221108
/cephfs/product/dbs-logs/dbs2go-global-w-684697bc8b-wlnhc.log-20221108
/cephfs/product/dbs-logs/dbs2go-global-w-684697bc8b-x25rj.log-20221108

amaltaro commented 1 year ago

@vkuznet Valentin, DBS3Upload component provides a way to dump the block information in a file (json), but that required further work to become useful. I created this PR: https://github.com/dmwm/WMCore/pull/11361

to change how we decide to dump the block data for a given block name, such that now we can dump information for a single block and inspect what exactly gets posted to DBSServer.

I have already patched submit6 with that patch, changed the component configuration with the block name you mentioned above (pasted here as well):

/NMSSM_XToYHTo2G2WTo2G2L2Nu_MX-2600_MY-60_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM#ae6aeb11-6f20-42e6-9e7f-5ac952f3f118

and restarted DBS3Upload. Once you have the data you need, please revert that config back to an empty string and restart the component.

vkuznet commented 1 year ago

Alan, thanks for providing this but so far I do not see anything on submit6 agent. If I read you changes correctly and reading config, I see that the data should go to /storage/local/data1/cmsdataops/srv/wmagent/v2.1.2.patch1/install/wmagentpy3/DBS3Upload, right? So far this area does not have dbsuploader_block.json (this file name I read from PR). It can be either agent did not yet retried the given block or may be anything else. I think I still need you to look at this agent to see if everything is properly set.

amaltaro commented 1 year ago

Yes, that agent seems to have an unusually high number of open blocks (or not, vocms0253 has 15919 blocks!!!):

2022-11-11 03:54:08,603:139930855028480:INFO:DBSUploadPoller:Starting the DBSUpload Polling Cycle
2022-11-11 03:54:15,012:139930855028480:INFO:DBSUploadPoller:Found 9356 open blocks.

that's why it's taking a while to associate files to their respective blocks. The first cycle is usually longer than the subsequent ones.

You are correct, that's the correct place where you should see the json file.

amaltaro commented 1 year ago

@vkuznet I just figured why the json dump never showed up for the block above, and the reason is that that block went through the system long ago (it had only a transient failure), sorry about that.

It looks like we have another block that has been failing over the last few cycles of DBS3Upload. So I updated the component config to get a dump for:

config.DBS3Upload.dumpBlockJsonFor = "/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2"

vkuznet commented 1 year ago

@amaltaro , I got the JSON file for /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2 and upon JSON inspection as I was expecting it has a problem with the following section:

  "dataset_parent_list": [
    "/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM"
  ]

I just checked DBS global database directly using and it has no such datasets registered, see

select * from cms_dbs3_prod_global_owner.datasets where dataset='/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM';

no rows selected

So, the problem, at least with this JSON and this block, is lack of parent dataset in DBS. Since I do not know how this JSON is constructed I do not have any further comments, but I suggest to figure it out and see why we create a JSON (block) with parent info which is not present in DBS DB.

amaltaro commented 1 year ago

Thank you for pin-pointing this issue, Valentin.

Based on this information, I investigated it a little further and it looks like we lost the ability of spotting such problems in DBS3Upload with the migration of DBS to the Golang implementation, as expected in these 2 lines of code: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L102-L103

Now on why and how it happened, we can look this up in WMStats by:

providing either the MINIAOD or NANOAOD dataset name in the WMStats GUI "output dataset" filter box
which brings us to this workflow: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_SMP-RunIISummer20UL16MiniAODv2-00275__v1_T_221103_122546_3035
which has this request/spec type: "RequestType": "StepChain"

as such, StepChain workflows produce the unmerged files at one point and the merge is triggered by dataset + output module + datatier, asynchronously (actually they depend on many other constraints like size, events, etc).

So it's indeed possible to have a dataset/block to be injected into DBS server without having its parent data.

I don't know how to properly resolve it, given that an ACDC workflow can be executed to recover the MINIAOD data, thus eventually succeeding the insertion of this NANOAOD block into DBS server. At the moment though, I can think of two actions that would need to be taken:

@haozturk to investigate why ACDC workflows are not executed. There are job failures up in WMStats, so ideally we should run ACDC(s) and recover those lumis.
regarding DBS server, is there any way we can instrument the server such that we can clearly see that the parent information is missing, hence the current operation cannot be completed?

vkuznet commented 1 year ago

@amaltaro , this seems to be identical issue as I reported over here: https://github.com/dmwm/WMCore/issues/11106 In other words, the order to inject data into DBS matters since dataset/blocks/files info depends on other information such as configs, parentage, etc. And, resolution requires careful inspection of data injection procedure in WMCore.

Said that, indeed, these lines https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L102-L103 rely on DBS python exception logic rather checking actual DBS server error. Therefore, our effort of properly handling DBSErrors https://github.com/dmwm/WMCore/issues/10962, https://github.com/dmwm/WMCore/pull/11173/ also matter here.

What currently can be done is the following:

we need to patch DBSUploadPoller to include an additional block to check for DBSError in exception message.

if you need precise knowledge when parentage fails, we may scan returned exception (JSON) for the line unable to find dataset_id for message, here is an example from dbs2go log:

DBSError Code:109 Description:DBS DB ID error for provided entity, e.g. there is no record in DB for provided value Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:110b358e8d7a9864961e2b9f1ba069f7c314fd4e76863206b783bde269d84658 unable to find dataset_id for /NMSSM_XToYHTo2B2WTo2B2Q1L1Nu_MX-1600_MY-700_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM, error DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set Stacktrace`

you may decide to raise priority for DBSError issue and incorporate these changes into WMA
you may add an additional check to DBSUploadPoller to query DBS for dataset parents, in fact it is much better solution since DBSUploadPoller have JSON, it can extract dataset parents and place queries to DBS to see if such datasets exists. This step can be added in addition to aforementioned exception parsing one

I think the most reliable solution should be adding into DBSUploadPoller a code which will query DBS for dataset parents. The reason is simple, it is the most explicit action. The code should not rely on matching specific exception string from DBS server as server implementation can change and exception message too. Instead, the poller should explicitly check if data is allowed to be inserted. For that it should check the following:

check dataset parents, via /dataset?dataset=<dataset_name> DBS API
check dataset, file configuration, via /outputconfigs? (and provide either dataset or logical_file_name as input parameter) to ensure that output configuration is injected in DBS too. After that DBSUploadPolller can inject block info. This will eliminate possibility of data racing reported here #11106

amaltaro commented 1 year ago

We can implement it in a "permission vs forgiveness" model, up to now we have adopted the forgiveness one (deal with exceptions). The reason for this is:

before we inject a block, we need to ensure that the dataset exists in DBS
if the block has a parent dataset/block, we also need to check whether the parent dataset exists in DBS
the polling cycle of DBS3Upload is 3min, so we would have to repeat all of it every cycle, for every block. Luckily, we are clever enough to:
- group dataset and parent dataset
- use some sort of in-memory cache to avoid unneeded RESTful calls; or
- apply a database schema change such that we have a flat specifying whether a dataset has been injected or not in DBS.

We could try to look into this, but this becomes a reasonably more expensive development. A similar model is used in RucioInjector btw, and with the problems that we have been getting there, IMO the best and the most robust way forward would be to add database schema changes and properly store the state of datasets injections (both in DBS and Rucio).

vkuznet commented 1 year ago

@amaltaro , everything really depends on implementation. If code will use sequential mode, i.e. make DBS calls for configs, dataset, blocks, etc. in a sequence, of course it will be slow. But if code will implement making this calls in parallel then at most you will add 1-2 sec to the over time, not more. The checks which I suggested to do before data inject are very simple DBS queries since the query contains specific value, e.g. /datasets?dataset=/a/b/c is very fast query, similar /outputcofigs?dataset=/a/b/c. So if you'll make N calls in parallel, the overall overhead is the longest query which I doubt should be more than 1sec. Also, in this scenario there is no need for cache since cache will happen on ORACLE side which will prepare index and if someone will request the same query over and over again it will be originated from ORACLE cache. I certainly do not know Rucio implementation but I doubt they do parallel calls either. Since our pycurl_manager support that what is required is to prepare series of URLs and get them together. You should start taking advantage of fast Go server and apply concurrency model more often in WMCore codebase.

amaltaro commented 1 year ago

What I meant with more expensive development was that we go away from a simple block dump provided to the server towards much more control and checks on the client side:

thus, creating more HTTP requests (I am very much adept of NOT making them if not really needed). Relying on server cache is a bad practice in my opinion as it generates more network traffic, more frontend and backend threads, more backend resources, etc
creating more checks in the component
when running concurrent http calls, we need to have proper error handling and maybe some retry logic
more lines of code, thus eventually more maintenance and complexity in debugging (and places to have a bug).

If we take this path, then I think the best would be to make this stateful (dependent on the local database information).

vkuznet commented 1 year ago

if you serious about eliminating data race conditions then it is the only way, i.e. client should check everything before injection. The implementation details can vary though. If you need cache you can use it but it implies that you need to maintain it, and use resources for it. Making HTTP calls is cheap in my view but of course it has its own drawback (everything depends on data volume, rate, etc.). Retry or not again is up to a client. The bottom line client should perform all checks before trying to inject the new data. If this is implemented then we will eliminate data racing condition issue which is present right now and issue with failures due to not existing data in DB (this one). The only failure may happen in this scenario are due to database and/or DBS server issue, but in this case we'll have strong separation of responsibilities between the client (which performed all necessary checks) and server which is provided with fully valid input and should insert the data.

amaltaro commented 1 year ago

Yes, given that we found this to actually be a problem with the workflow itself, in the sense that some merge jobs failed and it caused one of the parent datasets not to be created (now pending ACDC creation and execution), I would suggest to close this one out (once we hear back from Hasan). At some point, that/those blocks should succeed once the ACDC goes through and (parent) data gets properly produced.

On what concerns the long term and major refactoring of DBS3Upload, I would suggest to update the GH issue https://github.com/dmwm/WMCore/issues/11106 with some of the comments made in this issue (my preference at this very moment is to include a database schema change) and plan it for the future quarters.

vkuznet commented 1 year ago

fine with me.

haozturk commented 1 year ago

Hi all, apologies for late reply. We check the following endpoint [1] which I think should include the recovery info that ACDC will use. But it's empty for these workflows. That's why we didn't ACDC them

[1] https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22pdmvserv_task_EXO-RunIISummer20UL17NanoAODv9-02666__v1_T_221003_184201_4960%22&include_docs=true&reduce=false

amaltaro commented 1 year ago

@haozturk I think you looked into the wrong workflow. A few replies above, I investigated the following workflow https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22cmsunified_task_SMP-RunIISummer20UL16MiniAODv2-00275__v1_T_221103_122546_3035%22&include_docs=true&reduce=false

where we can actually find ACDC entries for the merge task (among others).

haozturk commented 1 year ago

@amaltaro himm, that's not one of the workflows for which we opened this issue. It's not a "Successful (no job failure) workflow w/ missing output on DBS". If you check the workflows that I provided in the issue body, they don't have ACDC entries.

Regarding that workflow in particular, it has a filemismatch problem where an output file [1] is missing on DBS while it's on Rucio. We don't create ACDCs before fixing the mismatch. Do you think these issues are related?

I haven't read the whole thread. I might be missing some context.

[1] /store/mc/RunIISummer20UL16NanoAODv9/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v2/60000/E9D7E442-299E-FC4B-BA82-7D87E3A82B7D.root

amaltaro commented 1 year ago

@haozturk it looks like we deviated from the original problem reported here and we should concentrate our investigation solely to the workflows that you listed in the initial description. @vkuznet (or @todor-ivanov) please let me know if you need any input to proceed with this debugging.

Regarding that workflow in particular, it has a filemismatch problem where an output file [1] is missing on DBS while it's on Rucio. We don't create ACDCs before fixing the mismatch. Do you think these issues are related?

Definitely! Rucio does not check whether the parent dataset exist or not, so we succeed injecting container/blocks/replicas into Rucio, while it keeps failing to get injected into DBS. Once we run an ACDC and the jobs succeed, this block that is continuously failing to be injected into DBS will manage to go through.

vkuznet commented 1 year ago

@amaltaro , I have 3 appointments tomorrow and may have no time to look at this, and I'm taking Fri/Mon off. Said that, we only need JSON for the original workflow/block. Once we obtain JSON it is easy to check with DBS what is missing. If you or @todor-ivanov can find which block and put this block into config on WMA to produce JSON then we can proceed.

vkuznet commented 1 year ago

@haozturk , @amaltaro , I came back to original request about /LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM dataset and checked submit6 WMA FNAL agent. Turns out, that errors come from blocks which already inserted. For example:

# DBSError from ComponentLog
2022-11-11 23:09:21,794:139930855028480:ERROR:DBSUploadPoller:Error trying to process block /NMSSM_XToYHTo2B2WTo2B4Q_MX-3000_MY-600_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM#7272f333-3ca5-4bc3-bfa6-e8ea81bea707 through DBS. Error: DBSError Code:110 Description:DBS DB insert record error Function:dbs.dataset_output_configs.Insert Message: Error: ORA-00001: unique constraint (CMS_DBS3_PROD_GLOBAL_OWNER.TUC_DC_1) violated

It clear shows that unique constrain is violated, and indeed, when I checked this block in DBS database I see the following:

SQL> select block_name from cms_dbs3_prod_global_owner.blocks where block_name='/NMSSM_XToYHTo2B2WTo2B4Q_MX-3000_MY-600_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM#7272f333-3ca5-4bc3-bfa6-e8ea81bea707';

BLOCK_NAME
--------------------------------------------------------------------------------
/NMSSM_XToYHTo2B2WTo2B4Q_MX-3000_MY-600_TuneCP5_13TeV-madgraph-pythia8/RunIISumm
er20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM#7272f333-3ca5-4bc3-bfa6
-e8ea81bea707

From ComponentLog I see that we have 536 errors like this, e.g.

grep ERROR install/wmagentpy3/DBS3Upload/ComponentLog  | grep RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1 | grep violated | wc -l
536

and 268 non-violated errors

grep ERROR install/wmagentpy3/DBS3Upload/ComponentLog  | grep RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1 | grep -v violated | wc -l
268

If I look at one of those non-violated errors, it says:

2022-11-11 23:09:21,794:139930855028480:ERROR:DBSUploadPoller:Error found in multiprocess during process of block /NMSSM_XToYHTo2B2WTo2B4Q_MX-3000_MY-600_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM#7272f333-3ca5-4bc3-bfa6-e8ea81bea707

which comes from the fact that this this block is injected with others and it fails.

Bottom line, there is nothing wrong with DBS per-se, the WMA tries to inject block which is already exist. How this happen and why underlying workflow does not check for existing block I do not know. I also do not know how cycle process work. Obviously from this investigation it is clear that WMA workflow only retries without checking that underlying JSON can be injected. We already identified that either dataset parents are not there, or WMA tries to inject the same data over and over again. I do not see anything needs to be done on DBS side here and rather DBSUploadPoller should be adjusted to properly check the data it tries to inject and check errors. For later, I already deployed new version of DBS server code which will provide more precise errors.

Do you need to check any other dataset/block, if so, please provide necessary details.

From my point of view this issue is clear, there is no sufficient checks in DBSUploadPoller to check injected data before the injection and this is the root of the issue. And, most likely it is the same data racing conditions which we discussed in https://github.com/dmwm/WMCore/issues/11106

amaltaro commented 1 year ago

@haozturk , @amaltaro , I came back to original request about /LQToDEle_M-4000_single_TuneCP2_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM dataset and checked submit6 WMA FNAL agent. Turns out, that errors come from blocks which already inserted.

@vkuznet haven't we discussed and changed this behaviour in the past couple of months? I couldn't find that thread, but I am pretty sure you adapted the DBSWriter error to be consistent with the Py-server implementation, thus returning a very specific string error in the response object: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L96

which I can actually find in this baseline code: https://github.com/dmwm/dbs2go/blob/000d8b90c88bdf2faf0128f6f5371671d0762389/dbs/bulkblocks2.go#L670

So, why doesn't the client get that message now?

vkuznet commented 1 year ago

Ahh, I'm sorry, I mis-read the DBS error, in fact it says:

dataset_output_configs.Insert Message: Error: ORA-00001: unique constraint (CMS_DBS3_PROD_GLOBAL_OWNER.TUC_DC_1) violated

So, for given JSON we provide dataset_output_configs and it already exists in DBS, therefore the block injection fails. It is a different error as Block already exists since dataset_output_configs injection happens before we inject the block. But it is the same data racing conditions, i.e. we try to insert a data from JSON which is already present in DBS.

amaltaro commented 1 year ago

So, for given JSON we provide dataset_output_configs and it already exists in DBS, therefore the block injection fails. It is a different error as Block already exists since dataset_output_configs injection happens before we inject the block. But it is the same data racing conditions, i.e. we try to insert a data from JSON which is already present in DBS.

It's still not clear to me. Every time we retry the injection of a block that already exists in the database, the dataset_output_configs should be already there (given that it happens before the block injection itself), hence I would expect that error to be reported instead of a "Block %s already exists". In addition, if this is really the case, I would be surprise to see this error for the very first time only now. Can you please check things again, something doesn't match in your debugging.

vkuznet commented 1 year ago

Alan, frankly I was not aware of internal logic of DBSUploadPoller till resolving this ticket. The DBS bulkblocks API logic starts with inserting necessary stuff before injecting the block, e.g. insert dataset configuration, see https://github.com/dmwm/dbs2go/blob/000d8b90c88bdf2faf0128f6f5371671d0762389/dbs/bulkblocks2.go#L515 Only after all necessary aux info is inserted the API insert the block info. Therefore, with repeated insert pattern this obviously creates an issue with violation of already inserted information such as dataset configuration. Based on new logic I learnt from this ticket (about possible periodic injection of the same JSON (block)) I moved check block as a first part of the API, see https://github.com/dmwm/dbs2go/commit/4b8fa39bad91d103502523568ffae4dee892fdc3. I created new release and will put it on testbed as soon as image is ready. This should resolve the situation when DBSUploadPoller tries to inject the same block over and over, since now its existence will be checked first by bulblocks DBS server API.

vkuznet commented 1 year ago

The new version of DBS server which now checks block existence as a first thing in the API is available on testbed. Please try to insert the same block over and over to check if we'll fix the problem (meanwhile I'll prepare unit test for this as well).

amaltaro commented 1 year ago

Thank you for the prompt fix, Valentin. However, given that block existence is checked in the beginning now, what ensures that all the rest of the data is already properly inserted in the database? I would be very very hard to hit a problem like that, but it's not impossible.

vkuznet commented 1 year ago

In order to block to be inserted all its aux information should be in DBS. So far I can't see how it can be impossible. Please elaborate and provide details how you envision see. Since everything is connected via DB foreign keys so far I do not see any possibility how it can be possible. If block exists in DB it guarantee from DB foreign key constrains that all aux information should be in DB.

amaltaro commented 1 year ago

Valentin, thanks for the clarification during the meeting, confirming that this use case should not happen and checking for the block existence at the beginning of the transaction is fine (given that it relies on the foreign keys to ensure the required data has been inserted).

vkuznet commented 1 year ago

Alan, I wrote unit test and I tested manually to insert the same block over and over, and in my local setup everything is fine, i.e. I'm getting Block bla already exists message. So, I think DBS server is ready, but it would be nice if you'll run injection against testbed and repeat it with the same block.

amaltaro commented 1 year ago

Valentin, to test these changes from the testbed server, we can pick any testbed WMAgent and change their database records such that a block - and its files - is considered again for DBS injection.

I picked vocms0263 and found this block that had already been inserted in DBSServer on Nov 16: /RelValProdMinBias/DMWM_Test-RECOPROD1_TaskChain_ProdMinBias_bugFix_TEST_injectScript_v4-v11/GEN-SIM-RECO#fdd5459d-80c3-46cb-b5ed-66e90af8f7ed

on the local database (this case MariaDB), these are the 2 update statements that I executed:

# marking the block as not injected
MariaDB [wmagent]> UPDATE dbsbuffer_block SET status='Pending' where blockname='/RelValProdMinBias/DMWM_Test-RECOPROD1_TaskChain_ProdMinBias_bugFix_TEST_injectScript_v4-v11/GEN-SIM-RECO#fdd5459d-80c3-46cb-b5ed-66e90af8f7ed' and status='Closed';
Query OK, 1 row affected (0.002 sec)
Rows matched: 1  Changed: 1  Warnings: 0

# marking its files as not injected (single file in this case)
MariaDB [wmagent]> UPDATE dbsbuffer_file SET status='NOTUPLOADED' where block_id=(SELECT id from dbsbuffer_block where blockname='/RelValProdMinBias/DMWM_Test-RECOPROD1_TaskChain_ProdMinBias_bugFix_TEST_injectScript_v4-v11/GEN-SIM-RECO#fdd5459d-80c3-46cb-b5ed-66e90af8f7ed') AND status='InDBS';
Query OK, 1 row affected (0.002 sec)
Rows matched: 1  Changed: 1  Warnings: 0

I restarted DBS3Upload, to make sure it picks this up quickly and then checked the component log, which reports:

2022-11-21 19:19:56,166:140209894156096:INFO:DBSUploadPoller:About to call insert block for: /RelValProdMinBias/DMWM_Test-RECOPROD1_TaskChain_ProdMinBias_bugFix_TEST_injectScript_v4-v11/GEN-SIM-RECO#fdd5459d-80c3-46cb-b5ed-66e90af8f7ed
2022-11-21 19:19:56,226:140209894156096:WARNING:DBSUploadPoller:Block /RelValProdMinBias/DMWM_Test-RECOPROD1_TaskChain_ProdMinBias_bugFix_TEST_injectScript_v4-v11/GEN-SIM-RECO#fdd5459d-80c3-46cb-b5ed-66e90af8f7ed already exists. Marking it as uploaded.

In short, I haven't checked anything else, but on what concerns the error that we are tackling here, the DBSServer changes look good to me.

vkuznet commented 1 year ago

Alan, thanks for details and confirmation. For the record, I have much simpler procedure to test things. Here it is:

# either get or use existing JSON file, e.g.
# https://github.com/dmwm/dbs2go/blob/master/test/data/bulkblocks1.json

# make curl call to your DBS server, I run it on my loclahost, but we can use testbed or any other server
curl -v -X POST -H "Content-Type: application/json" -d@test/data/bulkblocks1.json http://localhost:9898/dbs2go/bulkblocks

that's it. I run it with my local server twice, and second call was producing correct error. I just want to share that you or other will know that testing DBS server can be done using normal curl and plain API.

Said that, I'll prepare new stable production dbs2go image and can deploy it at any time either today or tomorrow. The only changes new server will bring is new error codes and this fix, i.e. here is git log output and on production servers we run version v00.06.39

bc37f62 (HEAD -> master, origin/master, origin/HEAD) Add new integration test to test injection of the same block
4b8fa39 (tag: v00.06.41) Add helper function checkBlockExist; use it as first part of bulblocks API logic
000d8b9 (tag: v00.06.40) Add DoesNotExist error codes which is used by bulkblocks2.go codebase
f9dde71 (tag: v00.06.39) Merge pull request #85 from d-ylee/testRaceConditions

Please let me know if you are ok with DBS server upgrade on production servers.

amaltaro commented 1 year ago

Alan, thanks for details and confirmation. For the record, I have much simpler procedure to test things. Here it is:

It's definitely simpler! And the reason for that is that you are actually not validating anything in WMCore (neither the component nor the dbs3-client)! The component does not have any REST API, so it cannot be tested with the same procedure as done for the DBS server.

Please let me know if you are ok with DBS server upgrade on production servers.

Yes, please go ahead with the production upgrade such that we can monitor things before the long holidays. Please update MM #DMWM channel as well. Thanks

amaltaro commented 1 year ago

@vkuznet Valentin, now that you deployed a new version of the DBS Server in production, can you please have a look at submit6 and see if those errors have cleared out?

If they did, then i think we can call this issue as done and close it out.

vkuznet commented 1 year ago

I checked reported block in question, and last entry in ComponentLog is dated by 2022-11-11

2022-11-11 23:11:38,771:139930855028480:INFO:DBSUploadPoller:About to call insert block for: /NMSSM_XToYHTo2B2WTo2B4Q_MX-3000_MY-600_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL17NanoAODv9-106X_mc2017_realistic_v9-v1/NANOAODSIM#7272f333-3ca5-4bc3-bfa6-e8ea81bea707

Said that, there are different errors in ComponentLog which I would like to clarify, e.g.

2022-11-23 11:48:23,569:140500037646080:ERROR:DBSUploadPoller:Error found in multiprocess during process of block /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2
2022-11-23 11:48:23,569:140500037646080:ERROR:DBSUploadPoller:Error trying to process block /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2 through DBS. Error: DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set

This particular block /DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2 is not in DBS DB (I checked this through sqlplus session). Therefore, we need a JSON dump for this block to identify the root cause of the issue. @amaltaro ,do you mind to change WMA config and add this block to config.DBS3Upload.dumpBlockJsonFor = configuration that we'll get its JSON dump. Once I have new file I can have a look what is wrong with JSON.

amaltaro commented 1 year ago

I just logged in in submit6 and it looks like we are already getting a dump for that block for a week now:

2022-11-23 12:53:11,850:140500037646080:INFO:DBSUploadPoller:Dumping '/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM#397d4e07-e918-4c6d-ab5b-b074f1b83ee2' information into /storage/local/data1/cmsdataops/srv/wmagent/v2.1.2.patch1/install/wmagentpy3/DBS3Upload/dbsuploader_block.json

I had a quick look at it and it seems to be missing the parent dataset as well: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/blocks?dataset=/DYJetsToLL_M-2000to3000_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v2/MINIAODSIM

These output datasets are created by:

    "RequestName": "cmsunified_task_SMP-RunIISummer20UL16MiniAODv2-00275__v1_T_221103_122546_3035",

@haozturk could you please create an ACDC to recover those? Once it succeeds, we should have the MINIAODSIM in the DBS database and the NANOAODSIM blocks will go through as well.

haozturk commented 1 year ago

I submitted ACDC for [1]

[1] https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_SMP-RunIISummer20UL16MiniAODv2-00275

vkuznet commented 1 year ago

@haozturk , what is a status of your ACDC submission, is it completed now, do we have any DBS errors?

amaltaro commented 1 year ago

@vkuznet I just discussed this issue again with the P&R team and they reminded us that this original issue does not involve any ACDC/Resubmission workflows, given that the workflows didn't have any failures.

Could you please look into the workflows reported in the initial description and perform a procedure similar to what I mentioned here: https://github.com/dmwm/WMCore/issues/11358#issuecomment-1303671315

haozturk commented 1 year ago

Hi all,

ACDC seems successful
And yes, I agree w/ Alan. Note that, the number of workflows getting affected by this issue is increasing. You can find the full list here [1]. I couldn't quickly find the recent ones.

https://cms-unified.web.cern.ch/cms-unified//assistance.html#assistance-noRecoveryDoc

vkuznet commented 1 year ago

@haozturk , @amaltaro , I'm not sure I understand the issue anymore. I checked first four workflows from Hasan's link and all of them were completed 100% in wmstats. For instance,

workflow pdmvserv_Run2022E_BTagMu_PromptNanoAODv10_v1_221017_083712_7119
dataset: /BTagMu/Run2022E-PromptNanoAODv10_v1-v3/NANOAOD
wmstats reports, job progress, event progress, lumi progress 100%, failure rate 0.4%

I can see above dataset in DBS using dasgoclient and it has 47 files:

dasgoclient -query="file dataset=/BTagMu/Run2022E-PromptNanoAODv10_v1-v3/NANOAOD" | wc -l
47

and it has 24 blocks:

dasgoclient -query="block  dataset=/BTagMu/Run2022E-PromptNanoAODv10_v1-v3/NANOAOD" | wc -l
24

but in cms-unified page it says 4.16% completed which I have no idea what it refers to

Please make an effort to point to a single workflow which has this issue as I'm new to all the tools you use and I can easily misinterpret something.

So, what's the problem since wmstats reports 100% completed, and dataset is in DBS?

vkuznet commented 1 year ago

With Alan's help here are the results of few workflows. I looked up data in reqmgr2, the corresponding link is provided, then compared DBS info for input and output datasets. For three workflows below results are the same, while for ttbar dataset is different. We concluded with Alan that for those datasets whose info of input and output datasets are the same most likely the failure you see is due to invalidated (and not processed files). But for ttbar I don't have yet clear explanation. May be Alan may have a look.


pdmvserv_Run2022E_HcalNZS_PromptNanoAODv10_v1_221017_083712_5744
https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_Run2022E_HcalNZS_PromptNanoAODv10_v1_221017_083712_5744
  "TotalInputLumis": 63694,
input dataset: /HcalNZS/Run2022E-PromptReco-v1/MINIAOD
# here and below I use variable d as dataset value in das queries
/dasgoclient -query="summary dataset=$d"
[{"file_size":175135954967,"max_ldate":1667469485,"median_cdate":1665012903,"median_ldate":1665012903,"nblocks":245,"nevents":3379000,"nfiles":601,"nlumis":63483,"num_block":245,"num_event":3379000,"num_file":601,"num_lumi":63483}]
output dataset: /HcalNZS/Run2022E-PromptNanoAODv10_v1-v2/NANOAOD
./dasgoclient -query="summary dataset=$d"
[{"file_size":2137289191,"max_ldate":1667790642,"median_cdate":1666311664,"median_ldate":1666311664,"nblocks":29,"nevents":3379000,"nfiles":29,"nlumis":63483,"num_block":29,"num_event":3379000,"num_file":29,"num_lumi":63483}]

nlumis in input equal to nlumis in output datasets

---

pdmvserv_Run2022E_DisplacedJet_PromptNanoAODv10_v1_221017_083709_9356
https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_Run2022E_DisplacedJet_PromptNanoAODv10_v1_221017_083709_9356
  "TotalInputLumis": 25542,
input dataset: /DisplacedJet/Run2022E-PromptReco-v1/MINIAOD
./dasgoclient -query="summary dataset=$d"
[{"file_size":1512863461265,"max_ldate":1667459785,"median_cdate":1665515912,"median_ldate":1665515912,"nblocks":101,"nevents":22433500,"nfiles":626,"nlumis":25332,"num_block":101,"num_event":22433500,"num_file":626,"num_lumi":25332}]
output dataset: /DisplacedJet/Run2022E-PromptNanoAODv10_v1-v2/NANOAOD
./dasgoclient -query="summary dataset=$d"
[{"file_size":24658396399,"max_ldate":1667790642,"median_cdate":1666221061,"median_ldate":1666221061,"nblocks":5,"nevents":22433500,"nfiles":19,"nlumis":25332,"num_block":5,"num_event":22433500,"num_file":19,"num_lumi":25332}]

nlumis in input equal to nlumis in output datasets

---

pdmvserv_task_EXO-RunIISummer20UL18NanoAODv9-00948__v1_T_220227_022251_1319
https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_task_EXO-RunIISummer20UL18NanoAODv9-00948__v1_T_220227_022251_1319
  "TotalInputLumis": 401,
input dataset;
/TTbar01Jets_TypeIHeavyN-Mu_LepSMTop_3L_LO_MN20_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v3/MINIAODSIM
./dasgoclient -query="summary dataset=$d"
[{"file_size":28067626625,"max_ldate":1645891366,"median_cdate":null,"median_ldate":1645891360,"nblocks":16,"nevents":397395,"nfiles":23,"nlumis":401,"num_block":16,"num_event":397395,"num_file":23,"num_lumi":401}]
output dataset: 
/TTbar01Jets_TypeIHeavyN-Mu_LepSMTop_3L_LO_MN20_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18NanoAODv9-106X_upgrade2018_realistic_v16_L1v1-v1/NANOAODSIM
[{"file_size":17263340,"max_ldate":1655538896,"median_cdate":1655538896,"median_ldate":1655538896,"nblocks":1,"nevents":6001,"nfiles":1,"nlumis":6,"num_block":1,"num_event":6001,"num_file":1,"num_lumi":6}]

nlumis in input is not equal to nlumis in output datasets

---

pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368
https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368
  "TotalInputLumis": 25548,

input dataset: /Tau/Run2022E-PromptReco-v1/MINIAOD
./dasgoclient -query="summary dataset=$d"
[{"file_size":1812192685094,"max_ldate":1667477948,"median_cdate":1665483238,"median_ldate":1665483238,"nblocks":105,"nevents":30969955,"nfiles":732,"nlumis":25338,"num_block":105,"num_event":30969955,"num_file":732,"num_lumi":25338}]
output dataset: /Tau/Run2022E-PromptNanoAODv10_v1-v3/NANOAOD
./dasgoclient -query="summary dataset=$d"
[{"file_size":28541610795,"max_ldate":1667790904,"median_cdate":1666219902.5,"median_ldate":1666219902.5,"nblocks":10,"nevents":30969955,"nfiles":28,"nlumis":25338,"num_block":10,"num_event":30969955,"num_file":28,"num_lumi":25338}]

nlumis in input equal to nlumis in output datasets

vkuznet commented 1 year ago

And Here is semi-automated workflow to check input vs output datasets. The steps below assumes that you have curl, dasgoclient and jq tools. I also use alias scrul which simply means curl -L -k --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem

Here is all steps which shows how to extract all relevant information if you know workflow name:

get workflow name, e.g. pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368
get total number of lumis

scurl -s "https://cmsweb.cern.ch/reqmgr2/data/request?name=pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368" | jq '.result[0].pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368.TotalInputLumis'

25548

get input dataset

scurl -s "https://cmsweb.cern.ch/reqmgr2/data/request?name=pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368" | jq '.result[0].pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368.InputDataset'
"/Tau/Run2022E-PromptReco-v1/MINIAOD"

run dasgoclient to get summary info for that dataset


d=/Tau/Run2022E-PromptReco-v1/MINIAOD
./dasgoclient -query="summary dataset=$d"
[{"file_size":1812192685094,"max_ldate":1667477948,"median_cdate":1665483238,"median_ldate":1665483238,"nblocks":105,"nevents":30969955,"nfiles":732,"nlumis":25338,"num_block":105,"num_event":30969955,"num_file":732,"num_lumi":25338}]

or if you want just the number

./dasgoclient -query="summary dataset=$d" | jq '.[0].num_lumi' 25338


- get output datasets:

scurl -s "https://cmsweb.cern.ch/reqmgr2/data/request?name=pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368" | jq '.result[0].pdmvserv_Run2022E_Tau_PromptNanoAODv10_v1_221017_083712_5368.OutputDatasets' [ "/Tau/Run2022E-PromptNanoAODv10_v1-v3/NANOAOD" ]

- run dasgoclient to get summary info for that dataset

d=/Tau/Run2022E-PromptNanoAODv10_v1-v3/NANOAOD ./dasgoclient -query="summary dataset=$d" [{"file_size":28541610795,"max_ldate":1667790904,"median_cdate":1666219902.5,"median_ldate":1666219902.5,"nblocks":10,"nevents":30969955,"nfiles":28,"nlumis":25338,"num_block":10,"num_event":30969955,"num_file":28,"num_lumi":25338}]

or if you want just the number

./dasgoclient -query="summary dataset=$d" | jq '.[0].num_lumi' 25338



This pseudo-automated workflow can be put into services which may accept workflow name and return if input and output datasets have the same or different number of lumis. As Alan explained to me the input can be constrained by different conditions. If someone can outline all possible conditions it will be easy to automate this workflow and provide either a tool or service to perform such checks.

vkuznet commented 1 year ago

Hi, I converted my semi-automatic workflow into a script which you can find here. And, I run it across all assistance-noRecoveryDoc workflows from Hasan's url as following:

since url is HTML I was forced to copy all entries from assistance-noRecoveryDoc
then I only extracted list of workflows and put them into workflows.txt file

finally I run my script against list of workflows as following:

time cat ~/Desktop/workflows2.txt | awk '{print "~/Desktop/check_input_output.sh "$1""}' | /bin/sh > ~/Desktop/input_output.txt

For 83 workflows it took around 5min (of sequential scan), and results look like:

pdmvserv_Run2022E_MuonEG_PromptNanoAODv10_v1_221017_083709_1670 nlumis=27035 input=/MuonEG/Run2022E-PromptReco-v1/MINIAOD inputLumis=26825 output=/MuonEG/Run2022E-PromptNanoAODv10_v1-v3/NANOAOD outputLumis=26825 OK
pdmvserv_Run2022E_HcalNZS_PromptNanoAODv10_v1_221017_083712_5744 nlumis=63694 input=/HcalNZS/Run2022E-PromptReco-v1/MINIAOD inputLumis=63483 output=/HcalNZS/Run2022E-PromptNanoAODv10_v1-v2/NANOAOD outputLumis=63483 OK
...

You can find full output here. I found that out of 83 workflows I scanned 65 are OK in terms of lumis for input and output datasets, and only 18 have different number of lumis.

Since reqmgr2 JSON schema is not known to me I can easily miss how correctly identify input and output dataset. Therefore for those 16 not ok'ish workflows the manual procedure should be applied. Said that, I think my script can be very useful for data-ops to make this checks and I suggest to convert this script into a services which will have single API which can scan single or list of workflows. The process can be parallelized for input list of workflows and I think it can take only few seconds. Let me know if you'll be interested in such service and I can cook up somethings.

amaltaro commented 1 year ago

As we just discussed in the P&R meeting, I checked whether any of the input files have been invalidated for this workflow: pdmvserv_task_EXO-RunIISummer20UL18NanoAODv9-00948__v1_T_220227_022251_1319

and the answer is negative. Here is how you can check it from one of the agents (once you have the InputDataset name):

cmst1@vocms0282:/data/srv/wmagent/current $ source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh

cmst1@vocms0282:/data/srv/wmagent/current $ python3 apps/wmagentpy3/bin/adhoc-scripts/checkDsetFileCount.py /TTbar01Jets_TypeIHeavyN-Mu_LepSMTop_3L_LO_MN20_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v3/MINIAODSIM
2022-12-07 16:51:53,447:INFO:checkDsetFileCount: *** Dataset: /TTbar01Jets_TypeIHeavyN-Mu_LepSMTop_3L_LO_MN20_TuneCP5_13TeV-madgraphMLM-pythia8/RunIISummer20UL18MiniAODv2-106X_upgrade2018_realistic_v16_L1v1-v3/MINIAODSIM
2022-12-07 16:51:53,447:INFO:checkDsetFileCount: Rucio file count : 23
2022-12-07 16:51:53,447:INFO:checkDsetFileCount: DBS file count   : 23
2022-12-07 16:51:53,447:INFO:checkDsetFileCount:  - valid files   : 23
2022-12-07 16:51:53,447:INFO:checkDsetFileCount:  - invalid files : 0
2022-12-07 16:51:53,447:INFO:checkDsetFileCount: Blocks in Rucio but not in DBS: set()
2022-12-07 16:51:53,447:INFO:checkDsetFileCount: Blocks in DBS but not in Rucio: set()

I let you @haozturk and @vkuznet check a couple of other workflows. Please keep in mind that the PromptNanoAOD workflows are expected to have incomplete stats because they have been completed prematurely.

vkuznet commented 1 year ago

@alan, and from another workflow I see invalidated files, e.g.

./wflow-dbs -workflow pdmvserv_Run2017G_LowEGJet_09Aug2019_UL2017_220531_180507_3352
[
   {
      "Workflow": "pdmvserv_Run2017G_LowEGJet_09Aug2019_UL2017_220531_180507_3352",
      "TotalInputLumis": 31372,
      "InputDataset": "/LowEGJet/Run2017G-v1/RAW",
      "OutputDataset": "/LowEGJet/Run2017G-09Aug2019_UL2017-v2/AOD",
      "InputDatasetLumis": 31372,
      "OutputDatasetLumis": 31372
   },
   {
      "Workflow": "pdmvserv_Run2017G_LowEGJet_09Aug2019_UL2017_220531_180507_3352",
      "TotalInputLumis": 31372,
      "InputDataset": "/LowEGJet/Run2017G-v1/RAW",
      "OutputDataset": "/LowEGJet/Run2017G-09Aug2019_UL2017-v2/MINIAOD",
      "InputDatasetLumis": 31372,
      "OutputDatasetLumis": 31372
   }
]

So, we have two datasets to compare, first let's take the input one:

d=/LowEGJet/Run2017G-v1/RAW
scurl -s "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?dataset=$d&validFileOnly=0" | grep logical | wc -l
   23666
scurl -s "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?dataset=$d&validFileOnly=1" | grep logical | wc -l
   23666

and, now we take output dataset

d=/LowEGJet/Run2017G-09Aug2019_UL2017-v2/AOD
scurl -s "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?dataset=$d&validFileOnly=0" | grep logical | wc -l
   12327
scurl -s "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?dataset=$d&validFileOnly=1" | grep_logical | wc -l
   12298

So, my observation is the following:

both input and output dataset have the same number of lumis
output dataset has invalid files, 12327 (total number of files) vs 12298 (valid files), so, 29 files were invalidated
input dataset (RAW) has more files than output dataset, 23666 vs 12298 (valid) vs 12327 (total)

From my observation we do have invalid files in output dataset, and we do have more files in RAW than in AOD. I honestly do not know how nlumis is calculated in DMWM tools and who/how put this info into DBS, but according to what we have in DBS we have invalid files in output dataset.

dmwm / WMCore

Successful (no job failure) workflows w/ missing output on DBS #11358

or if you want just the number

or if you want just the number