Open yuyiguo opened 5 years ago
@yuyiguo I don't see listFileParentsByLumi
being called anywhere in the WMCore code.
I also scanned all the DBSGlobalReader-* logs for one of the cmsweb backends and there was no match.
I think it's safe to deprecate/disable this API.
Thanks a lot Alan for confirming the API is not used anymore. I will disable it. Yuyi
From: Alan Malta Rodrigues notifications@github.com Reply-To: dmwm/DBS reply@reply.github.com Date: Monday, July 1, 2019 at 8:50 AM To: dmwm/DBS DBS@noreply.github.com Cc: Yuyi Guo yuyi@fnal.gov, Mention mention@noreply.github.com Subject: Re: [dmwm/DBS] Do we still need prentsbyLumi API? (#613)
@yuyiguohttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_yuyiguo&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=8bursUuc0V63OwREQMBG2Q&m=f0swq7Z3fIxL4L7-qONF1GyxeslqHMlcZf7EjNitD3A&s=swWXuAqHcmDMVSaY3BfDItL3wkFsvjr1JE7-Hb1vNRs&e= I don't see listFileParentsByLumi being called anywhere in the WMCore code. I also scanned all the DBSGlobalReader-* logs for one of the cmsweb backends and there was no match.
I think it's safe to deprecate/disable this API.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dmwm_DBS_issues_613-3Femail-5Fsource-3Dnotifications-26email-5Ftoken-3DAANROTRILHPYZ7HLG4YYW3LP5IDTDA5CNFSM4H3DCQYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODY6F6PA-23issuecomment-2D507273020&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=8bursUuc0V63OwREQMBG2Q&m=f0swq7Z3fIxL4L7-qONF1GyxeslqHMlcZf7EjNitD3A&s=YHaQHibLj6Zcq5H7H1OLQK8QgvcsKl0yt0GCGGyZehI&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AANROTXOS36E6RSW3QMTXBDP5IDTDANCNFSM4H3DCQYA&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=8bursUuc0V63OwREQMBG2Q&m=f0swq7Z3fIxL4L7-qONF1GyxeslqHMlcZf7EjNitD3A&s=BJ8VontAxlr2g9-hk0d-doix65GyGpmHVvITIXeyPHM&e=.
Hi Yuyi,
let me get back to this issue. As we have recently noticed, we do rely heavily on this listFileParentsByLumi
API, which is used to update the parentage information for output datasets of StepChain workflows.
Indeed it times out for quite some blocks that we're trying to process, see WMCore issue: https://github.com/dmwm/WMCore/issues/9537
while trying to provide a slice of LFNs (in order to workaround the frontends timeout), I found the following server bug (reported as a client bug):
>>> dbsApi = DbsApi(url = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader/')
>>> block
'/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/RunIISummer19UL18HLT-102X_upgrade2018_realistic_v15-v2/GEN-SIM-RAW#1bb50416-6f34-40ce-b8e4-fb301d3853f3'
>>> listFiles[:5]
['/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/2E5C12EF-5BEA-6A4C-83E1-6CE10B9960CC.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/6D055153-A777-4C46-88EB-70B6873BDC72.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/9570805C-4EB2-D74C-BD25-81FB342FA1C6.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/0BC212B6-8A2C-2F4E-980F-429BD1AC56FE.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/5096D189-A9B5-9B4C-9D7F-158B655C9B5F.root']
>>>
>>> res = dbsApi.listFileParentsByLumi(block_name=block, logical_file_name=listFiles[:5])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 484, in listFileParentsByLumi
return self.__callServer("fileparentsbylumi", data=kwargs, callmethod='POST')
File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 201, in __callServer
self.__parseForException(http_error)
File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 228, in __parseForException
raise HTTPError(http_error.url, data['exception'], data['message'], http_error.header, http_error.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 500: (DatabaseError) ORA-00928: missing SELECT keyword
Do we have another option to try to make it working? The block above has 231 files, and that's already too much to fit within 5min.
The problem is likely somewhere in this DAO: https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/dao/Oracle/FileParent/ListFileParentageByLumi.py
but I'm having a hard time to read it, and I can't access the DBS oracle account, so I can't even play with it.
I am working on it. Yuyi
From: Alan Malta Rodrigues notifications@github.com Reply-To: dmwm/DBS reply@reply.github.com Date: Monday, February 10, 2020 at 8:20 AM To: dmwm/DBS DBS@noreply.github.com Cc: Yuyi Guo yuyi@fnal.gov, Mention mention@noreply.github.com Subject: Re: [dmwm/DBS] Do we still need prentsbyLumi API? (#613)
The problem is likely somewhere in this DAO: https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/dao/Oracle/FileParent/ListFileParentageByLumi.pyhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dmwm_DBS_blob_master_Server_Python_src_dbs_dao_Oracle_FileParent_ListFileParentageByLumi.py&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=8bursUuc0V63OwREQMBG2Q&m=kmVkMC-mgK0n2q6s8OjCKETjiO8XFtRBJBkXrbG76g8&s=1c2Kl7corAS8v9cRJE4YMynLv0d72esnfSfJCm_EPYA&e=
but I'm having a hard time to read it, and I can't access the DBS oracle account, so I can't even play with it.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dmwm_DBS_issues_613-3Femail-5Fsource-3Dnotifications-26email-5Ftoken-3DAANROTVEKLICOUFGUFYIYKLRCFPBNA5CNFSM4H3DCQYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELIVNSI-23issuecomment-2D584144585&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=8bursUuc0V63OwREQMBG2Q&m=kmVkMC-mgK0n2q6s8OjCKETjiO8XFtRBJBkXrbG76g8&s=KHwReZGpuFGL9y2XkjOZR1PU5yxkuYMFLh5wy2F9vsY&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AANROTWYP5JFJGHPVRLDDHTRCFPBNANCNFSM4H3DCQYA&d=DwMCaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=8bursUuc0V63OwREQMBG2Q&m=kmVkMC-mgK0n2q6s8OjCKETjiO8XFtRBJBkXrbG76g8&s=lbTcGqTB98w6hILhPWP8iXClk8ILOSnjZiLpJ8FR6JM&e=.
Alan,
It is a lot of message here. In simple words, I have no solution to get that work in 5 minutes. I tried the example blocks and list of files in this ticket with bare sql statements and no matter what I did, it just could not finish in 5 minutes. We have big datasets now and it could not finish in 5 minutes. Using LFN list will not reduce the searching time. I think the reason was because the LFNs were only limit to the children, but the parents were for the entire datasets. The children already much smaller than the parents because we use block to limit them.
Details:
The original sql:
`
WITH
TOKEN_GENERATOR AS (
SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token
FROM DUAL
CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1
), parents as (
select run_num as R, Lumi_section_num as L, file_id as pid from cms_dbs3_prod_global_owner.file_lumis fl
where fl.file_id in
(select file_id from cms_dbs3_prod_global_owner.files f
where F.DATASET_ID in (select parent_dataset_id from cms_dbs3_prod_global_owner.dataset_parents dp
inner join cms_dbs3_prod_global_owner.datasets d on d.dataset_id=DP.THIS_DATASET_ID
where d.dataset = :child_ds_name ))),
children as
(
select run_num as R, Lumi_section_num as L, file_id as cid from cms_dbs3_prod_global_owner.file_lumis fl
where fl.file_id in (select file_id from cms_dbs3_prod_global_owner.files f
inner join cms_dbs3_prod_global_owner.blocks b on f.block_id = b.block_id
where b.block_name = :child_block_name and f.logical_file_name in (SELECT TOKEN FROM TOKEN_GENERATOR) )) )
select distinct cid, pid from children c inner join parents p on c.R = p.R and c.L = p.L
`
There was bug there in the third sub query "children as". Oracle does not allow any sub query use the other sub queries. That is why we got error " (DatabaseError) ORA-00928: missing SELECT keyword".
So what I did was just remove the file list input and search for the entire block. See below query: ` WITH TOKEN_GENERATOR AS ( SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token FROM DUAL CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1
), parents as (
select run_num as R, Lumi_section_num as L, file_id as pid from cms_dbs3_prod_global_owner.file_lumis fl
where fl.file_id in
(select file_id from cms_dbs3_prod_global_owner.files f
where F.DATASET_ID in (select parent_dataset_id from cms_dbs3_prod_global_owner.dataset_parents dp
inner join cms_dbs3_prod_global_owner.datasets d on d.dataset_id=DP.THIS_DATASET_ID
where d.dataset = :child_ds_name ))),
children as
(
select run_num as R, Lumi_section_num as L, file_id as cid from cms_dbs3_prod_global_owner.file_lumis fl
where fl.file_id in (select file_id from cms_dbs3_prod_global_owner.files f
inner join cms_dbs3_prod_global_owner.blocks b on f.block_id = b.block_id
where b.block_name = :child_block_name ))
select distinct cid, pid from children c inner join parents p on c.R = p.R and c.L = p.L
;
` above query took 5:18 minutes.
Then I added the file list back to the query as below, instead of use subquery, with with clause, I put the query directly there. ` WITH parents as (
select run_num as R, Lumi_section_num as L, file_id as pid from cms_dbs3_prod_global_owner.file_lumis fl
where fl.file_id in
(select file_id from cms_dbs3_prod_global_owner.files f
where F.DATASET_ID in (select parent_dataset_id from cms_dbs3_prod_global_owner.dataset_parents dp
inner join cms_dbs3_prod_global_owner.datasets d on d.dataset_id=DP.THIS_DATASET_ID
where d.dataset = :child_ds_name ))),
children as
(
select run_num as R, Lumi_section_num as L, file_id as cid from cms_dbs3_prod_global_owner.file_lumis fl
where fl.file_id in (select file_id from cms_dbs3_prod_global_owner.files f
inner join cms_dbs3_prod_global_owner.blocks b on f.block_id = b.block_id
where b.block_name = :child_block_name and F.LOGICAL_FILE_NAME in (SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token
FROM DUAL
CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1)) )
select distinct cid, pid from children c inner join parents p on c.R = p.R and c.L = p.L
; ` This query took 6:43 minutes. As we can see limit with LFns did not help.
I see! Thank you for looking into it, Yuyi. We will have then to redesign the whole StepChain parentage, in addition to a possible workaround for the upcoming days/weeks (like building the parentage on the client side, and posting the parentage relationship directly to an API). Is there such DBS write API?
Alan, I am not sure I understand you question. Are you asking how should we handle the missing parentage data already in DBS? I am still thinking how to fix that.
Alan,
In order to handle the data already generated in DBS. I am think that we may break listFileParentsbyLumi(bock_name) API into two APIs.
listParentsFileLumis(child_ds_name), this is the compound query's first part. This find the lumi number and file id of the parent dataset.
listFileLumis(child_block_name), this is the compound query's second part. This find the lumi number and file id of the child block.
Once we have both data. We can find the unique match of 1 and 2 in python.
If you think this is something you guys want to approach, I will test if we can get all the lumi numbers for a big dataset in 5 minutes.
Yuyi
I really don't know if proposed solution will sustain since it seems to me that sooner or later you'll again hit the limit of 5 min. This I consider as a temporary fix but it does not solve the problem. If DB can't handle the load with parentage they way DMWM queries it we can't fix it in APIs, we either need to do something on DB level to speed up those queries, e.g. run procedure function to generate this info in background, re-factor DB to handle parentage use-case, or move/use other solution (HDFS) to get parentage.
But before that I rather prefer to see full description WHY do we need to support this use-case. From ticket description it is unclear why DMWM needs this, apart that it is heavily rely on this.
Yuyi,
I'm not sure I followed your suggestion. Are you saying that we could:
And fix the parentage relationship on the application side, then injecting a list of parent file ids for a given child file id? If this is what you're saying, then we should likely use block level operations to avoid eating the whole memory when parsing it.
Valentin, the problem has been reported here: https://github.com/dmwm/WMCore/issues/8269
and this is how we decided to solve StepChain parentage handling; because things happen asynchronously, meaning that we could insert into DBS data for a NANOAODSIM dataset, while the AODSIM hasn't even been merged yet.
I don't discard having the posssibility to fix it withing WMAgent, but I'm pretty sure that will be a substantial change and it can't be done within a few weeks.
Valentin, Alan,
What I proposed was that a temperate solution to fix the existing data that already in DBS.
The questions are:
When I proposed that solution, I thought that
Maybe we should discuss more on the problem before offering any fixing. Thanks, Yuyi
Yuyi, it actually isn't/wasn't in our plans to modify how the StepChain parentage works because I wasn't aware of such limitations. I can try to find out the number of blocks without parentage information, but it's likely around hundreds of blocks (hopefully not a few thousand).
I think the question back then when this issue was being discussed was: how much work would it be to get such problem fixed in WMCore and in DBS. Which is still a valid question. As I said before, we can certainly have a fix for it in WMAgent, likely not a clean solution, and likely something that will take us weeks of development; not impossible though.
Alan, If I understood you correctly, WMAgent/stepChain will always send broken/partial data to DBS? In other words, we are looking for solutions to fix the data in DBS instead of fixing the source of the problem.
WMAgents have their own databases, Can the partial data wait in its local db? WMAgent insert NANOAODSIM dataset into its local database while waiting for AODSIM is merged. Then upload the completed dataset/block into DBS?
Your understanding is correct, Yuyi. The merging step is asynchronous, and so is data injection against phedex/DBS. So yes, StepChain output will likely be always broken.
Yes, we can definetely implement what we need in the agents. However, I'd like first to explore a DBS/database-side option, if we still can find one.
FYI @todor-ivanov
Alan, How you fix the data in DBS now? Do you have a cron job running in some frequency to fix it?
We are going to partition DBS files and file_lumis tables. I think the queries against these two tables will be improved after the partition. But the partition will take some time. How long can you wait? Kate is out of office this week. I will discuss with her the partition schedule next week.
We have a cherrypy thread running every 3h: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py
even though it looks like the current cycle is running since the cmsweb production upgrade (it's has been implemented sequentially). Do you think it could hurt DBS too badly if we fix the parentage with X concurrent requests against DBS (like 10 blocks concurrently)?
I believe it would be okay to wait for a few weeks, but we might be unlucky and have users looking into those samples and their parents, which will increase the priority to get it fixed.
I'm also about to leave on vacation (day after tomorrow).
Alan, if single query takes 5 min or more, then your concurrent access will be worth since query will use transactions and all concurrent queries will block DBS.
Even though a temp solution may be put in place you need to fix the root of the problem. From what I read and understood in this ticket the proper solution should be done in wmcore and not in DBS code.
Sent from ProtonMail mobile
-------- Original Message -------- On Feb 11, 2020, 11:01 PM, Alan Malta Rodrigues wrote:
We have a cherrypy thread running every 3h: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py
even though it looks like the current cycle is running since the cmsweb production upgrade (it's has been implemented sequentially). Do you think it could hurt DBS too badly if we fix the parentage with X concurrent requests against DBS (like 10 blocks concurrently)?
I believe it would be okay to wait for a few weeks, but we might be unlucky and have users looking into those samples and their parents, which will increase the priority to get it fixed.
I'm also about to leave on vacation (day after tomorrow).
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.
I had the impression that select/join statements wouldn't block tables; we are also not updating anything within the same transaction, so it might not hurt the database performance. Yuyi, can you please confirm that?
But if select requires full table scan it means that N queries will stale. Anyway, it's better that Yuyi confirm.
Sent from ProtonMail mobile
-------- Original Message -------- On Feb 12, 2020, 8:22 AM, Alan Malta Rodrigues wrote:
I had the impression that select/join statements wouldn't block tables; we are also not updating anything within the same transaction, so it might not hurt the database performance. Yuyi, can you please confirm that?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.
@amaltaro The listFileParentsbyLumi(bock_name) API was requested by Seangchan while he had to deal with a bug in the agent that made some of files don't have parents. We had an unit test for the API. However, this API most time cannot finished in 5 minutes due to the blocks and database are much bigger than the time we created the API. This API was only created for Seancheng to do the recovery. If no more recovery to do. I 'd like to disable it.