dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

Do we still need parentsByLumi API? #613

Open · yuyiguo opened this issue 5 years ago

yuyiguo commented 5 years ago

@amaltaro The listFileParentsByLumi(block_name) API was requested by Seangchan when he had to deal with a bug in the agent that left some files without parents, and we added a unit test for it. However, this API now usually cannot finish within 5 minutes, because the blocks and the database are much bigger than they were when the API was created. Since the API was only created for Seangchan's recovery work, if there is no more recovery to do I'd like to disable it.

amaltaro commented 5 years ago

@yuyiguo I don't see listFileParentsByLumi being called anywhere in the WMCore code. I also scanned all the DBSGlobalReader-* logs for one of the cmsweb backends and there was no match.

I think it's safe to deprecate/disable this API.

yuyiguo commented 5 years ago

Thanks a lot, Alan, for confirming the API is not used anymore. I will disable it. Yuyi


amaltaro commented 4 years ago

Hi Yuyi,

let me get back to this issue. As we have recently noticed, we do rely heavily on this listFileParentsByLumi API, which is used to update the parentage information for output datasets of StepChain workflows.

Indeed, it times out for quite a few of the blocks we're trying to process; see the WMCore issue: https://github.com/dmwm/WMCore/issues/9537

While trying to provide a slice of LFNs (in order to work around the frontend timeout), I found the following server bug (originally reported as a client bug):

>>> dbsApi = DbsApi(url = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader/')
>>> block
'/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/RunIISummer19UL18HLT-102X_upgrade2018_realistic_v15-v2/GEN-SIM-RAW#1bb50416-6f34-40ce-b8e4-fb301d3853f3'
>>> listFiles[:5]
['/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/2E5C12EF-5BEA-6A4C-83E1-6CE10B9960CC.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/6D055153-A777-4C46-88EB-70B6873BDC72.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/9570805C-4EB2-D74C-BD25-81FB342FA1C6.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/0BC212B6-8A2C-2F4E-980F-429BD1AC56FE.root', '/store/mc/RunIISummer19UL18HLT/TTToHadronic_hdampDOWN_TuneCP5_13TeV-powheg-pythia8/GEN-SIM-RAW/102X_upgrade2018_realistic_v15-v2/280000/5096D189-A9B5-9B4C-9D7F-158B655C9B5F.root']
>>> 
>>> res = dbsApi.listFileParentsByLumi(block_name=block, logical_file_name=listFiles[:5])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 484, in listFileParentsByLumi
    return self.__callServer("fileparentsbylumi", data=kwargs, callmethod='POST')
  File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 201, in __callServer
    self.__parseForException(http_error)
  File "/data/srv/wmagent/v1.2.8/sw/slc7_amd64_gcc630/cms/dbs3-client/3.10.0/lib/python2.7/site-packages/dbs/apis/dbsClient.py", line 228, in __parseForException
    raise HTTPError(http_error.url, data['exception'], data['message'], http_error.header, http_error.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 500: (DatabaseError) ORA-00928: missing SELECT keyword
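
For reference, the slicing workaround I'm attempting is roughly the loop below (just a sketch: `chunked` is a hypothetical helper, the slice size of 5 is arbitrary, and it assumes the call returns a list of parentage rows); each slice currently fails with the ORA-00928 error shown above:

```python
# Sketch of the attempted workaround: call the parentage-by-lumi API once per
# small slice of LFNs instead of once for the whole block.
def chunked(seq, size):
    """Yield consecutive slices of seq with at most size elements."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

parentage = []
for lfnSlice in chunked(listFiles, 5):
    parentage.extend(dbsApi.listFileParentsByLumi(block_name=block,
                                                  logical_file_name=lfnSlice))
```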

Do we have another option to try to make it work? The block above has 231 files, and that's already too many to fit within 5 minutes.

amaltaro commented 4 years ago

The problem is likely somewhere in this DAO: https://github.com/dmwm/DBS/blob/master/Server/Python/src/dbs/dao/Oracle/FileParent/ListFileParentageByLumi.py

but I'm having a hard time reading it, and I can't access the DBS Oracle account, so I can't even play with it.

yuyiguo commented 4 years ago

I am working on it. Yuyi


yuyiguo commented 4 years ago

Alan,

There is a lot to say here, but in short: I have no solution that makes this query finish within 5 minutes. I tried the example block and list of files from this ticket with bare SQL statements, and no matter what I did, they could not finish within 5 minutes; our datasets are simply too big now. Using the LFN list does not reduce the search time. I think the reason is that the LFNs only restrict the children, while the parents still span the entire dataset, and the children are already much smaller than the parents because we restrict them by block.

Details:

The original SQL:

```sql
WITH TOKEN_GENERATOR AS (
    SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token FROM DUAL
    CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1
),
parents as (
    select run_num as R, Lumi_section_num as L, file_id as pid from cms_dbs3_prod_global_owner.file_lumis fl
    where fl.file_id in
        (select file_id from cms_dbs3_prod_global_owner.files f
         where F.DATASET_ID in (select parent_dataset_id from cms_dbs3_prod_global_owner.dataset_parents dp
             inner join cms_dbs3_prod_global_owner.datasets d on d.dataset_id=DP.THIS_DATASET_ID
             where d.dataset = :child_ds_name ))),
children as (
    select run_num as R, Lumi_section_num as L, file_id as cid from cms_dbs3_prod_global_owner.file_lumis fl
    where fl.file_id in (select file_id from cms_dbs3_prod_global_owner.files f
        inner join cms_dbs3_prod_global_owner.blocks b on f.block_id = b.block_id
        where b.block_name = :child_block_name and f.logical_file_name in (SELECT TOKEN FROM TOKEN_GENERATOR) )) )
select distinct cid, pid from children c inner join parents p on c.R = p.R and c.L = p.L
```

There was a bug in the third subquery, "children as": Oracle does not allow one of these subqueries to use the other subqueries. That is why we got the error "(DatabaseError) ORA-00928: missing SELECT keyword".

So what I did was simply remove the file-list input and search the entire block, as in the query below:

```sql
WITH TOKEN_GENERATOR AS (
    SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token FROM DUAL
    CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1
),
parents as (
    select run_num as R, Lumi_section_num as L, file_id as pid from cms_dbs3_prod_global_owner.file_lumis fl
    where fl.file_id in
        (select file_id from cms_dbs3_prod_global_owner.files f
         where F.DATASET_ID in (select parent_dataset_id from cms_dbs3_prod_global_owner.dataset_parents dp
             inner join cms_dbs3_prod_global_owner.datasets d on d.dataset_id=DP.THIS_DATASET_ID
             where d.dataset = :child_ds_name ))),
children as (
    select run_num as R, Lumi_section_num as L, file_id as cid from cms_dbs3_prod_global_owner.file_lumis fl
    where fl.file_id in (select file_id from cms_dbs3_prod_global_owner.files f
        inner join cms_dbs3_prod_global_owner.blocks b on f.block_id = b.block_id
        where b.block_name = :child_block_name ))
select distinct cid, pid from children c inner join parents p on c.R = p.R and c.L = p.L;
```

The query above took 5:18 minutes.

Then I added the file list back into the query as below; instead of using the TOKEN_GENERATOR subquery from the WITH clause, I put that query directly inline:

```sql
WITH parents as (
    select run_num as R, Lumi_section_num as L, file_id as pid from cms_dbs3_prod_global_owner.file_lumis fl
    where fl.file_id in
        (select file_id from cms_dbs3_prod_global_owner.files f
         where F.DATASET_ID in (select parent_dataset_id from cms_dbs3_prod_global_owner.dataset_parents dp
             inner join cms_dbs3_prod_global_owner.datasets d on d.dataset_id=DP.THIS_DATASET_ID
             where d.dataset = :child_ds_name ))),
children as (
    select run_num as R, Lumi_section_num as L, file_id as cid from cms_dbs3_prod_global_owner.file_lumis fl
    where fl.file_id in (select file_id from cms_dbs3_prod_global_owner.files f
        inner join cms_dbs3_prod_global_owner.blocks b on f.block_id = b.block_id
        where b.block_name = :child_block_name and F.LOGICAL_FILE_NAME in (
            SELECT REGEXP_SUBSTR(:token_0, '[^,]+', 1, LEVEL) token FROM DUAL
            CONNECT BY LEVEL <= LENGTH(:token_0) - LENGTH(REPLACE(:token_0, ',', '')) + 1)) )
select distinct cid, pid from children c inner join parents p on c.R = p.R and c.L = p.L;
```

This query took 6:43 minutes. As we can see, limiting with LFNs did not help.

amaltaro commented 4 years ago

I see! Thank you for looking into it, Yuyi. We will then have to redesign the whole StepChain parentage handling, in addition to finding a possible workaround for the upcoming days/weeks (like building the parentage on the client side and posting the parentage relationship directly to an API). Is there such a DBS write API?

yuyiguo commented 4 years ago

Alan, I am not sure I understand your question. Are you asking how we should handle the missing parentage data already in DBS? I am still thinking about how to fix that.

yuyiguo commented 4 years ago

Alan,

In order to handle the data already in DBS, I am thinking that we may break the listFileParentsByLumi(block_name) API into two APIs:

  1. listParentsFileLumis(child_ds_name): the first part of the compound query, which finds the lumi numbers and file ids of the parent dataset.

  2. listFileLumis(child_block_name): the second part of the compound query, which finds the lumi numbers and file ids of the child block.

Once we have both sets of data, we can find the unique matches between 1 and 2 in Python, roughly as sketched below.
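
For illustration, the client-side matching could look like this sketch (the two API names above and the (run, lumi, file_id) return format are assumptions at this point, not the existing DBS interface):

```python
from collections import defaultdict

def match_parentage(parent_lumis, child_lumis):
    """Join the two result sets on (run, lumi), mirroring the SQL join
    'on c.R = p.R and c.L = p.L', and return {child_file_id: parent_file_ids}."""
    parents_by_lumi = defaultdict(set)
    for run, lumi, parent_id in parent_lumis:
        parents_by_lumi[(run, lumi)].add(parent_id)

    parentage = defaultdict(set)
    for run, lumi, child_id in child_lumis:
        parentage[child_id] |= parents_by_lumi.get((run, lumi), set())
    return dict(parentage)
```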

If you think this is an approach you want to pursue, I will test whether we can get all the lumi numbers for a big dataset within 5 minutes.

Yuyi

vkuznet commented 4 years ago

I really don't know whether the proposed solution will hold up, since it seems to me that sooner or later you'll hit the 5-minute limit again. I consider this a temporary fix; it does not solve the problem. If the DB can't handle the parentage load the way DMWM queries it, we can't fix that in the APIs. We either need to do something at the DB level to speed up those queries (e.g. run a procedure/function to generate this info in the background), refactor the DB to handle the parentage use case, or move to another solution (HDFS) to get the parentage.

But before that, I would rather see a full description of WHY we need to support this use case. From the ticket description it is unclear why DMWM needs this, apart from the fact that it relies heavily on it.

amaltaro commented 4 years ago

Yuyi,

I'm not sure I followed your suggestion. Are you saying that we could use those two new APIs to fix the parentage relationship on the application side, and then inject a list of parent file ids for a given child file id? If this is what you're saying, then we should likely use block-level operations to avoid eating all the memory when parsing it.
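
Just to illustrate what I mean by injecting the parentage, and assuming a write API of that kind existed (the insertFileParents name and signature below are hypothetical, which is exactly my question above):

```python
# Hypothetical sketch: given a child -> parents map built on the client side
# (e.g. with the matching sketched earlier in this thread), post it back to
# DBS one block at a time. insertFileParents is an assumed write API here,
# not something we know exists yet.
def fix_block_parentage(dbsApi, block_name, parentage):
    # flatten {child_file_id: {parent_file_ids}} into [child_id, parent_id] pairs
    child_parent_pairs = [[child_id, parent_id]
                          for child_id, parents in parentage.items()
                          for parent_id in parents]
    dbsApi.insertFileParents(block_name=block_name,
                             child_parent_id_list=child_parent_pairs)
```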

Valentin, the problem has been reported here: https://github.com/dmwm/WMCore/issues/8269

and this is how we decided to handle StepChain parentage. Things happen asynchronously, meaning that we could be inserting data into DBS for a NANOAODSIM dataset while the AODSIM hasn't even been merged yet.

I don't rule out the possibility of fixing it within WMAgent, but I'm pretty sure that would be a substantial change and it can't be done within a few weeks.

yuyiguo commented 4 years ago

Valentin, Alan,

What I proposed was a temporary solution to fix the existing data already in DBS.
The questions are:

  1. When will the problem be fixed in WMAgent, so that we no longer have to fix the parentage?
  2. How many blocks are missing file-level parentage in DBS?

When I proposed that solution, I thought that

  1. WMAgent would be fixed in the next few months. If this is not the case, what is the schedule?
  2. The number of blocks with missing file parentage is very small; that would explain why no one found the problem for many years.

Maybe we should discuss the problem more before settling on any fix. Thanks, Yuyi

amaltaro commented 4 years ago

Yuyi, it actually isn't/wasn't in our plans to modify how the StepChain parentage works because I wasn't aware of such limitations. I can try to find out the number of blocks without parentage information, but it's likely around hundreds of blocks (hopefully not a few thousand).

I think the question back when this issue was first discussed was: how much work would it be to get this problem fixed in WMCore and in DBS? That is still a valid question. As I said before, we can certainly fix it in WMAgent; it likely won't be a clean solution, and it will probably take us weeks of development, but it's not impossible.

yuyiguo commented 4 years ago

Alan, if I understood you correctly, WMAgent/StepChain will always send broken/partial data to DBS? In other words, we are looking for solutions to fix the data in DBS instead of fixing the source of the problem.

WMAgents have their own databases; can the partial data wait in the local DB? WMAgent could insert the NANOAODSIM dataset into its local database while waiting for the AODSIM to be merged, and then upload the completed dataset/block into DBS.

amaltaro commented 4 years ago

Your understanding is correct, Yuyi. The merging step is asynchronous, and so is data injection into PhEDEx/DBS. So yes, StepChain output will likely always be broken.

Yes, we can definitely implement what we need in the agents. However, I'd first like to explore a DBS/database-side option, if we can still find one.

FYI @todor-ivanov

yuyiguo commented 4 years ago

Alan, how do you fix the data in DBS now? Do you have a cron job running at some frequency to fix it?

We are going to partition the DBS files and file_lumis tables. I think queries against these two tables will improve after the partitioning, but the partitioning will take some time. How long can you wait? Kate is out of the office this week; I will discuss the partitioning schedule with her next week.

amaltaro commented 4 years ago

We have a cherrypy thread running every 3h: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py

even though it looks like the current cycle has been running since the cmsweb production upgrade (it was implemented sequentially). Do you think it would hurt DBS too badly if we fixed the parentage with X concurrent requests against DBS (like 10 blocks concurrently)?
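
For the record, the concurrency I have in mind is nothing fancier than the sketch below (fix_block is a placeholder for whatever per-block parentage fix we end up with; the pool size of 10 matches the example above):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fix_blocks_concurrently(blocks, fix_block, max_workers=10):
    """Run the per-block parentage fix for up to max_workers blocks at a time."""
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fix_block, blk): blk for blk in blocks}
        for future in as_completed(futures):
            blk = futures[future]
            try:
                future.result()
            except Exception as exc:
                # keep going and report the failing blocks at the end
                failures.append((blk, exc))
    return failures
```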

I believe it would be okay to wait for a few weeks, but we might be unlucky and have users looking into those samples and their parents, which will increase the priority to get it fixed.

I'm also about to leave on vacation (day after tomorrow).

vkuznet commented 4 years ago

Alan, if a single query takes 5 minutes or more, then concurrent access will make things worse, since each query uses a transaction and all the concurrent queries will block DBS.

Even though a temporary solution may be put in place, you need to fix the root of the problem. From what I read and understood in this ticket, the proper solution should be implemented in WMCore, not in the DBS code.


amaltaro commented 4 years ago

I had the impression that select/join statements wouldn't block tables; we are also not updating anything within the same transaction, so it might not hurt the database performance. Yuyi, can you please confirm that?

vkuznet commented 4 years ago

But if the select requires a full table scan, then N concurrent queries will stall. Anyway, it's better to have Yuyi confirm.
