dmwm / CRABServer


disable publication for non-VALID input datasets #7334

Open belforte opened 2 years ago

belforte commented 2 years ago

users can (partially) process datasets which are still in production (!?) via

config.Data.allowNonValidInputDataset = True

see e.g. https://cmsweb.cern.ch:8443/scheddmon/0197/rkansal/220705_155318:rkansal_crab_pfnano_v2_3_2017_HZJ_HToWW_M-125/debug/crabConfig.py which triggered the mail exchange [1] with Alan, Yuyi and Valentin

But it makes no sense to try to publish output in DBS, since parentage info is not available for PRODUCTION datasets and things will end up in an endless error loop inside Publisher
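The check that this issue asks for is simple in essence: only datasets whose DBS access type is VALID can have complete parentage, so publication should only be attempted for those. A minimal sketch of that decision (the function name and the record layout are illustrative, not actual CRABServer code; `dataset_access_type` is the field name used by DBS dataset listings):

```python
def should_publish(dataset_info, user_wants_publication):
    """Decide whether Publisher should attempt DBS publication of task output.

    dataset_info: dict describing the input dataset, containing at least
    'dataset_access_type' (e.g. 'VALID', 'PRODUCTION', 'INVALID', 'DEPRECATED').
    user_wants_publication: the tm_publication flag from the task configuration.
    """
    if not user_wants_publication:
        return False
    # Parentage information is only guaranteed for VALID datasets; publishing
    # output of a PRODUCTION input would loop forever inside Publisher.
    return dataset_info.get('dataset_access_type') == 'VALID'
```

With `config.Data.allowNonValidInputDataset = True` the task can still run; this check only turns off the publication step.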

[1]

Yuyi,
thanks for clarification, I already applied the required change to migration
server and it will not accept requests from clients if dataset is not in a VALID
state.
Valentin.

On  0, Yuyi Guo [<yuyi@fnal.gov>](mailto:yuyi@fnal.gov) wrote:
>    Thanks Alan for the explanation. I don't see a use case for  migrating
>    an incomplete block/dataset.
>
>
>    Valentin, This was the first I heard about this "error". No one should
>    touch a block/dataset in production status except for the data
>    processing group.
>
>
>    Cheers,
>
>    Yuyi
>
>
>    From: Alan Malta Rodrigues [<alan.malta@cern.ch>](mailto:alan.malta@cern.ch)
>    Date: Monday, July 11, 2022 at 5:59 AM
>    To: Valentin Y Kuznetsov [<vkuznet@protonmail.com>](mailto:vkuznet@protonmail.com), Yuyi Guo
>    [<yuyi@fnal.gov>](mailto:yuyi@fnal.gov)
>    Cc: Stefano Belforte [<stefano.belforte@gmail.com>](mailto:stefano.belforte@gmail.com), [klannon@nd.edu](mailto:klannon@nd.edu)
>    [<klannon@nd.edu>](mailto:klannon@nd.edu), Diego Ciangottini [<diego.ciangottini@cern.ch>](mailto:diego.ciangottini@cern.ch), Todor
>    T. Ivanov [<todor.trendafilov.ivanov@cern.ch>](mailto:todor.trendafilov.ivanov@cern.ch)
>    Subject: RE: weird migration use-case (missing block parentage for
>    existing dataset one)
>
>    Hi Valentin,
>    I can explain why there is no parent information for:
>    /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#92f25318-1797-43ea-a01e-02fda4b18908
>    and the reason is that this dataset is under production right now,
>    meaning that there is an active
>    workflow (running-open status) still writing to it.
>    In addition to that, it's a StepChain workflow. Their parentage
>    information is only performed once the
>    workflow moves to "close-out" status (basically getting announced).
>    I guess one can say that migrating a growing dataset between DBS
>    instances isn't really a valid
>    use case, since the migration acts on a snapshot of the dataset...
>    Cheers,
>    Alan.
>    ________________________________________
>    From: Valentin Kuznetsov [[vkuznet@protonmail.com](mailto:vkuznet@protonmail.com)]
>    Sent: Sunday, July 10, 2022 6:22 PM
>    To: Yuyi Guo
>    Cc: Stefano Belforte; Alan Malta Rodrigues; [klannon@nd.edu](mailto:klannon@nd.edu); Diego
>    Ciangottini; Todor Trendafilov Ivanov
>    Subject: weird migration use-case (missing block parentage for existing
>    dataset one)
>    Yuyi,
>    during debugging process of new Go-based migration service [1] we found
>    one
>    weird use-case which I would like to understand.
>    The following block
>    /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM#92f25318-1797-43ea-a01e-02fda4b18908
>    has no parents in DBS, but its dataset
>    /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM
>    does have a parent dataset
>    /HZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17RECO-106X_mc2017_realistic_v6-v1/AODSIM
>    How is this possible? Does this case represent some "failure" or
>    missing data in DBS, or is it a real use-case? According to the dataset
>    details [2] it was created
>    created
>    on 1647603603 UNIX time which translates into Mar 18th of 2022, see
>    ```
>    time.gmtime(1647603603)
>    time.struct_time(tm_year=2022, tm_mon=3, tm_mday=18, tm_hour=11,
>    tm_min=40, tm_sec=3, tm_wday=4, tm_yday=77, tm_isdst=0)
>    ```
>    The new DBS Go writer was put into production by May 17th (see slide 12
>    in [3]),
>    and it means that originally it was inserted into DBS using Python DBS
>    server.
>    Therefore, the logic of insertion comes from DBS Python server.
>    As such, I need to understand this use-case in order to make proper set
>    of
>    actions. Either we need to add block parent, or remove dataset parent
>    or adjust
>    logic of migration server to account for such use-case(s). But for that
>    it would
>    be very useful to understand this specific use-case and how we end-up
>    with it.
>    Thanks,
>    Valentin.
>    [1]
>    [1]https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_dmwm
>    _dbs2go_issues_53&d=DwIFAg&c=gRgGjJ3BkIsb5y6s49QqsA&r=8bursUuc0V63OwREQ
>    MBG2Q&m=Q66capO7HuiUOhppidUdfPob2CsDOdzAafuTwL7OnmQC2jPboPILFrVoIf2y_Tp
>    q&s=quEk3b3n0ImTb7XikH1vSDO-YvSt3uuVX8RZVpKmZ2Y&e=
>    [2]
>    [2]https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datasets?dataset=/H
>    ZJ_HToWW_M-125_TuneCP5_13TeV-powheg-jhugen727-pythia8/RunIISummer20UL17
>    RECO-106X_mc2017_realistic_v6-v1/AODSIM&detail=true
>    [3]
>    [3]https://indico.cern.ch/event/1157140/contributions/4858857/attachmen
>    ts/2437408/4174867/220504%20-%20O%26C%20Weekly%20News.pdf
belforte commented 2 years ago

note to myself: publication for a task is controlled in the schedd via the classAd CRAB_Publish, which is set in DagmanCreator based on the tm_publication value. The DBS status of the input dataset is checked in DBSDataDiscovery. One easy way could be to override the value of tm_publication in the DB inside DBSDataDiscovery; need to check if we have an API for that. Drawback: the DB info will not match what's in the crab config, which may be puzzling for future debuggers. Less appealing is to check the dataset type again in DagmanCreator, since DBS queries do not belong there.

Maybe it is enough to override the task object content in DBSDataDiscovery without touching the DB? Maybe changing the DB value would be irrelevant anyhow?

TO BE TESTED

belforte commented 1 year ago

time to fix this, since it now happens and annoys us in the production server, see https://mattermost.web.cern.ch/cms-o-and-c/pl/98zp9hw893rtuceb731f8ins5e

rising priority

belforte commented 1 year ago

there is no API to override the value of tm_publication in the DB in https://github.com/dmwm/CRABServer/blob/master/src/python/CRABInterface/RESTTask.py . Let's first try a solution which does not require deploying a new REST server

belforte commented 1 year ago

the trick of overwriting the in-memory task object with

kwargs['task']['tm_publication']='F'

in here https://github.com/dmwm/CRABServer/blob/8c51e4a5de68531591e686ddeb47b5ab0fe33325/src/python/TaskWorker/Actions/DBSDataDiscovery.py#L33-L43 works. But the fact that the DB flag still says "publication on" leads to confusing crab status output, and overall things will look inconsistent. I could add a warning, but a cleaner solution would be better. Let's investigate adding an API to change the flag. It is a bit more work but should be straightforward.
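For the record, the in-memory override plus the warning discussed above could look roughly like this (a sketch only: the helper name is hypothetical, and the surrounding DBSDataDiscovery machinery and logger setup are assumed rather than reproduced):

```python
import logging

def disable_publication_in_memory(kwargs, logger):
    """Turn off publication for this TaskWorker run only.

    Overwrites the in-memory task object; the DB row keeps tm_publication='T',
    so 'crab status' will still show publication as enabled -- hence the
    explicit warning to leave a trace for future debuggers.
    """
    if kwargs['task'].get('tm_publication') == 'T':
        kwargs['task']['tm_publication'] = 'F'
        logger.warning(
            "Input dataset is not VALID: publication disabled for this task. "
            "Note: the task DB still reports publication as enabled."
        )
    return kwargs
```

The asymmetry between the in-memory value and the DB row is exactly the inconsistency the comment above worries about; the warning only mitigates it.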

belforte commented 1 year ago

@mapellidario @novicecpp in the spirit of what was said in the last meeting about "hand over to you issues which you can deal with and do not require extensive knowledge", is this additional API something one of you feels like doing? Adding a new API is a bit tedious, but it should be possible to proceed by slightly modifying an existing one which updates some other column. If there's interest I can walk you through the steps.
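The DB-side alternative ultimately boils down to a single-column update keyed on the task name. A self-contained illustration using sqlite (the column names tm_publication and tm_taskname come from this thread, but the table name, function, and sqlite backend are stand-ins for the real CRABServer schema and its Oracle bindings):

```python
import sqlite3

def set_publication_flag(conn, taskname, flag):
    """Persist the publication flag ('T' or 'F') so that crab status and
    the TaskWorker see a consistent value for the task."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "UPDATE tasks SET tm_publication = ? WHERE tm_taskname = ?",
            (flag, taskname),
        )

# Demo with an in-memory DB standing in for the real task database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (tm_taskname TEXT PRIMARY KEY, tm_publication TEXT)")
conn.execute("INSERT INTO tasks VALUES ('example_task', 'T')")
set_publication_flag(conn, 'example_task', 'F')
```

Wrapping this update in a REST handler is the tedious part; the query itself is the whole of the new API's logic.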

belforte commented 1 year ago

hmm.... now those tasks which try impossible publications are not "harmful" anymore; they simply fail publication without reporting a reason to the user. We can lower priority

belforte commented 4 months ago

Short term quick solution: