dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

Provide DBS REST API and client API which can return parent files by matching run lumis: listFileParents(block_name, logical_file_name=None) #569

Closed ticoann closed 6 years ago

ticoann commented 6 years ago

i.e. listFileParents(child_block_name, child_lfn_list=None). If child_lfn_list is None, get all the files from the child block (faster in Oracle). Returns [{child_parent_id_list: [(cid1, pid1), (cid2, pid2), ..., (cidn, pidn)]}]

A REST API is needed for concurrent calls using pycurl.

yuyiguo commented 6 years ago

Please provide your use cases and detailed requirements.

ticoann commented 6 years ago

Use case: For StepChain workflows there is no easy way to get the parentage between files. https://github.com/dmwm/WMCore/wiki/StepChain-Parentage

We need to be able to go from a child LFN to the list of parent LFNs for a given parent dataset. We can get this result by combining current DBS APIs; however, since the call needs to be made for every file in a given dataset, there should be an optimized way to handle it. One way is to make the calls concurrently, but the current DBS API doesn't support concurrency: the APIs run sequentially and the calls depend on one another.

Requirement: We need either a DBS API that runs concurrently when it takes a list of childLFNs and a parentDataset, or a REST API that takes a single childLFN and a parentDataset, so that we can call that REST API concurrently using pycurl.

yuyiguo commented 6 years ago

As discussed with @ticoann yesterday, we reached the decisions below:

  1. We should use bulk operations on the DB in order to speed up DB access. Using pycurl for concurrent calls to the DB is bad practice.
  2. We should separate finding parentage from inserting them to keep the API execution time shorter.
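The bulk-operation point in item 1 can be illustrated with a minimal sketch. It uses an in-memory SQLite database purely for illustration; the table name `file_lumis`, its columns, and the helper `lumis_bulk` are assumptions and do not reflect the real DBS Oracle schema or code. The idea is that one query fetches rows for many files at once, instead of one round trip per file.

```python
import sqlite3

# Hypothetical table standing in for the DBS run-lumi table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_lumis (file_id INTEGER, run INTEGER, lumi INTEGER)")
conn.executemany(
    "INSERT INTO file_lumis VALUES (?, ?, ?)",
    [(101, 1, 10), (102, 1, 11), (103, 1, 12)],
)


def lumis_bulk(conn, file_ids):
    """Fetch run-lumi rows for many files with a single query."""
    # One bound placeholder per id; the values themselves are never
    # interpolated into the SQL text.
    placeholders = ",".join("?" for _ in file_ids)
    sql = (
        "SELECT file_id, run, lumi FROM file_lumis "
        f"WHERE file_id IN ({placeholders}) ORDER BY file_id"
    )
    return conn.execute(sql, list(file_ids)).fetchall()


rows = lumis_bulk(conn, [101, 103])
print(rows)
```

A per-file loop would issue one query for each ID; the bulk form keeps the number of database round trips constant regardless of how many files are requested.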

The proposed API: listFileParents(listofchildLFNs, listofParentDatasets). It will return a list of file parentages as [(cid1, pid1), (cid2, pid2), ..., (cidn, pidn)].

Note that a CMS LFN is a very long string, so we will not return matching LFN pairs, but database IDs. What this API returns is not meant for humans to read, but for the insert API to upload parentage to the DB. The insert API should be called right after this one, and it should insert into the same DB this one is connected to. The list and insert APIs have to be called in sequence and in pairs.

This API will use the run-lumi info from the child LFNs to match the run-lumi info of the listed parent datasets. There may be wrong matches, because the same run-lumi info may appear in multiple parent files. There is no way for DBS to know which were the real parent files. Is this acceptable to the physicists, @bbockelm?
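The run-lumi matching described above, including the ambiguity when several parent files share the same run-lumi, can be sketched as follows. This is only an illustration under stated assumptions: the in-memory rows, the file IDs, and the function name `match_parents_by_run_lumi` are hypothetical and are not DBS code.

```python
from collections import defaultdict

# Hypothetical rows of (file_id, run, lumi); real DBS keeps these in the database.
child_lumis = [(101, 1, 10), (101, 1, 11), (102, 1, 12)]
parent_lumis = [(11, 1, 10), (12, 1, 11), (13, 1, 12), (14, 1, 12)]


def match_parents_by_run_lumi(child_rows, parent_rows):
    """Return sorted, unique (child_id, parent_id) pairs sharing a (run, lumi)."""
    # Index parent files by the (run, lumi) sections they contain.
    parents_by_lumi = defaultdict(set)
    for pid, run, lumi in parent_rows:
        parents_by_lumi[(run, lumi)].add(pid)

    pairs = set()
    for cid, run, lumi in child_rows:
        # Every parent holding this (run, lumi) is reported as a candidate.
        for pid in parents_by_lumi.get((run, lumi), ()):
            pairs.add((cid, pid))
    return sorted(pairs)


pairs = match_parents_by_run_lumi(child_lumis, parent_lumis)
print(pairs)  # [(101, 11), (101, 12), (102, 13), (102, 14)]
```

Child 102 is paired with both parents 13 and 14 because each contains run 1, lumi 12. The matching alone cannot tell which one was the real parent, which is the concern raised above.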

ticoann commented 6 years ago

Doesn't listFileParents already exist? listFileParentsUsingLumi? Also, we need the equivalent REST API.

yuyiguo commented 6 years ago

Different signatures. RESTApi should be the same.

amaltaro commented 6 years ago

Seangchan, Yuyi

The proposed API: listFileParents(listofchildLFNs, listofParentDatasets) It will return a list of File Parentages as[(cid1, pid1), (cid2, pid2) ...(cidn, pidn)] Note CMS LFN is a very long string so we will not return matching LFN pairs, but database ids.

Given that the output of the listFileParents API isn't human readable, I think the best would be to repurpose it so that it both finds and fixes the file parentage. I understand it will then become a very slow API (it would be better if we had timing numbers for the find and the insert actions), but maybe we could then restrict it to a block-level operation only (one block at a time)?

In addition to that, I'm concerned about the integrity of the DBS data. Would this API allow overwriting a parent file? Or would it only allow changes to children with NO parent files?

I know it's substantially more work, but it might be a good idea to have a specific DBS role for such operations. I'd say any insert or data change should have a different DBS role than the one used for changing dataset and file status. That would be a good way to protect data from mistakes made by humans and by other scripts/tools.

yuyiguo commented 6 years ago

@amaltaro All the operations we talked about in this group of APIs are block-based. We will keep the search and the insert separate for now.

We are very short on time and manpower right now, so I'd like to focus on the current issues. If we agree to redesign the DBS permission system, we need to think it through and have a plan.

yuyiguo commented 6 years ago

Unit tests should include both client-side and server-side tests. API: listFileParents(listofChildLFNs, listofParentDatasets)

Negative tests: if any of the inputs is missing, the API should fail.

Input tests:

  1. input: a single-item LFN list and a single dataset.
  2. input: a long list of LFNs and a single dataset.
  3. input: a long list of LFNs and a long list of datasets.

All input tests should compare the result with listFileParents(logical_file_name). A test fails when the results differ.
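The test plan above could take roughly the following shape. This is a sketch only: `FakeDbs`, its method names, and its data are hypothetical stand-ins so the example runs without a DBS server; a real test would call the DBS client and server instead.

```python
import unittest


class FakeDbs:
    """Hypothetical stand-in for a DBS client; not the real client API."""

    def __init__(self):
        # child LFN -> list of (child_id, parent_id) pairs already known
        self._pairs = {
            "/store/child/a.root": [(101, 11)],
            "/store/child/b.root": [(102, 12), (102, 13)],
        }

    def list_parents_single(self, lfn):
        # Plays the role of the existing listFileParents(logical_file_name).
        return list(self._pairs.get(lfn, []))

    def list_parents_bulk(self, child_lfns, parent_datasets):
        # Plays the role of the proposed bulk API.
        if not child_lfns or not parent_datasets:
            raise ValueError("child LFNs and parent datasets are both required")
        return [p for lfn in child_lfns for p in self.list_parents_single(lfn)]


class ListFileParentsTest(unittest.TestCase):
    def setUp(self):
        self.api = FakeDbs()
        self.datasets = ["/Prim/Proc-v1/RAW"]  # hypothetical dataset name

    def test_missing_input_fails(self):
        with self.assertRaises(ValueError):
            self.api.list_parents_bulk([], self.datasets)
        with self.assertRaises(ValueError):
            self.api.list_parents_bulk(["/store/child/a.root"], [])

    def test_single_lfn_matches_reference(self):
        lfn = "/store/child/a.root"
        bulk = self.api.list_parents_bulk([lfn], self.datasets)
        self.assertEqual(sorted(bulk), sorted(self.api.list_parents_single(lfn)))

    def test_many_lfns_match_reference(self):
        lfns = ["/store/child/a.root", "/store/child/b.root"]
        bulk = self.api.list_parents_bulk(lfns, self.datasets)
        reference = [p for lfn in lfns for p in self.api.list_parents_single(lfn)]
        self.assertEqual(sorted(bulk), sorted(reference))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ListFileParentsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```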

yuyiguo commented 6 years ago

After discussing with Seangchan, we will restrict the missing file parentage to a single block. So the updated API is listFileParents(listofChildLFNs, listofParentDatasets, childBlockName), where listofChildLFNs are the files with missing parentage.

amaltaro commented 6 years ago

Discussed with Seangchan, we will restrict the missing file parentage is from single block. So The updated API is listFileParents(listofChildLFNs, listofParentDatasets, childBlockName)

Yuyi, Seangchan, I'm afraid I'm a bit lost in here... Isn't the use case the following:

IF this is the correct use case, why do we care about listofChildLFNs and listofParentDatasets? Is this design considering that in the future we might have a file with 2 different parent datasets?

Last but not least, just a minor suggestion, how about calling it listParentFiles instead? It's more clear IMO.

yuyiguo commented 6 years ago

Alan:

We have this API because we want to find the parent files for these listOfChildLFNs. In most cases, the file parentage is missing for every file in childBlockName; in a small portion of cases only some of the files have no parentage, so we need to list the child files that are missing it.

The API returns only the (childFile, parentFile) pairs that do not have parentage in DBS.

According to Seangchan, the parent files could be in more than one dataset, so it is a list.

We know the child file and try to find its parents. Based on the DBS API naming convention, it is listFileParents. We already have APIs called listFileParents(blockname), listFileChildren and so on.

ticoann commented 6 years ago

I just discussed the issue with Alan. Although the agent doesn't restrict the input to a single dataset, we have never had the multiple-dataset case. So I think it would be OK to change listofParentDatasets to a single parent dataset. Also, since the parent dataset can easily be obtained from a DBS call when you know the child block name, we might make that optional as well. Yuyi, we can discuss this in more detail tomorrow.

yuyiguo commented 6 years ago

Sounds good to me.

From: ticoann <notifications@github.com>
Reply-To: dmwm/DBS <reply@reply.github.com>
Date: Thursday, July 12, 2018 at 3:50 PM
To: dmwm/DBS <DBS@noreply.github.com>
Cc: Yuyi Guo <yuyi@fnal.gov>, Comment <comment@noreply.github.com>
Subject: Re: [dmwm/DBS] Provide DBS REST API and client API which can return parent files by matching run lumis: listFileParents(listofchildLFNs, listofParentDatasets, blockName) (#569)


yuyiguo commented 6 years ago

The new signature: listFileParents(block_name, logical_file_name=None). DBS internal validation requires the same names.
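To make the semantics of the optional argument concrete, here is a minimal in-memory sketch of that signature. Everything apart from the parameter names is an assumption: the data, the block name, and the function body are hypothetical, and the return shape is borrowed from the original proposal at the top of this issue rather than confirmed for the final API.

```python
# Hypothetical data standing in for the DBS tables.
BLOCK_FILES = {"/Prim/Proc-v1/AOD#block1": ["a.root", "b.root"]}
FILE_IDS = {"a.root": 101, "b.root": 102}
PARENT_IDS = {101: [11], 102: [12, 13]}


def listFileParents(block_name, logical_file_name=None):
    """Return child/parent ID pairs for a block, optionally limited to some LFNs."""
    files_in_block = BLOCK_FILES[block_name]
    if logical_file_name is None:
        # No LFNs given: use every file in the child block.
        lfns = files_in_block
    else:
        # Accept a single LFN or a list, and keep only files from this block.
        if isinstance(logical_file_name, str):
            requested = [logical_file_name]
        else:
            requested = list(logical_file_name)
        lfns = [lfn for lfn in requested if lfn in files_in_block]

    pairs = [(FILE_IDS[lfn], pid) for lfn in lfns for pid in PARENT_IDS[FILE_IDS[lfn]]]
    return [{"child_parent_id_list": pairs}]


whole_block = listFileParents("/Prim/Proc-v1/AOD#block1")
one_file = listFileParents("/Prim/Proc-v1/AOD#block1", ["b.root"])
print(whole_block)
print(one_file)
```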