dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

Provide insertFileParentages(block_name, child_parent_id_list) #570

Closed ticoann closed 6 years ago

ticoann commented 6 years ago

block_name: block name (string)

child_parent_id_list: [(cid1, pid1), (cid2, pid2), ... (cidn, pidn)] (list of (child id, parent id) tuples)

i.e. insertFileParentages([(cid1, pid1), (cid1, pid2), (cid2, pid2), ... (cidn, pidn)], childBlockName)

Prerequisite: all the cids belong to childBlockName, and every file in that block is included, with no missing files. Returns nothing.

Unit test: compares the inserted data with the data retrieved back.

Raises an exception if only partial parentage is inserted.
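A minimal sketch of the prerequisite check described above, assuming the ids of all files in the child block are already known to the caller. The function names and the `block_file_ids` argument are hypothetical and are not part of DBS.

```python
def find_coverage_problems(block_file_ids, child_parent_id_list):
    """Compare the child ids in the parentage list with the files of the block.

    block_file_ids: ids of every file in the child block (assumed to be known).
    child_parent_id_list: list of (cid, pid) tuples.
    Returns (missing, foreign): block files that have no parentage entry, and
    child ids that do not belong to the block.
    """
    child_ids = {cid for cid, _pid in child_parent_id_list}
    expected = set(block_file_ids)
    return expected - child_ids, child_ids - expected


def check_block_coverage(block_file_ids, child_parent_id_list):
    """Raise ValueError if the parentage is partial or mixes in other blocks."""
    missing, foreign = find_coverage_problems(block_file_ids, child_parent_id_list)
    if missing:
        raise ValueError("partial parentage, no entry for file ids %s" % sorted(missing))
    if foreign:
        raise ValueError("file ids not in the child block: %s" % sorted(foreign))
```

A child may appear several times with different parents, so the check works on the set of distinct child ids rather than on the length of the list.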

yuyiguo commented 6 years ago

Please provide your use cases and detailed requirements.

ticoann commented 6 years ago

Use case: For StepChain workflows, there is no easy way to get the parentage between files. https://github.com/dmwm/WMCore/wiki/StepChain-Parentage

Related to #568 and #569. When the parentage of the child files is discovered, we need to be able to insert that parentage into DBS.

Requirement: Since parentage will be missing for the whole dataset, this requires a bulk insert for performance reasons. We can provide the list of bind variables shown above for the bulk insert. However, the corresponding block parentage needs to be inserted as well. It can be worked out automatically in the DBS API, but that might cause a performance hit. Otherwise, we can provide the block parentage separately, but validating that might be tricky.
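To illustrate the "figure it out automatically" option, here is a rough sketch of deriving block parentage from the file parentage list. The `block_of_file` lookup is a hypothetical in-memory dict; in a real service it would presumably be a database query, which is where the performance cost mentioned above would come from.

```python
def derive_block_parentage(child_parent_id_list, block_of_file):
    """Collapse file-level (cid, pid) pairs into distinct block-level pairs.

    block_of_file: dict mapping a file id to its block (hypothetical lookup).
    Returns a sorted list of (child_block, parent_block) tuples.
    """
    block_pairs = set()
    for cid, pid in child_parent_id_list:
        # Many file pairs map to the same block pair, so a set removes duplicates.
        block_pairs.add((block_of_file[cid], block_of_file[pid]))
    return sorted(block_pairs)
```

For example, file pairs `[(1, 10), (2, 10), (2, 11)]` with files 1 and 2 in one child block and files 10 and 11 in two different parent blocks yield two block parentage pairs.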

bbockelm commented 6 years ago

Hi,

What's the plan / timeline for this?

Brian

@vlimant - you may want to follow this one.

ticoann commented 6 years ago

:-) @yuyiguo, you can delete the above comments then.

yuyiguo commented 6 years ago

The proposed API: insertFileParentages([(cid1, pid1), (cid2, pid2), ... (cidn, pidn)])

This API will take the output from the API described in https://github.com/dmwm/DBS/issues/569.

The API will use the file parentage to check the existing dataset parentage and will report an error when they do not match. It will also update the block parentages.
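As an illustration only, the dataset parentage check could look roughly like the sketch below. Both `dataset_of_file` and `dataset_parents` are hypothetical in-memory stand-ins for what DBS already has on record; this is not the actual server code.

```python
def find_dataset_mismatches(child_parent_id_list, dataset_of_file, dataset_parents):
    """Return the (cid, pid) pairs that contradict the recorded dataset parentage.

    dataset_of_file: dict mapping a file id to its dataset (hypothetical lookup).
    dataset_parents: set of (child_dataset, parent_dataset) pairs already recorded.
    """
    mismatches = []
    for cid, pid in child_parent_id_list:
        pair = (dataset_of_file[cid], dataset_of_file[pid])
        if pair not in dataset_parents:
            mismatches.append((cid, pid))
    return mismatches
```

An empty result means every file pair is consistent with the recorded dataset parentage. A non-empty result is the case where the API would report an error.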

We will deal with these parentages block by block, so we expect WMAgent to send one block of data to DBS each time it calls these APIs.

ticoann commented 6 years ago

I think we should send the child block information with the call as well, if the method is restricted to child files that all come from the same block.

yuyiguo commented 6 years ago

Then the API will be insertFileParentages([(cid1, pid1), (cid2, pid2), ... (cidn, pidn)], childBlockName).

yuyiguo commented 6 years ago

insertFileParentages([(cid1, pid1), (cid2, pid2), ... (cidn, pidn)], childBlockName)

The unit tests should include both client and server side tests.

  1. Negative tests: if either the parentage relationship [(cid, pid),(cid2, pid2)...()] or childBlockName is missing, the call should fail.
  2. If childBlockName is not a str, it should fail.
  3. If the relationship is not a list, it should fail.
  4. Input: a list of file parentage and childBlockName. Construct a new dataset/block whose data given to DBS has dataset parentage but no file parentage, keeping the file parentage aside for later. Insert the dataset/block using the bulk block insert API, search for the file parentage using listFileParents(listofchildLFNs, listofParentDatasets), and insert it with this API. Compare the original file parentage with the output of listFileParents(childBlockName).
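
The three negative cases in the list above could be sketched along these lines. `validate_inputs` is a hypothetical stand-in for whatever input checking the client or server ends up doing, and the block name `/a/b/c#1` is only a placeholder value.

```python
import unittest


def validate_inputs(child_parent_id_list, child_block_name):
    """Hypothetical input check, not DBS code: raise on missing or mistyped input."""
    if child_parent_id_list is None or child_block_name is None:
        raise ValueError("both the parentage list and the child block name are required")
    if not isinstance(child_block_name, str):
        raise TypeError("the child block name must be a str")
    if not isinstance(child_parent_id_list, list):
        raise TypeError("the parentage must be a list of (cid, pid) pairs")
    for pair in child_parent_id_list:
        if not (isinstance(pair, (tuple, list)) and len(pair) == 2):
            raise TypeError("each relationship must be a (cid, pid) pair")


class NegativeInputTests(unittest.TestCase):
    def test_missing_arguments(self):
        # Case 1: either argument missing.
        with self.assertRaises(ValueError):
            validate_inputs(None, "/a/b/c#1")
        with self.assertRaises(ValueError):
            validate_inputs([(1, 2)], None)

    def test_block_name_not_str(self):
        # Case 2: the block name is not a str.
        with self.assertRaises(TypeError):
            validate_inputs([(1, 2)], 42)

    def test_relationship_not_list(self):
        # Case 3: the relationship is not a list.
        with self.assertRaises(TypeError):
            validate_inputs("1,2", "/a/b/c#1")
```

Saved as a module, the tests can be run with `python -m unittest`. Case 4 needs a live DBS instance and is not covered by this sketch.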

ticoann commented 6 years ago

@yuyi what would you like to name the first parameter? The second one is childBlockName.

ticoann commented 6 years ago

It seems all other DBS parameters are named not in CamelCase but in the Python standard style. Maybe we should name these parameters the same way. What do you think?

yuyiguo commented 6 years ago

You are right, @ticoann