Allow blocks with partially resolved parentage information

todor-ivanov commented 1 year ago

Impact of the new feature There could be many groups affected by such a change.

Is your feature request related to a problem? This is a request for a feature change which is related to a recently found custom use case involving miscommunication between WMCore and DBS. Here is the related WMCore issue and description of the situation: https://github.com/dmwm/WMCore/issues/11260#issuecomment-1460044696

Long story short, we need to allow uploading blocks with partially resolved parentage information. This means few of the files present in a block may have no relation to files from the parent dataset (missing or partially resolved parentage information).

The main question here is eventually to Core SW and Physics groups as well - How mandatory is this constraint and How badly would such a change affect other steps?

Describe the solution you'd like

Solution #1: We simply remove the check for fully resolved parentage information for all files in a given block, here: dbs/fileparents.go#L344-L358
Solution #2: We allow files from the child block to refer to a NO parent file by setting the parent file id to a predefined default value (in accordance with the DBS data structures), which should be an indication that these files are missing the upper level files which they have been produced from.

Describe alternatives you've considered

The alternative approach wold be for WMCore to start developing a mechanism for automatically invalidating all those files from DBS and Rucio, which will go silently in the background. So far we have no numbers to cite about how frequent this is and that's why we cannot tell if and to what extent of data reduction it could lead. But in any ways, we are not in favor of this alternative path.

Additional context None

vkuznet commented 1 year ago

@todor-ivanov , thanks for creating the issue. But I think neither solution is acceptable. For the solution 1 I do not know how it will impact other use-cases. For the solution 2 I do not know how to correctly do that in ORACLE DB, what is a default and why we should have it in file parent table. I think you are looking for solution 3:

client provides list of chiild, parent ids, e.g. [(1,null), (2,3)]
the first pair indicate that there is no parent for child id 1
we adjust DBS code to skip this pair during injection to DBS file parents table
since there is no parent record in file parent table for child id 1 then anyone who will look-up this lfn will no there is no parent for it.

How does it sound to you? @amaltaro please provide your input too.

todor-ivanov commented 1 year ago

Thanks @vkuznet, Actually from your comment I can see we are in coherence on the way how we should modify the logic. I simply failed to clarify that the change should happen through a proper change of the data structure in the client call. But yes, we are indeed talking about the same thing. Thanks for clarifying this. And I agree this would be the minimal change on the WMCore end.

@amaltaro, what is your opinion on this?

vkuznet commented 1 year ago

@todor-ivanov , please provide additional information which DBS API is affected by this request and provide an example of HTTP POST input payload. For DBS API please inspect the corresponding python client to identify which DBS API it is using, here is DBS API documentation which lists all DBS APIs.

amaltaro commented 1 year ago

Valentin, Todor, this discussion is in agreement with what we discussed yesterday. Just to summarize:

we need to make minor changes to WMCore such that the child/parent tuple is provided to the DBS client API, but with a None/null refeference to the parent id. Make sure there is a log record for these cases.
we need to then pre-process this data on the DBS Server side, removing tuples pointing to a None/null parent file id. Such that no action is taken for that tuple.

For the record, the only - expected - users for this API is WMCore, especifically that ReqMgr2 cherrypy thread.

vkuznet commented 1 year ago

I can work on this if @klannon approve my time allocation for this task, or we can delegate this task to @d-ylee . Meanwhile here I provide list of action items which should be done within DBS codebase:

the affected DBS API is fileparents and it comes from fileparents.go module
we need to loose constrain on FileParents data structure, see https://github.com/dmwm/dbs2go/blob/master/dbs/fileparents.go#L63 and remove validate parameters for PARENT_FILE_ID such that it can accept null
the validate function should be adjusted to skip validation of PARENT_FILE_ID if it is not provided, but if it is provided it should still perform the validation.
the insert function should check PARENT_FILE_ID and if it is not provide just exit immediately
and, finally we need to provide new unit test in int_fileparents.go test file to test insertion of pair where file parent id is null.

amaltaro commented 1 year ago

@d-ylee Hi Dennis, did you manage to check the feasibility of the developments listed above? Please let us know if further clarification is required here.

d-ylee commented 1 year ago

@amaltaro I think the discussion above is clear to me. I can work on this after I get the updating physics group name completed. Thanks!

d-ylee commented 1 year ago

@vkuznet After looking through the code, I noticed that for the database schema: https://github.com/dmwm/dbs2go/blob/928dc255e5a695546767c363c5d0fb9b37cd6fd8/static/schema/DDL/create-oracle-schema.sql#L866-L870

FILE_PARENTS has FILE_PARENT_ID as not null, and also has FILE_PARENT_ID as part of the PRIMARY KEY constraint. Would it be an issue if the CONSTRAINT contains a NULL value?

vkuznet commented 1 year ago

Dennis, it is good question and in fact should be answered by @amaltaro , @todor-ivanov . For instance:

if we loose constrain in FILE_PARENTS table then we can have entry in it for file ids which does not have parents
if we will keep constrain then we'll only keep files with parent information.

So, the two approaches will reflect how we'll query the data and how we will interpret the results of such query. For example, if we stick to approach 1, then our fileparents API will return a file id and not parent id because file id will be recoreded in FILE_PARENTS with no parent id. But if we choose second approach then fileparents will not even list such file id, but it will means that it has no parent.

I suggest that @amaltaro , and @todor-ivanov should evaluate both pros and cons and side effects of this decision.

amaltaro commented 1 year ago

From the DBS perspective, would there be any impact and/or performance degradation if we allow NN_FP_PARENT_FILE_ID to be Null?

When using the fileparents API, this is an example of call: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/fileparents?logical_file_name=/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root

which returns the following data

[
{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":57020800,"parent_logical_file_name":"/store/mc/Fall13/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1-v1/30000/AEF4E42B-1462-E311-8D2E-0025901D48B2.root"}
,{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":57020803,"parent_logical_file_name":"/store/mc/Fall13/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/GEN-SIM/POSTLS162_V1-v1/30000/A4668E29-1462-E311-AEB5-1CC1DE18CFEA.root"}
]

do I understand it right that, in case 1) above, we would get an output data as (just an example):

[
{"logical_file_name":"/store/mc/Phys14DR/WprimeToENu_M_4800_Tune4C_13TeV_pythia8/AODSIM/TEST_PU20bx25_PHYS14_25_V1_TEST_Alan-v9/00000/3E27F2B3-D975-E411-8793-02163E010EE6.root","parent_file_id":none,"parent_logical_file_name":""}
]

while in case 2) the output response would be an empty list. Can you someone please confirm this behaviour?

vkuznet commented 1 year ago

Having or not parent file id in FILE_PARENTS table will not impact DBS performance as DBS APIs use streaming. But allowing null parent will increase table size. By how much it depends on how may null parents we will have for our files. Said that, your understanding of output of fileparents API is correct, for case 1 we'll have [{'logical_file_name': 'bla', "parent_logcial_file_name': null}], while in case 2 you'll get empty list since no record for file bla will be present in FILE_PARENTS table.

todor-ivanov commented 1 year ago

Hi @vkuznet @d-ylee @amaltaro,

To me option 2) is a NoGo. It is quite confusing and introduces more complexity and confusion rathar than simplicity. This is exactly the type of solutions relying on hidden logic and unwritten policies and agreement which are to be well forrgotten in really near future and the knowledge left to be forwarded through the generations only by the means of spoken wisdom. That will make the code from both sides quite unreadable only in a year from now. I am much more for the solution 1), where the parrent is explicitly pointed as NULL, meaning that DBS simply has no information for the origin of the child file, but does have all the rest of the information related to this file. Plus, IIUC, in many many cases it is much more important for physics to find the file itself.

d-ylee commented 1 year ago

@vkuznet @amaltaro @todor-ivanov

If we go with 1), would we need to update the database schema for the FILE_PARENTS table to

Allow NULL value for PARENT_FILE_ID
Redefine the PRIMARY KEY to not use PARENT_FILE_ID

For the PRIMARY KEY, I am still not clear how we should define this

From a past message:

1. if we loose constrain in FILE_PARENTS table then we can have entry in it for file ids which does not have parents

Would we need to have a entry in the FILES table that is for no parent files, but when getting from /fileparents, return none in the parent_file_id field if this entry ID is referenced?

Also, would there be cases in which allowing null parentage for FILES also affect DATASET_PARENTS and BLOCK_PARENTS?

vkuznet commented 1 year ago

Dennis, everything can be much simpler than you describe. How about using -1 for PARENT_FILE_ID when it is not provided? This way, we do not need to adjust schema, the -1 is acceptable integer and we may use it in other tables and APIs to indicate of no parent presence.

todor-ivanov commented 1 year ago

hi @vkuznet @d-ylee

The negative value for missing parent file id is a valid workaround, which would prevent the need of table redefinition. But we need to be sure what is about to be returned to the client as well. Because, once recorded as -1 I believe it will be also returned as -1 in subsequent client queries, rather than null, as originally recorded by the client. This is slightly confusing but may work, provided we check if any further change on how we digest the data from those APIs in WMCore would be needed.

Denis, can you list the APIs in DBS which would develop such a silent data type metamorphosis if we opt for this workaround, so that we can check where and when WMCore uses them?

FYI @amaltaro

d-ylee commented 1 year ago

@todor-ivanov

Not sure if this is what you are asking for, but the APIs that are related to parentage are these:

/fileparents
/blockparents
/datasetparents

todor-ivanov commented 1 year ago

Hi @vkuznet @d-ylee

Coming back to this, it does not seem that all those APIs you list in your previous message would be affected by the workaround of returning negative value for missing parent file id. And so let me clarify, what it means affected in this case:

The client records null for the fileId field for a given run/lumi pair with missing parentage information
When later called to provide the file parentage information, the server returns -1 for the fileId field for run/lumi pairs with missing parentage information instead of the originally recorded value of null.

So to me, it seems only the APIs which return information on the run/lumi level granularity would be affected. So I'd bet only on /fileparents, but this was exactly the confirmation that we wanted from you actually.

So in the most straightforward use case, this might affect us is in the client's methods calls to list{ParentDataset,Block}Trio: https://github.com/dmwm/WMCore/blob/31d65166dc62c7251edd52890bc4ac2f0b79d7fe/src/python/WMCore/Services/DBS/DBS3Reader.py#L966

https://github.com/dmwm/WMCore/blob/31d65166dc62c7251edd52890bc4ac2f0b79d7fe/src/python/WMCore/Services/DBS/DBS3Reader.py#L990

So this is the place I think we will have to rethink this mechanism again, but I see no other way to test it other than just making it in testbed .... and check what the server returns and how we aggregate the parentage information.

To me this again would be yet another workaround, which we will have to remember forever.

amaltaro commented 1 year ago

Just an update of this issue as an outcome of the Zoom discussion we just had (Valentin, Dennis and myself), this is the list of action items that we came up with:

1. Dennis: Check if null json gets converted to 0 in Go language.
2. Todor/Alan - IF NEEDED: Update WMCore to provide parent id = 0 instead of None
3. Dennis: Find what are the constraints/validation performed on the DBS Server side.
4. Dennis - IF NEEDED: update DBS Server to deal with partial block information
5. Dennis: create a CERN oracle account
6. Dennis: Get a dump of testbed or production
 * re-create the ownership of the table (or preserve the ownership)
7. Dennis: deploy DBSServer in a test k8s cluster and share the instance url
8. Todor: deploy ReqMgr2 in a test cluster pointing to Dennis DBS url
9. Todor: Inject one of these blocks in this test instance
  * note that ReqMgr2 should not mark a workflow as ParentageResolved = True
10. Todor: Validate the read APIs

amaltaro commented 1 year ago

@d-ylee Hi Dennis, I wonder if you managed to look into the null marshaling process? Such that we can decide whether we will have to make further changes in WMCore or not.

d-ylee commented 1 year ago

After some basic testing, null json values get converted to the nil value for a given type in Go.

package main

import (
        "fmt"
        "encoding/json"
)

type TestStruct struct {
        Test int64 `json:"test"`
}

func main() {
        d := "{'test': null}"
        var t TestStruct
        json.Unmarshal([]byte(d), &t)
        fmt.Printf("%v", t)
}

In this example, the value of t.Test is 0. If the type of TestStruct.Test is string, json.Unmarshal handles it by using empty string.

This behavior is reflected in the documentation: https://pkg.go.dev/encoding/json#Unmarshal

The JSON null value unmarshals into an interface, map, pointer, or slice by setting that Go value to nil. Because null is often used in JSON to mean “not present,” unmarshaling a JSON null into any other Go type has no effect on the value and produces no error.

Null values in JSON are given the nil or 0 value of a given type. For int64, the nil value is 0.

I can do additional testing by creating a simple HTTP server and sending a Python request with a None as the key's value to see if it will have the same result.

amaltaro commented 1 year ago

@d-ylee thank you for these details! I think it is good enough and I understand that WMCore can keep providing a tuple of (int_child_file_id, None) whenever the parent file is gone (and/or invalidated). So no further changes should be expected in WMCore.

Please let us know how your communication with the CERN DBA goes and where you need support on anything on this front.

vkuznet commented 1 year ago

@d-ylee thanks for confirmation. PLEASE keep in mind that None is pure Python data-type and python client MUST provide proper JSON, e.g.

>>> import json
>>> json.dumps({"bla":None})
'{"bla": null}'

So, on a wire, i.e. when HTTP request is send, the python code MUST provide null and not None, those are two different things. The null is null value in JSON standard, while None is Python none data-type.

@amaltaro , @todor-ivanov you must ensure that whenever Python code provides (int_child_file_id, None) it should be properly converted to JSON data format via json.dumps call, otherwise we'll get a problem on server side.

amaltaro commented 1 year ago

@vkuznet yes, the python object is properly encoded as a json object. So no problems on this front are expected.

d-ylee commented 1 year ago

I did a request from Python in the same way as what @vkuznet detailed using json.dumps on a Python dict with key that has the value of None. The result is the same: Go decodes/unmarshal this as 0 for int64.

amaltaro commented 1 year ago

@d-ylee Hi Dennis, it's been 2 weeks since we had our last chat, and I believe we only managed to cover points 1 and 2. from this comment so far: https://github.com/dmwm/dbs2go/issues/94#issuecomment-1561421999

do you have any updates on the other points? Do you know have a DBS dump such that we can try things in a dev environment?

In addition to that, can you please share what the current constraints are for updating block/file parentage data? It might be that Valentin already provided it in one of the tickets we discussed this, but I could not find it.

It's important to mention that this issue is currently keeping data unnecessarily unlocked in production, so we better fix it sooner than later.

todor-ivanov commented 1 year ago

hi @d-ylee would you want us to organize another meeting if you need some more information from our side so we can proceed on this? If you think there is some step that we first need to do and is currently blocking the progress, please let us know?

d-ylee commented 1 year ago

I spoke to @yuyiguo about this issue, and she explained to me some of the policies from the past and about this partial parentage issue.

Before DBS3, partial blocks were allowed, and these were able to be inserted into DBS when there were missing blocks. This was because we may have not had all files or we did not have the parent-child relationships yet.

This was an issue because we might forget about these partial blocks and end up not knowing who is the parent.

Yuyi mentioned that the policy in DBS3 was to only allow full blocks. Workflow management was to have its own database to keep unfinished blocks there, then insert into the production database.

If we want to change this policy, we need to have it written down and record who is responsible to complete the block.

Another concern is when a child has multiple parents. The current schema in the BLOCK_PARENTS database is the PARENT_BLOCK_ID and THIS_BLOCK_ID (the child). These two columns have a unique constraint. This means that there can't be more than one -1 allowed for a single child. If there are multiple parents with the same children and we use the -1 logic, the UNIQUE constraint would be broken.

Another question I have is, do you have the location of the written DBS3 design policies?

What are your thoughts?

amaltaro commented 1 year ago

Hi Dennis,

Yuyi mentioned that the policy in DBS3 was to only allow full blocks. Workflow management was to have its own database to keep unfinished blocks there, then insert into the production database.

this is still the case. WMAgent keeps all the information in its local database, and inserts the full block into DBS Server with a single call. The only difference is that some files might not have any parent.

The current schema in the BLOCK_PARENTS database is the PARENT_BLOCK_ID and THIS_BLOCK_ID (the child). These two columns have a unique constraint. This means that there can't be more than one -1 allowed for a single child.

I am afraid I didn't follow this, sorry. Blocks can (and do) have multiple parent blocks, it is a Many to Many relationship.

Another question I have is, do you have the location of the written DBS3 design policies?

I don't, sorry. The best I can find is this contribution with a database schema in the slide 20: https://indico.cern.ch/event/214784/contributions/1512637/

@d-ylee @yuyiguo would you be available for a quick Zoom chat today? Or perhaps in the beginning of the next week.

yuyiguo commented 1 year ago

@amaltaro @d-ylee Sure, I can have a Zoom chat this afternoon. I have a meeting from 1:15-2:30ish. Do you wan to do it at 3 pm?

d-ylee commented 1 year ago

@amaltaro @yuyiguo I can also meet at 3PM

amaltaro commented 1 year ago

@amaltaro @yuyiguo 3pm FNAL time works for me as well. I am emailing you now with further coordinates. Thanks!

yuyiguo commented 1 year ago

Great!

amaltaro commented 1 year ago

@d-ylee @yuyiguo thank you for taking the time for this chat.

From our discussion, it looks like we are going to explore the first solution proposed in this issue, this: """ Solution #1: We simply remove the check for fully resolved parentage information for all files in a given block, here: dbs/fileparents.go#L344-L358 """

With that, a few concerns were raised and we need to check the following: 1) whether it is doable to relax the number of files constraint. In other words, change the server such that it allows a list of child files provided by the user to be <= (less or equal than) the actual list of files in the child block (which would be triggered when a file does not have any parent file). 2) check if DBS server is performing any data lookup based on the information provided by the client; or if DBS is actually using everything that is provided by the client. For instance, does DBS server use the parent dataset information; or does it find the parent dataset id based on the list of parent file ids?

Dennis, as we discussed as well, here is an example of a block JSON dump (which might be broken, as we are not able to insert it into DBS): https://amaltaro.web.cern.ch/amaltaro/forWMCore/dbs2go_98/c44686c3-86fa-4fc3-9902-ae328914b317.json

and this is an example of listChildParent data structure, with first item being the child id and second the parent id: https://amaltaro.web.cern.ch/amaltaro/forYuyi/parentage.json

as mentioned in this comment: https://github.com/dmwm/WMCore/issues/11260#issuecomment-1452455362

Please let me know your findings and/or if any extra information is required.

vkuznet commented 1 year ago

@amaltaro , the proper way to solve this issue is to either introduce a new DBS API which will do item #1 in your list or add additional parameter to existing API which end-user must specify for such behavior. This way, we preserve existing API backward compatibility and functionality, and allow user to bypass such constrain in existing API. If we agree on this, then number 2 in your list can do both since new API or existing API using external parameter can perform both actions. To avoid confusion I would favor of creating a new DBS API and therefore re-factor and share code from existing codebase in both APIs. But extra parameter will work too but it will be more confusing in my view.

amaltaro commented 1 year ago

@vkuznet that would be another way to address this issue, even though I would not call it "the proper" way.

The only clients of the dbs-client api insertFileParents is WMCore, which needs to have a mechanism to update children/parent relationship post block injection. If the constraint of number of files is relaxed to <= than the expected files in the block, this WILL still be backwards compatible.

In my opinion, I see no reason to over-complicate this development, which has already being dragged out for many months. As I said before, there is only 1 client for this API and the fix I propose should be pretty straight forward.

vkuznet commented 1 year ago

@amaltaro , good design practices suggests to separate APIs for different use-cases rather to introduce complicated if/else logic. The later was used in Python server code which (if you had a chance to look at it) is very hard to read and understand. Therefore, I expressed my opinion based on this argument. Plus, you never know that one client will only the case. I can easily foresee when we may have other clients, plain script (instead of python code), or CRAB, or any new development we may have in a future. And, even using single client as it is right now you really address different use-cases. For maintenance reason it is better to address them via separate API rather complicate server code (even if we'll need make additional code-refactoring and wait a little bit longer).

amaltaro commented 1 year ago

@amaltaro , good design practices suggests to separate APIs for different use-cases rather to introduce complicated if/else logic.

to me this is exactly the opposite of what you are suggesting. Creating another API just to deal with a slightly different use case IS an over-complication, it is confusing and it causes mistakes on the client side.

The extra query string would be the best approach in my opinion, but that will require changes in:

DBS server side
DBS client side (new release has to be made as well and properly integrated into the WMCore stack)
WMCore

The later was used in Python server code which (if you had a chance to look at it) is very hard to read and understand.

The approach I am suggesting adds NO overcomplication and does not make the code more complicated, does not increase maintenance and is transparent to the end user. It also does not require the release of a new dbs3-client! It literally is a change in the source code from something like:

files_passed_by_client == files_in_block

to something like

files_passed_by_client <= files_in_block

Anyways, I think we will spend the rest of our days disagreeing on this! I'd rather let @d-ylee Dennis consider what is the best approach here, noting that at the moment we have 100s of workflows stuck in the system waiting for this development to be deployed in production.

vkuznet commented 1 year ago

Alan, I would like to take last change to convince you and the team to not add changes to the existing API. My last argument is the following:

by design (both Python and Go implementation) of DBS code provides comprehensive checks and input validation from the client
by using the same API and rely on different input we break the above requirements and open the door to handling input without cross check
if we'll implement the changes the way you propose DBS server will not have such ability to check if given input is right or wrong (i.e. it does or does not have parents) and it will just follow if/else logic without knowing and validating that client provide correct input

In my suggestion this is not possible by design since client will be forced to use either new API or provide external parameter to the API in question. Doing this way client a-priory will do a right job and DBS server will not rely on if/else logic of the input data. You may provide as many arguments as you want that such server implementation is more easy to do but you open a possibility that one day client will (by whatever mistake) provide incorrect input and DBS server will follow if/else logic to insert the data to the database. For instance, someone by mistake will not provide parents, or we may have a bug elsewhere in WMCore which may skip some parent, etc. In either case DBS server will follow the input if we will use the same API signature. To avoid this possibility we must enforce client to be explicit with the call (in other words client will know what it is doing by calling proper API or API with proper parameter rather providing an input only to DBS server).

I hope that you can take this arguments more seriously and properly evaluate the implication of implantation on DBS, but I'll not argue any more about how to implement this logic and leave it up to your and the rest of WMCore team to decide (here I have dual role, I belong to WMCore team and in my view the implementation should have separate API or API+external parameter and I implemented DBS logic). I think Dennis should implement whatever WMCore will agree.

Said that, I tag @todor-ivanov to evaluate my concern and you both should come to conclusion how the implementation should be done. Once you'll reach such agreement I think for Dennis it will be more clear how to implement it. At the end, it is just implementation but my concern that the choice may impact data consistency in DBS database in a long run.

todor-ivanov commented 1 year ago

hi @amaltaro @vkuznet @d-ylee, It is difficult to say here what is the better solution. The arguments and concerns for both cases are far in the future e.g. "are we going to stay the only user of this API", "Will we affect other's code behavior" etc... But as a general rule of thumb I'd not rely on the current situation and conclude for the far future.

One really important thing to take into account when making such a decision is the knowledge if our current policy, for not inserting files into DBS which lack parentage information, is acknowledged and well understood by every other group that may be affected. And here I mean both sides:

Eventual users of WMCore modules or DBS apis directly (including CRAB), and also
Those who need to consume the output from DBS, such as physics

We should not talk only about WMCore and people responsible for DBS server. And this, I believe, is information we do not have at the moment. Obviously there is no policy document for DBS Server behavior..., ence our blindness on the applicability of this policy. There could easily be some proliferation of the reverse opinion among other groups who rely on the data returned by DBS - meaning they may rely on the parent-child relationship in DBS to be satisfied in any case. So we are actually in the state of lack of consensus on the policy itself.

Now about the possible solutions: The fix only on the server side is something that is less work and is an encapsulated change, but because of the above unknowns, I'd bet for the harder one - the one that forces the client to explicitly and consciously consciously stress which policy he would chose for uploading the files. It is not only backwards compatibility that we are talking about here, but also about bearing the responsibility for doing the one or the other. At the end we will not change the API, regardless of which solution we chose here - !!! we only change Server behavior !!!

Creating a new API is out of question - simply not worth it.
Setting an option to the already existing API for choosing the correct policy which would be defaulting to the current one, in my opinion is the way to go.

Yes this is going to affect/require changes from everybody, as @amaltaro said:

DBS server side
DBS client side (new release has to be made as well and properly integrated into the WMCore stack)
WMCore

In the case of WMCore this should be fairly small change - adding the new option here and here The tests and validation procedure, on the other hand, would stay as usually long and need to be a full one.... well it has its price. For the rest, I cannot tell .

Something more. Since this situation (or what concerns WMCore) is an expected one, and our use cases are well identified in advance - as early as building the lists for the parentage resolution itself, I'd not further complicate the code by implementing extra calls to DBS for checking dbs errors and retries. This would only affect code readability and scalability both together without any benefit. So I'd go directly for just changing the policy on the WMCore side by using the newly provided option to the same API.

yuyiguo commented 1 year ago

I just back from my vacation and saw this discussion. Right before @d-ylee and I left for vacation, @amaltaro, Dennis, and I had a Zoom meeting on this. We discussed why the "-1 " would not work with the DBS DB schema. Alan pointed out that the missing parent files would be lost forever, in other words, the file parentage would not be updated in the future. So we concluded that it is better to just not put anything in DBS for the ones that lost their parents' files. I think @vkuznet made a valued argument to protect data integrity. To consider both DBS and WMCore, I would propose below API and server upgrade:

Add missingfiles=0 to the existing API. This would be the default. So no effect on others except for WMcore.
When WMCore calls the API, it tells DBS how many missing files it has with missingfiles=n.
DBS server continues to verify the child-parent relationship. If the missing files does not match n, it will reject the input. This way both client and server know exactly what should happen. This is my two cents.

d-ylee commented 1 year ago

@amaltaro @todor-ivanov If this is what we would like to do, I think we would need to incorporate the number of missing files to the check here: https://github.com/dmwm/dbs2go/blob/928dc255e5a695546767c363c5d0fb9b37cd6fd8/dbs/fileparents.go#L352

I do see that the utils.Equal function also makes sure the file id and the block file ids match. This should probably also be validated during the insert request and may need to be another parameter.

Another question I have is that would DBS need to keep track of these missing files? Or would this be on WMCore's side? Both this question and the above validation seem to affect each other.

Thoughts?

amaltaro commented 1 year ago

@d-ylee Hi Dennis, Yuyi's idea looks good to me and having the client to provide how many files are missing in the json data is a good approach (defaulting it to 0, if not provided).

On what concerns keeping track of the missing files. I would say that the DBS server should at least log that a block with incomplete parentage action was performed (sort of an access log). On the WM side, we do log that a given file failed to find its parentage file.

Please let us know if there are any remaining open questions.

amaltaro commented 1 year ago

@d-ylee Dennis, in addition to what we have discussed today, I just wanted to update this thread with a way to migrate dataset from prod/global do your dev cluster (such that we can test things out):

from dbs.apis.dbsClient import DbsApi
dbsApi = DbsApi(url = 'https://cmsweb-test3.cern.ch/dbs/dev/global/DBSMigrate/')
dbsUrl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
dataset = "my dataset name"
migrateArgs = {'migration_url': dbsUrl, 'migration_input': dataset}
dbsApi.submitMigration(migrateArgs)

I looked into the dbs3-client code and came up with the following curl standalone call:

curl -k --cert $X509_USER_CERT --key $X509_USER_KEY -vv -X POST "https://cmsweb-test3.cern.ch:8443/dbs/dev/global/DBSMigrate" -d '{"migration_url": "https://cmsweb.cern.ch/dbs/prod/global/DBSReader", "migration_input": "/DM_MonoZPrime_A_Mx50_Mv5000_gDM1gSM0p25_Zprime2p0_TuneCP5_13TeV_madgraph-pythia8/RunIISummer20UL17MiniAODv2-106X_mc2017_realistic_v9-v2/MINIAODSIM"}' -H 'Content-Type: application/json' -H 'Accept: application/json' -H 'UserID: cmst1@vocms0192.cern.ch' -H 'User-Agent: Alan-cURL'

but all my attempts failed with:

< Response-Status: 405 Method Not Allowed

and I cannot figure why. Perhaps @vkuznet has a working example with curl other than this: https://github.com/dmwm/dbs2go/blob/master/docs/MigrationServer.md#examples which relies in a different service/API, localhost and curl command which is not available in the dbs2go images(?)

vkuznet commented 1 year ago

Alan, could you please read closely documentation. It says:

# migration document
cat > m.json << EOF
{
    "migration_url": "https://.../dbs/prod/global/DBSReader",
    "migration_input": "/a/b/c#123"
}
EOF

# submit migration request
curl -H "Content-Type: application/json" -d@$PWD/m.json \
    http://localhost:9898/dbs2go-migrate/submit

while in your request your URL does not specify /submit API.

vkuznet commented 1 year ago

Moreover, each DBS server has its own set of APIs which you can easily find via /apis end-point, e.g. here is list of APIs for DBSMigrate which I took from testbed:

scurl https://cmsweb-testbed.cern.ch/dbs/int/global/DBSMigrate/apis
[{"api":"/dbs/int/global/DBSMigrate/blocks","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/submit","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/remove","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/blockparents","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/serverinfo","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/total","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/bulkblocks","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/dummy","methods":["GET","POST"]},{"api":"/dbs/int/global/DBSMigrate/errors","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/help","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/status","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/datasetparents","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/healthz","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/metrics","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/apis","methods":["GET"]},{"api":"/dbs/int/global/DBSMigrate/process","methods":["POST"]},{"api":"/dbs/int/global/DBSMigrate/cancel","methods":["POST"]}]

Similar /apis end-point exist for reader, writer servers and you can easily see which method each API is using.

amaltaro commented 1 year ago

@vkuznet Valentin, you probably missed a few important points: 1) we are trying to test dbs2go-global-m, which is the service that receives dataset migration requests 2) the server expected API endpoint is dbs/dev/global/DBSMigrate/submit, instead of dbs2go-migrate/submit. Unless you need to run a different service which provides a different set of APIs. If so, then you start deviating from what you actually run in production (given that you are testing a different setup) 3) none of the dbs2go images provide curl, so we need to deploy another service in the kubernetes cluster just so we can use curl to reach a the service that we need with dbs2go-global-m.dbs.svc. Which is inconvenient.

The apis endpoint will be useful though. Thanks

vkuznet commented 1 year ago

Alan, I understand what you are doing. The provided dbs2go documentation does not list our production/test/dev URLS and refers to actual service. Of course I would expect that you know your server endpoint, e.g. /dbs/dev/global/DBSMigrate and use it instead of listed in documentation dbs2go-migrate. The curl should not be inside of dbs2go images. Instead, as I pointed many times just deploy new curl image in your k8s and use it to access any service. Here its manifest file: https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/services/curl.yaml And, you can deploy it as easy as

kubectl apply -f curl.yaml

Then curl pod is created in default namespace and you can login to it and use curl to access your services.

amaltaro commented 1 year ago

Okay, it looks like Dennis has some documentation work to be done at some point, such that a developer can go as closer to the production environment as possible, to have more meaningful tests.

I didn't remember we had a manifest only for curl, that will definitely be handy. Thanks.

@d-ylee Dennis, as we discussed yesterday, I finally got many json dumps for a StepChain workflow that you could play with. Note that they only provide dataset parent information, no file and or block parentage. You can find them here: https://amaltaro.web.cern.ch/amaltaro/forDennis/

Their parentage relationship is:

  "ChainParentageMap": {
    "Step1": {
      "ParentDset": null,
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM"
      ]
    },
    "Step2": {
      "ParentDset": "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-GenSimFull_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM",
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-Digi_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-DIGI-RAW"
      ]
    },
    "Step3": {
      "ParentDset": "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-Digi_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-DIGI-RAW",
      "ChildDsets": [
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/DQMIO",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/MINIAODSIM",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/GEN-SIM-RECO",
        "/RelValWToLNu_14TeV/CMSSW_12_4_0_pre2-RecoNano_2021_SC_EL8_Sept2023_Val_Alanv4-v11/NANOAODSIM"
      ]
    }
  },

Please let me know if this enables you to move forward with this development and testing.

d-ylee commented 1 year ago

@amaltaro Thanks for the json dump.

@vkuznet I was going through the fileparent insert API and noticed that we are validating the incoming json twice: https://github.com/dmwm/dbs2go/blob/e71c8a4ef1b6d64058aabee95d6ede53ab163d1c/dbs/fileparents.go#L85

https://github.com/dmwm/dbs2go/blob/e71c8a4ef1b6d64058aabee95d6ede53ab163d1c/dbs/fileparents.go#L368

Was there a reason for this?

To implement the missingFiles, I need to modify this validation step to accept 0 for PARENT_FILE_ID or to handle the second error in the Validate function: https://github.com/dmwm/dbs2go/blob/e71c8a4ef1b6d64058aabee95d6ede53ab163d1c/dbs/fileparents.go#L209-L222

vkuznet commented 1 year ago

Why do you think we are validating twice? Those are two different APIs, the former one is Insert in FileParents

// FileParents represents file parents DBS DB table
type FileParents struct {
    THIS_FILE_ID   int64 `json:"this_file_id" validate:"required,number,gt=0"`
    PARENT_FILE_ID int64 `json:"parent_file_id" validate:"required,number,gt=0"`
}

record. The corresponding json will looks like this:

{"this_file_id": 123, "parent_file_id" : 456}

While latter is from InsertFileParentsBlockTxt APIs which reads FileParentBlockRecord

type FileParentBlockRecord struct {
    BlockName         string    `json:"block_name"`
    ChildParentIDList [][]int64 `json:"child_parent_id_list"`
}

record. The corresponding JSON record will look like this:

{"block_name": "block#123", "child_parent_id_list": [123,456]}

Both records comes in different DBS requests (you may need to figure out which ones). Therefore both require their own validations.

Regarding validation procedure. The 0 is default value if it is not provided in JSON record for given data-type. Therefore, I'm not a big fan to change validation code for it. Instead, I suggested to use -1 to clearly indicate missing parent. I'll leave you to discuss this with Alan.

dmwm / dbs2go

Allow blocks with partially resolved parentage information #94