dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

populate resource table with PSN which doesn't have PNN #6433

Closed ticoann closed 6 years ago

ticoann commented 8 years ago

As a first step, use a fake PNN to populate the wmbs_location_senames table.

For example, T3_US_OSG doesn't have a PNN.

ticoann commented 8 years ago
  1. populate wmbs_location table with
  2. update wmbs_location_senames with fake PNN
  3. it should get the threshold info from SSB, but it is not clear that it would be registered there.

Make sure the fake PNN is not used for matching locations (the site name should be used instead).

ericvaandering commented 8 years ago

Let's see if Stefan has the same issue with the PNN stuff.

alexanderrichards commented 8 years ago

@ticoann @ericvaandering I have added a pull request that will address the issue of including dummy PNNs. This will essentially pick up any CMSName from the list that firstly matches the regex that we are adding (e.g. T1, T2, all) and that has no PNN associated with it. It will then add a dummy PNN as 'PSN_Dummy'.

This works nicely in my tests but please feel free to have a go yourselves. What further is needed for this issue?
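
The selection rule described in this comment could be sketched roughly as follows. This is only an illustration and not the code from the pull request: the function name `add_dummy_pnns`, the dict layout, and the example regex are assumptions made here.

```python
import re


def add_dummy_pnns(psn_to_pnns, pattern):
    """Return a copy of the mapping in which every PSN that matches
    `pattern` and has no PNN gets a single '<PSN>_Dummy' entry.

    Hypothetical helper: the real patch may be structured differently.
    """
    regex = re.compile(pattern)
    result = {}
    for psn, pnns in psn_to_pnns.items():
        if regex.match(psn) and not pnns:
            # No real PNN is known for this PSN, so invent a placeholder.
            result[psn] = [psn + "_Dummy"]
        else:
            result[psn] = list(pnns)
    return result


# Illustrative input: one PSN with a PNN, one without.
site_map = {"T2_CH_CERN": ["T2_CH_CERN"], "T3_US_OSG": []}
patched = add_dummy_pnns(site_map, r"T[123]_.*")
```

With this input, `patched["T3_US_OSG"]` holds the single placeholder `T3_US_OSG_Dummy`, while the entry for `T2_CH_CERN` is left unchanged.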

hufnagel commented 8 years ago

Are we sure this fake entry isn't actually used for anything? And on second question, if it isn't used for anything, why do we need it in the first place? Just map to None instead of some random string that means nothing.

alexanderrichards commented 8 years ago

> And on second question, if it isn't used for anything, why do we need it in the first place.

good question :wink:

I'm not 100% sure on the reason for needing these fakes, @ticoann understands their need better so I'll let him answer.

> Just map to None instead of some random string that means nothing.

I can easily enough change this patch to map to None if people prefer? I was merely fixing the issue as presented (i.e. needing a fake PNN).

hufnagel commented 8 years ago

I would prefer None because then if some other code actually tries to use this it'll fail. And we fix it. The danger I see in using dummy values is that some other code just assumes this is real and then we pass this off to external systems, include it in JDL etc. And we get undetermined behavior because the values that these external systems assume exist don't match anything real.

hufnagel commented 8 years ago

What I really would prefer is to redefine the mapping completely in order to support PSN without PNN. But this could be quite a significant change, maybe mapping to None as PNN and making sure this actually works would be an easier first step...

alexanderrichards commented 8 years ago

> What I really would prefer is to redefine the mapping completely in order to support PSN without PNN.

I don't see the distinction between this and your previous suggestion of mapping to None? Unless of course you are suggesting mapping not to None but an empty set/list which I think makes more sense to me. But then I guess it comes down to what these fakes are actually needed for.

ericvaandering commented 8 years ago

Just to be clear, the PNN would be registered as T3_US_OSG_Dummy, right? That's what the PR looks like to me.

alexanderrichards commented 8 years ago

> Just to be clear, the PNN would be registered as T3_US_OSG_Dummy, right? That's what the PR looks like to me.

Correct

hufnagel commented 8 years ago

What I meant was that if you go in and would define this from scratch, one of the requirements would be to support PSN without PNN. So basically you know that you can't key off PNN to get PSN. It changes your approach and how you design. But as I said, doing it this way potentially will require some deep changes.

Using None is one way to do it (and the obvious way of doing it in the current design), but we might have done it differently. What I am talking about here is basically a paradigm shift from looking at PNN as a property of a PSN to a somewhat more flexible relationship between the two. For instance, you might have a PNN in the future with no PSN.

Commenting on technical implementation details: if we use None in the mapping, we won't be able to do a lookup from PNN to PSN. Which I think is fine given that a fake PNN should never be used for a lookup anyways.

Basically, using dummy values is only useful if we propagate them to other parts of the system that use them. But that is exactly what is most dangerous, because the other parts of the system will treat them as real values. If we do this we have to be super careful that these dummy values don't go where they aren't supposed to.
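
The difference between the two options can be illustrated with a small sketch. It assumes a simplified one-PNN-per-PSN dict, and the helper name `invert` is hypothetical, not an existing WMCore function.

```python
def invert(psn_to_pnn):
    """Build a PNN -> [PSN] lookup, skipping PSNs whose PNN is None."""
    pnn_to_psns = {}
    for psn, pnn in psn_to_pnn.items():
        if pnn is None:
            # A PSN without storage contributes nothing to the reverse lookup.
            continue
        pnn_to_psns.setdefault(pnn, []).append(psn)
    return pnn_to_psns


# Illustrative mappings only.
with_none = {"T2_CH_CERN": "T2_CH_CERN", "T3_US_OSG": None}
with_dummy = {"T2_CH_CERN": "T2_CH_CERN", "T3_US_OSG": "T3_US_OSG_Dummy"}

none_lookup = invert(with_none)
dummy_lookup = invert(with_dummy)
```

In this sketch the None variant leaves the reverse lookup free of fake names, whereas the dummy variant makes `T3_US_OSG_Dummy` look like any other PNN to code that consumes `dummy_lookup`.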

ericvaandering commented 8 years ago

I agree with everything you're saying, @hufnagel . I want @ticoann to weigh in here on a) why this is necessary and b) if, for some reason, each PSN needs its own dummy PNN in the scheme we have.

amaltaro commented 8 years ago

I think the dummy PNN is needed for tracking the file-to-location mapping available in the agent. If there is no PNN, the available files (in wmbs_sub_files_available) will not have a match in wmbs_file_location, and thus jobs are not created. This rules out the None PNN, unless we make quite a few changes.
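
A minimal sketch of the effect described here, using an in-memory SQLite database. The table names come from this thread, but the column names and the query are simplified assumptions and not the actual WMBS schema or job-creation query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical columns; the real WMBS tables have more.
    CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
    CREATE TABLE wmbs_file_location (fileid INTEGER, pnn TEXT);

    -- Two files are available to subscription 1 ...
    INSERT INTO wmbs_sub_files_available VALUES (1, 100), (1, 200);
    -- ... but only file 100 has a recorded location.
    INSERT INTO wmbs_file_location VALUES (100, 'T2_CH_CERN');
""")

# An inner join silently drops files that have no location row.
rows = conn.execute("""
    SELECT a.fileid, l.pnn
    FROM wmbs_sub_files_available a
    INNER JOIN wmbs_file_location l ON l.fileid = a.fileid
    WHERE a.subscription = 1
    ORDER BY a.fileid
""").fetchall()
```

If job creation relies on a join of this kind, file 200 never shows up in `rows`, so no job would be created for it.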

hufnagel commented 8 years ago

That is just wrong. A site "location" is a PNN, period. It's not a PSN. A PSN is NOT a location. If we really still record location this way we should completely change it.

hufnagel commented 8 years ago

We can either fake things and keep running into these issues because our schema doesn't match reality or we can spend the time to get it right.

hufnagel commented 8 years ago

I thought this was the whole point behind these changes, to get rid of all the fakes and burning tires we had to jump through to make our schema match what is actually deployed in reality. Replacing one way to fake things with another way to fake things doesn't really improve much IMO.

amaltaro commented 8 years ago

Well, I did not say site location was PSN, I know it's PNN. If we want to move forward on this issue, then let's stop with philosophy and instead write some use cases and how the system is supposed to work (though a new issue I suppose).

My first questions would be:

  1. sites without PSN, where do we run jobs? jobs don't run in PNN...
  2. sites without PNN, where do we stage out files? this is easier, I think, since the site-local-config would decide where to go.

hufnagel commented 8 years ago

Location is PNN and PNN alone. Jobs run at PSN that are "close" to these PNN (via the PNN to PSN mapping from SiteDB). PSN without PNN only ever get jobs assigned to them if we force the workflow to submit jobs there.

Your question 1 doesn't make sense. A "site" doesn't exist per se, we have PNN and PSN and relationships between them. We used to call a "site" a PSN with an attached PNN, but this is why we are in this mess. That definition works for 95% of our resources, but not for all.

1) If you have data sitting at a PNN without a PSN, you never run any jobs reading that data. Perfectly fine. WMAgent can't do anything with standalone storage, only PhEDEx can.

2) Stageout is defined in the siteconf, you don't have to worry about it. FJR will tell you where the job wrote the data. WMAgent does not need to know this upfront.

hufnagel commented 8 years ago

Basically, the things we need to implement in the WMAgent itself are:

1) Get the PNN to PSN mapping from SiteDB so we can get the list of PSN we should use if the data is present at a given list of PNN.

2) A mechanism to override the PSN (skipping the PNN to PSN lookup).

3) A way to put resource threshold on a PSN.

Nothing else really. Keeping CE and SE and PNN and everything but the kitchen sink in resource control is just historical baggage without any technical need anymore. The FJR tells you what PNN a file sits at, that PNN is all you really need for location bookkeeping.
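
The three requirements listed above might look roughly like this in code. It is a sketch under stated assumptions: the function names, the dict-based mapping, and the example site names and numbers are illustrative, and nothing here calls SiteDB or any WMCore API.

```python
def psns_for_data(pnns, pnn_to_psns, override_psns=None):
    """Return the sorted list of PSNs where jobs reading from `pnns` may run.

    If `override_psns` is given, the PNN -> PSN lookup is skipped entirely.
    """
    if override_psns is not None:
        return sorted(set(override_psns))
    psns = set()
    for pnn in pnns:
        # A PNN with no associated PSN simply contributes nothing.
        psns.update(pnn_to_psns.get(pnn, []))
    return sorted(psns)


def has_free_slots(psn, running, thresholds):
    """True if the PSN is below its threshold (unknown PSNs get threshold 0)."""
    return running.get(psn, 0) < thresholds.get(psn, 0)


# Illustrative data; in practice the mapping would come from SiteDB.
pnn_to_psns = {
    "T1_US_FNAL_Disk": ["T1_US_FNAL"],
    "T2_CH_CERN": ["T2_CH_CERN", "T2_CH_CERN_HLT"],
}
thresholds = {"T1_US_FNAL": 2, "T2_CH_CERN": 1, "T3_US_OSG": 5}
running = {"T1_US_FNAL": 2}
```

Note that `T3_US_OSG` appears only in `thresholds`: in this sketch a PSN without any PNN can still carry a threshold and receive jobs through the override, without any fake storage entry.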

ericvaandering commented 8 years ago

Of course we do have "sites" but WMAgent needn't concern itself with those. It also, really, shouldn't concern itself with PNNs either except to find the PSNs associated with input data by PNN. I think it's clear that if we were doing this from the get-go, every block of a workflow would be stored in our schema with a list of PSNs it could run at. That would be populated by a PNN to PSN lookup from SiteDB at creation time.

I see again Dirk has said much the same as I was typing. :-)

hufnagel commented 8 years ago

Well, yes, sites exist as administrative units. I was just saying for the WMAgent you have PSN (processing resources) and PNN (storage resources) and you don't really (and shouldn't) have to worry about how they relate to each other in an administrative sense apart from the small little thing we care about, which is mapping a list of PNN to a list of PSN a job reading from these PNN can efficiently run at.

hufnagel commented 8 years ago

In other words: resource control does not have to re-create the whole CMS infrastructure map with all the bells and whistles. Resource control puts thresholds (overall and per job type) on PSN. It shouldn't do more if that is not really needed.

hufnagel commented 8 years ago

Btw, this is the ideal (clean implementation) system we would like to have. Might not be practical to do it that way though, given that what we have implemented is quite a bit different. Look at what we have and how we can change it to get close to the ideal. We need something that supports PSN without attached PNN now and if possible we don't want to hack in things to get this done that we would have to remove again later. We don't need to implement something ideal and minimal now.

ticoann commented 8 years ago

The reasons I asked Alex to add a fake name instead of using None (actually, we shouldn't even use None; we would just not make the entry in wmbs_location_senames in that case):

  1. I know that works,
  2. it doesn't require any schema change so we can patch.

But yes, I think Dirk is right; maybe we should do a proper fix, since we already have a hack if we need to add those sites in a production agent.

Alex, could you update your patch so that the wmbs_location table is populated whenever a PSN exists, regardless of whether a corresponding PNN exists, and the wmbs_location_senames table is populated only when a corresponding PNN exists?

Then we can run the tests to check what breaks, and fix things from there.

Sorry for the late reply.

ericvaandering commented 8 years ago

Sorry, are we still using wmbs_location_senames as a table? I thought the PNN patch and cleanup would have totally gotten rid of that.

hufnagel commented 8 years ago

If we still use it, it would store PNN now. But yes, we should probably rename the table then. Too confusing otherwise in the long run.

ticoann commented 8 years ago

Alex already created the patch for renaming it: https://github.com/dmwm/WMCore/pull/6365. I will run the tests first.

amaltaro commented 7 years ago

Sooo, getting back to this long long thread :-)

@alexanderrichards I think the last comment made by Seangchan also expresses my ideas: """ Alex, could you update your patch so that the wmbs_location table is populated whenever a PSN exists, regardless of whether a corresponding PNN exists, and the wmbs_location_senames table is populated only when a corresponding PNN exists? """

So, when we fetch data from sitedb, we add that resource to the local database regardless whether it has a PNN or not. If it has, then we add it to wmbs_location_pnns, otherwise changes are made to wmbs_location only with the PSN.

Can you update your patch and run tests to see what is going to break?
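
The requested behaviour could be sketched as follows, with plain Python lists standing in for the two database tables. The function name `populate` and the input layout are assumptions for illustration; the real patch would go through the WMBS data access layer.

```python
def populate(resources):
    """Split PSN -> [PNN] data into rows for the two tables.

    Every PSN gets a wmbs_location row; wmbs_location_pnns rows are
    created only for PSNs that actually have at least one PNN.
    """
    wmbs_location = []
    wmbs_location_pnns = []
    for psn, pnns in resources.items():
        wmbs_location.append(psn)
        for pnn in pnns:
            wmbs_location_pnns.append((psn, pnn))
    return wmbs_location, wmbs_location_pnns


# Illustrative input, as it might be assembled from SiteDB.
locations, location_pnns = populate(
    {"T2_CH_CERN": ["T2_CH_CERN"], "T3_US_OSG": []}
)
```

Here `T3_US_OSG` ends up in `locations` but contributes no row to `location_pnns`, so no dummy or None entry is needed.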