dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Big JSONs uploaded to WMArchive #10879

Closed todor-ivanov closed 2 months ago

todor-ivanov commented 2 years ago

Impact of the bug: WMAgent, WMArchive

Describe the bug: While debugging missing monitoring information from ES, the MONIT team discovered some quite big JSON documents uploaded to WMArchive and ES [1]. The documents were on the order of 2 MB. The documents themselves seem to be sane. They originate from a few workflows that tend to create huge LFNArrays for the input files of the job itself. We need more feedback from other groups, such as P&R and PdmV, to estimate the expected future frequency of such workflows.

In the meantime, the MONIT side applied a temporary workaround so that we can continue to use the monitoring properly (all of this is well explained in the relevant ticket [1]), but we cannot rely on that in the long term. This ticket is therefore a follow-up on the decisions taken on our side to revisit the information which we upload to WMArchive and ES.

[1] https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2961279

How to reproduce it: N/A. In order to reproduce those huge documents we need feedback that relates them to a specific type of workflows/campaigns.

Expected behavior: Documents uploaded to WMArchive and ES should be on the order of kilobytes rather than megabytes.

Additional context and error message: N/A

leggerf commented 2 years ago

@todor-ivanov we have the meeting with MONIT today; will you be able to participate and give an update on this issue?

todor-ivanov commented 2 years ago

Hi @leggerf, I am fine with joining. There has not been much progress on this yet, though. What we managed to figure out last time is that it was a workflow-related issue affecting some specific types of workflows. It would also be good to know whether the current data size is still unusually big or whether it is back to normal.

vkuznet commented 2 years ago

The MONIT team has once again reported large WMArchive JSON documents which had a significant impact on the performance of the Kafka pipeline. @leggerf can provide all the details. I suggest that the WMCore team address this issue with high priority, since it can lead to two unpleasant scenarios:

  1. MONIT will start dropping WMArchive data (which they have already done) and our data-ops team will start losing monitoring information
  2. They can block the WMArchive service to keep the Kafka pipeline alive, which again will affect data-ops

amaltaro commented 2 years ago

Todor and I had a chat and I understood that this problem was actually caused by misbehaving workflows, where thousands of input files were added to the LFNArrays field.

I also managed to find one of my replies - discussed among a small group of people - here it is: """ it looks like the LFNArray (and PFNArray) contains a list of secondary files read by the job. As we increase the pileup, this might become more common.

Given that the list of secondary files used by that job is already available in the cmsRun FJR, I'd be in favor of actually dropping it from WMArchive. In other words, LFNArray could report only primary (and parent, if any) files used by that given job.

We might as well drop the PFNArray... I believe we can't even say where the file came from, but just the protocol used to access the file. I shouldn't say it, but it looks like we could review, document, and refactor what actually gets posted to WMArchive... """

In short, we need to review whether LFNArrays and PFNArrays are still meaningful for the WMArchive service and its users, and decide whether we drop them from the report or modify the schema to avoid that large amount of data.

leggerf commented 2 years ago

Yes, this is exactly the summary of actions from when we opened the issue at the end of October. The timescale was to give feedback in about two weeks, and now it is almost the end of 2021. Can you please bump up the priority of this issue, so that we can give feedback to the MONIT team ASAP?

vkuznet commented 2 years ago

@amaltaro , @todor-ivanov I can take this issue and resolve it myself, but I need to know which solution you want to put in place. I see two possibilities here:

So, which one do you prefer?

amaltaro commented 2 years ago

I would go with the most efficient and consistent solution, which is dropping the PFNArray from the very beginning. In other words, we will need to change the data schema for such documents, such that WMAgent no longer converts PFNs from the FrameworkJobReport into a PFNArray in the WMArchive document.

If we think a filename-based scenario is no longer possible with ES WMArchive, perhaps we should also drop the LFNArray field(s)?

vkuznet commented 2 years ago

@amaltaro I made the necessary change in PR #10998; PFNArray and PFNArrayRef no longer appear in the final document. I doubt we need to change the schema, though. The schema only lists possible attributes; it does not require them. Moreover, I don't recall any use case where we need PFNArray. The only use case we cover is the search for logArchive and logCollect tarballs on HDFS, and that has nothing to do with ES, since MONIT routes the data to both destinations. I suggest we only drop PFNArray and collect more feedback from the Monitoring group, i.e. if they still experience issues with large documents we should look at a concrete example and decide whether we keep the LFNs. If you think that we no longer need to cover the aforementioned use case, we can drop LFNArray too, which will certainly reduce the final document size. In any case, neither LFNArray nor PFNArray is used in any monitoring dashboards, since this information is useless for dashboards.
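
For illustration only (this is not the code from PR #10998), a minimal sketch of what dropping the PFN fields from a document amounts to; the example document and field layout are assumptions based on the field names mentioned in this thread:

    # Illustrative sketch: strip PFN-related fields from a WMArchive-like
    # document before it is shipped out. Field names taken from this thread;
    # the document structure here is a simplified assumption.
    def strip_pfn_fields(doc):
        """Return a copy of the document without PFN information."""
        slim = dict(doc)
        slim.pop("PFNArray", None)
        slim.pop("PFNArrayRef", None)
        return slim

    doc = {"wmaid": "abc123",
           "LFNArray": ["/store/data/file.root"],
           "PFNArray": ["root://site.example//store/data/file.root"],
           "PFNArrayRef": ["pfn"]}
    print(sorted(strip_pfn_fields(doc)))  # PFNArray/PFNArrayRef are gone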

amaltaro commented 2 years ago

@vkuznet from our last discussion on the WMArchive subject, I understood that its data no longer goes to HDFS, that it now only goes to one of the MonIT storage backends (ES? Influx?), and that any use case based on file names would no longer be valid. Is that correct?

Given that the schema only lists possible attributes, it indeed makes sense to simply leave it as it is.

vkuznet commented 2 years ago

Alan, we send data to MONIT, and MONIT redirects the data to different sinks. In the case of WMArchive, the data goes to ES and HDFS. ES has a 1-month retention policy, while HDFS has 13 months. For ES, the LFN/PFN arrays are not useful since there is nothing you can do with them, but for HDFS they may be, and I pointed to our previous use case. However, we only used LFNs and not PFNs, so the PFNs can simply be dropped. I think this case can be closed once we merge this PR.

amaltaro commented 2 years ago

This one was missing from the "Work in progress" column; moving it there.

leggerf commented 2 years ago

@amaltaro @vkuznet is there an ETA for the deployment of this fix?

vkuznet commented 2 years ago

@leggerf , there is progress on this issue: I prepared PR #10998 to address it, but it is under review by Alan. Once Alan approves it, we will need additional time to propagate the change to all WMAgents (again, Alan can tell you more about how long that will take).

haozturk commented 3 months ago

Hi all, coming back to this issue in 2024. At the moment there is a use case in CMS DM where we need to know the PFNArray of the jobs; it is intended to be consumed by rucio-tracers [1]. The end goal is to be able to spot suspicious/corrupt replicas and re-transfer them from tape automatically. We'd like to use the relevant error messages from WMArchive for this, but at the moment it is not possible to know which PFN(s) WMA failed to read. I'd like to re-discuss the possibility of including PFNArray in WMArchive records. In order to keep the amount of data it will generate under control, I can suggest two things:

  1. Drop LFNArray instead. Effectively, we can obtain it from PFNArray if needed. This will require changes in rucio-tracers, as it relies on LFNArray; I can handle that part if we take this route.
  2. As I understand it, what inflates the data we send to MONIT is the pileup files we include. Pileup files are not of interest for this use case, as failures to read them are mostly transient due to the way we read them, so we can skip them here.

Probably the first option is better, since in the future a need might arise to have pileup files in WMArchive. Please let me know what you think @vkuznet @amaltaro @todor-ivanov

Many thanks!

[1] https://github.com/dmwm/rucio-tracers

vkuznet commented 3 months ago

@haozturk , I understand the desire to get proper info from WMArchive, but here we face one particular problem: the arrays of LFNs and PFNs are not in any regard useful for monitoring, i.e. you can't plot them. Your use case is not related to monitoring. Therefore, I think we should revisit the WMArchive use cases. I see two of them:

  1. Data-ops monitoring, e.g. in particular this dashboard https://monit-grafana.cern.ch/d/u_qOeVqZk/wmarchive-monit?orgId=11
  2. Using WMArchive data for debugging purposes, e.g. as you described

These two use cases demand different data in WMArchive. The monitoring data are light, while the debugging data can be heavy. Since we send data to MONIT, we must comply with the MONIT restrictions, currently set to 30MB per document. The LFNs, PFNs, lumis, etc. are very large objects within WMArchive, and moreover we do not know their size in advance, as it really depends on the size of the dataset, blocks, etc. Therefore it is possible at any time, and at any threshold set on the MONIT side, to exceed the limit. And this is a problem.

To properly resolve this, I think we should have two different streams:

If MONIT is to be involved in accepting large documents, then we can't use it reliably without cutting down our documents. Therefore, we must discuss where and how to send large data streams. One possibility is to compress the WMArchive content before sending it to MONIT for the MONIT->HDFS sink; this will work, but on the client side you will need to decompress it (see the sketch below). If that is not acceptable, then the best options would be either to dump WMArchive documents to ORACLE as JSON blobs or to dump them to HDFS directly (in which case we need to provide direct access to HDFS from WMArchive).
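
A minimal sketch of the compression idea, using only the Python standard library; the document content below is made up for illustration, and the actual transport to MONIT/AMQ is out of scope:

    import gzip
    import json

    # Hypothetical WMArchive document with a large LFNArray (illustration only).
    doc = {"wmaid": "abc123",
           "LFNArray": ["/store/mc/Sample/FILE-%05d.root" % i for i in range(10000)]}

    raw = json.dumps(doc).encode("utf-8")
    packed = gzip.compress(raw)          # what would be sent toward the MONIT->HDFS sink
    print(len(raw), "->", len(packed))   # repetitive LFN lists compress very well

    # The consumer side (reading back from HDFS) needs the inverse step:
    restored = json.loads(gzip.decompress(packed).decode("utf-8"))
    assert restored == doc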

To sum up, we have mixed different use cases and hit the internal MONIT constraint on document size. To resolve this we should decouple the data streams and work on a proper solution for storing the large data stream.

haozturk commented 3 months ago

Thanks a lot @vkuznet for the nice summary and suggestion. I listed the current and new use cases and the fields they require [1]. Unless I'm missing something, the data required by the first use case is already available in the OpenSearch monit_prod_condor_raw_metric index. We just need to create a new dashboard pointing at that data source and decommission the WMArchive Grafana dashboard; I'll coordinate this. Most of the fields that this use case requires are also required by the other use cases, so I don't think we can trim the WMArchive data significantly. I'm trying to understand your suggestion of sending the data to HDFS: can we still make WMArchive send data to AMQ if we take this route? The data sent to AMQ is consumed immediately, so I suspect it shouldn't be an issue even if we add PFN data. Given all this, I'm proposing the following roadmap:

  1. Check if there is any use case other than the ones I listed below. If not, continue.
  2. Create an identical Grafana dashboard to replace the WMArchive dashboard, using the condor index in OpenSearch.
  3. Add PFN data to WMArchive.
  4. Start sending WMArchive data to HDFS and stop sending it to OpenSearch. Keep sending it to AMQ.
  5. Decommission the WMArchive Grafana dashboard.

How does this plan sound to you?

[1]

1. Existing use case: Monitoring for P&R:

  1. Campaign
  2. meta_data.jobtype
  3. meta_data.host
  4. meta_data.jobstate
  5. steps.site
  6. steps.errors.exitCode
  7. wmats
  8. wmaid
  9. meta_data.wn_name
  10. task

2. Existing use case: Rucio-tracers:

  1. LFN data (LFNArray, LFNArrayRef, etc.)
  2. FallbackFiles
  3. Metadata:
    1. Ts
    2. JobType
    3. WnName
  4. Steps:
    1. Input
      1. Lfn
      2. Events
      3. GUID
    2. Site
    3. Errors
      1. Details
      2. ExitCode
      3. Type

3. New use case: Extending rucio-tracers with PFNs and ErrorMessages

  1. Everything the previous use case has
  2. PFN info (the fields removed by this PR: https://github.com/dmwm/WMCore/pull/10998/files)

vkuznet commented 3 months ago

@haozturk , not everything is as simple as you describe.

  1. Integrate HDFS into WMArchive such that the latter writes directly to it, like to any other filesystem, e.g.
    data -> WMArchive -> HDFS

    Since WMArchive runs on k8s, this means its image (which is very tiny) would have to be modified to include the HDFS stack, which by itself is very large and pulls in lots of dependencies. I would rather avoid that for many reasons, mostly maintenance, but I also consider it unreliable, since HDFS requires a valid Kerberos ticket (meaning WMArchive must support it), it requires open ports to the HDFS server, and there is the huge (Java-based) HDFS stack to carry around.

  2. Add a new service on a dedicated VM which will consume the data and pass it on to HDFS, e.g.
    data -> WMArchive -> NewProxyServer -> HDFS
        k8s           VM

    In this scenario NewProxyServer runs on a dedicated node (a VM with HDFS access), provides an API to consume JSON data, and writes it directly to HDFS (a minimal sketch of such a proxy is given after this list).

On top of that, WMArchive would have to be modified to separate the data streams going to AMQ and to either HDFS or the NewProxyServer.
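
A minimal sketch of what such a NewProxyServer could look like, using only the Python standard library; the port and output path are made-up placeholders, and the actual HDFS write (which would need Kerberos and an HDFS client) is represented here by appending to a local file:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Placeholder for the real HDFS write; here we just append JSON lines locally.
    OUTPUT_FILE = "/tmp/wmarchive_docs.jsonl"

    class ProxyHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            try:
                doc = json.loads(body)       # validate the incoming JSON document
            except ValueError:
                self.send_response(400)
                self.end_headers()
                return
            with open(OUTPUT_FILE, "a") as out:
                out.write(json.dumps(doc) + "\n")
            self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        # Hypothetical port; WMArchive would POST its documents here.
        HTTPServer(("0.0.0.0", 8300), ProxyHandler).serve_forever()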

Bottom line, you need to identify

amaltaro commented 3 months ago

Let me twist this discussion a bit.

Before we say that we cannot send large documents to MonIT, and keeping in mind that a bug in parsing error messages and writing them to the WMArchive documents has recently been fixed but is not fully deployed yet, I think we need to address the following:

a) What is the definition of a large document? What is our average document size (if possible, excluding the bug above)?
b) Instead of providing PFNs in these WMArchive documents, I think we should keep providing LFNs (for primary data) and, in addition, provide the PFN prefix, such that the client can build the PFNs as needed (see the sketch below).
c) This comment gives us a pretty good idea of what is desired in these documents and what is not. Hence, I am in favor of revisiting the construction of the WMArchive document in WMAgent and ensuring that we provide useful and needed information, dropping anything that we consider irrelevant now or for the near future.

If we can keep a single data source/flow in MonIT, I think that would be the best option. At least to me, it is already extremely hard to find where things are stored, which indexes and technologies are used, what the retention policies are, etc. for the already existing monitoring data.
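
A minimal sketch of the LFN-plus-prefix idea in item b); the prefix value and field names are illustrative assumptions, not the actual WMArchive schema:

    # Illustrative only: with LFNs plus a site-specific PFN prefix in the
    # document, the client can rebuild full PFNs on its side.
    doc = {
        "LFNArray": ["/store/data/Run2023/MET/AOD/file1.root",
                     "/store/data/Run2023/MET/AOD/file2.root"],
        "PFNPrefix": "root://xrootd.example.cern.ch/",   # hypothetical field
    }

    pfns = [doc["PFNPrefix"].rstrip("/") + lfn for lfn in doc["LFNArray"]]
    for pfn in pfns:
        print(pfn)

    # Reverse direction (what rucio-tracers effectively needs): recover the LFN
    # from a PFN by keeping everything from "/store/" onwards.
    lfns = [pfn[pfn.index("/store/"):] for pfn in pfns]
    assert lfns == doc["LFNArray"]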

vkuznet commented 3 months ago

Alan, I think that regardless of your suggestion we cannot claim with 100% certainty that we will never exceed the MONIT limit (30MB) if we keep LFNs or PFNs, since our dataset sizes are unknown. We can probably find out how many LFNs/PFNs fit into the MONIT limit, but that's it. The question is whether we want 100% of WMArchive docs to be present in MONIT or not. If the answer is yes, then adding LFNs/PFNs will sooner or later hit the limit imposed by MONIT, i.e. it is unavoidable, and the document will then be rejected.

That said, it is also very clear how to define a large document: it is a document whose size is more than the MONIT limit, currently 30MB. Now, here is a simple estimate of the number of LFNs which will fit into 30MB:

>>> lfn="/store/mc/Summer11/ZMM/GEN-SIM/DESIGN42_V11_428_SLHC1-v1/0003/02ACAA1A-9F32-E111-BB31-0002C90B743A.root"
>>> import json
>>> lfns = [lfn for _ in range(0,10000)]
>>> fd=open('/tmp/vk/lfns.json', 'w')
>>> fd.write(json.dumps(lfns))
1070000

So, 10k LFNs fits in 1MB, therefore 30MB will be reached with 30k LFNs or so. You may browse DBS to find out how many datasets and what kind of datasets have such number of LFNs.

amaltaro commented 3 months ago

So, 10k LFNs fits in 1MB, therefore 30MB will be reached with 30k LFNs or so.

it would be ~300k LFNs.

Honestly, I don't think it is impossible to see 1000s of files being read by a single job. However, 99.999% of the jobs will read no more than a few files, exceptions go to:

vkuznet commented 3 months ago

Ok, thanks for the correction, a simple mistake on my side. 300K LFNs seems large but not impossible to me. I don't know which use cases should or should not be covered, though; I hope @haozturk can provide more insight into where debugging is required. And, according to the structure listed above in @haozturk's comment, we have LFN data in several places (LFNArray, LFNArrayRef, FallbackFiles, Input.Lfn), so the 300K effectively scales down to a lower number per field because of these requirements.

haozturk commented 3 months ago

Thanks for the insights @vkuznet and @amaltaro . We're not looking for a perfect solution, so 100% coverage isn't required. It is okay to skip jobs whose records will not fit within the MONIT limit; as I understand it, this is a rare case.
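
For illustration, a minimal sketch of such a size gate before upload, assuming a hypothetical send() callable and taking the 30MB MONIT limit mentioned above as the threshold:

    import json

    MONIT_LIMIT = 30 * 1024 * 1024   # 30MB limit mentioned in this thread

    def maybe_send(doc, send):
        """Send the document only if its serialized size fits the MONIT limit."""
        payload = json.dumps(doc).encode("utf-8")
        if len(payload) > MONIT_LIMIT:
            # Rare case: skip (or trim) oversized records instead of having
            # MONIT reject the whole message.
            return False
        send(payload)
        return True

    # Example with a dummy sender that just reports the payload size.
    maybe_send({"wmaid": "abc123", "LFNArray": ["/store/a.root"]},
               lambda payload: print("sent", len(payload), "bytes"))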

Regarding Alan's comment (https://github.com/dmwm/WMCore/issues/10879#issuecomment-2125043864), we don't care about unmerged files at the moment, but I'd keep them in case a new use case concerning them arises in the future. For the current use case, we don't care about the log files that LogCollect jobs report, either.

amaltaro commented 2 months ago

Given that this issue was created because of problems affecting production, and that the WMAgent/WMArchive interface has been stable for a few months, I decided to create a "planned" ticket to address what is left in WMArchive: https://github.com/dmwm/WMCore/issues/12043

We can consider that ticket for the upcoming quarters, if desired. Having said that, I am closing this issue out, but please let me know if I missed anything. Thanks!