Maximilian, sorry for the late reply, I was on vacation and didn't have regular internet access.
WMArchive collects data from both the WMAgent and CRAB systems. The former is located in the hdfs:///cms/wmarchive/avro/fwjr HDFS area, while the latter is in hdfs:///cms/wmarchive/avro/crab.
JobMonitoring, on the other hand, collects data from all CMS jobs, including WMAgent, CMSSW, and CRAB jobs.
It may be that you only see a portion of the data due to different accounting algorithms. For example, in WMArchive we store failed jobs, while I don't know whether JobMonitoring counts them. Since I don't know how you perform your job selection I can't comment further. And even if you showed me your CMSSpark/WMArchive code we may still not know the full truth, since I don't know exactly how the JobMonitoring documents are collected.
One thing you can check is to look at your WMArchive docs and find out their status. For that you'll need to look up and store the jobtype and jobstatus attributes of the documents, see https://github.com/dmwm/WMArchive/blob/master/src/python/WMArchive/Schemas/FWJRProduction.py#L11
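For illustration, a minimal sketch of such a check, assuming the avro records expose top-level jobtype and jobstatus fields as in that schema, and reusing CMSSpark's spark_context/avro_rdd helpers (the day partition path is only an example):

```python
# Sketch: tabulate WMArchive records by (jobtype, jobstatus).
# Assumes top-level 'jobtype'/'jobstatus' fields per the FWJRProduction
# schema linked above; the day partition path is only an example.
from CMSSpark.spark_utils import avro_rdd, spark_context
from pyspark.sql import HiveContext

ctx = spark_context('cms', yarn=True, verbose=False)
sql_context = HiveContext(ctx)
paths = ['hdfs:///cms/wmarchive/avro/fwjr/2018/06/01']  # example day partition
rdd = avro_rdd(ctx, sql_context, paths, '', False)
counts = rdd.map(lambda r: ((r.get('jobtype'), r.get('jobstatus')), 1)) \
            .reduceByKey(lambda x, y: x + y) \
            .collect()
for (jobtype, jobstatus), num in sorted(counts):
    print('%s/%s: %d' % (jobtype, jobstatus, num))
ctx.stop()
```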
Best, Valentin.
Maximilian Stemmer-Grabow notifications@github.com wrote:
Hi,
I am using CMSSpark to collect job reports from the JobMonitoring and WMArchive (FWJR) data. I extract reports from jobs run at GridKa to use them to create simulation models for the site.
When matching the reports from these data sets, I noticed that some reports seem to be missing from the JobMonitoring data set I collected. I am aware that WMArchive data only include jobs submitted via WMAgent, but was previously under the impression that the JobMonitoring data should include job reports for all jobs (also including CRAB jobs).
The scripts I use for collection are based on the provided scripts for these data sets, but configured to collect unaggregated reports, limited to the GridKa site (T1_DE_KIT).
I am not sure which reports exactly I am missing from my JobMonitoring data set, but it does not seem tied to the workflow (most workflows where data seems to be missing exist in both data sets). These are the workflows where the difference between the number of jobs in both data sets is most striking, based on all job reports from June:
| # | TaskMonitorId | wmarchive_count | jobmonitoring_count | diff |
|---|---------------|-----------------|---------------------|------|
| 0 | pdmvserv_task_HIG-RunIISummer18wmLHEGS-00012__v1_T_180606_161154_7324 | 34283 | 73 | 34210 |
| 1 | prozober_task_SMP-PhaseIISummer17GenOnly-00008__v1_T_180607_113950_5438 | 13820 | 15 | 13805 |
| 2 | prozober_task_SMP-RunIISummer15wmLHEGS-00224__v1_T_180611_192653_6323 | 13133 | 4 | 13129 |
| 3 | prozober_task_SMP-RunIISummer15wmLHEGS-00226__v1_T_180611_192517_1020 | 12549 | 798 | 11751 |
| 4 | pdmvserv_task_SMP-PhaseIISummer17GenOnly-00021__v1_T_180608_101222_6299 | 11517 | 34 | 11483 |
| 5 | pdmvserv_task_SMP-RunIISummer15wmLHEGS-00225__v1_T_180416_194002_549 | 9113 | 1193 | 7920 |
| 6 | prozober_task_SMP-RunIISummer15wmLHEGS-00222__v1_T_180611_192556_7810 | 4285 | 142 | 4143 |
| 7 | pdmvserv_task_SMP-RunIISummer15wmLHEGS-00227__v1_T_180416_194004_2476 | 7852 | 4665 | 3187 |
| 8 | pdmvserv_task_SMP-PhaseIISummer17GenOnly-00022__v1_T_180607_163200_4994 | 2559 | 35 | 2524 |
| 9 | asikdar_RVCMSSW_10_2_0_pre6HydjetQ_B12_5020GeV_2018_PU__180629_181916_5151 | 2434 | 415 | 2019 |
| 10 | pdmvserv_task_SMP-RunIISummer15wmLHEGS-00223__v1_T_180416_193956_2123 | 1697 | 2 | 1695 |
| 11 | pdmvserv_task_HIG-RunIISummer18wmLHEGS-00004__v1_T_180604_003844_1884 | 2612 | 965 | 1647 |
| 12 | pdmvserv_task_HIG-RunIIFall17wmLHEGS-01913__v1_T_180608_124404_6320 | 2089 | 447 | 1642 |
| 13 | vlimant_ACDC0_task_JME-RunIISummer18wmLHEGS-00002__v1_T_180619_140519_3271 | 1815 | 510 | 1305 |
| 14 | pdmvserv_task_HIG-RunIIFall17wmLHEGS-00917__v1_T_180606_161025_1459 | 1498 | 226 | 1272 |
| 15 | pdmvserv_task_EGM-RunIISummer18GS-00016__v1_T_180605_170606_3059 | 1444 | 222 | 1222 |
| 16 | pdmvserv_task_TRK-RunIISpring18wmLHEGS-00001__v1_T_180524_151259_1146 | 1483 | 292 | 1191 |
| 17 | pdmvserv_task_EGM-RunIISummer18GS-00024__v1_T_180605_170611_5611 | 1438 | 388 | 1050 |
| 18 | pdmvserv_task_HIN-HiFall15-00303__v1_T_180502_124221_8353 | 1802 | 834 | 968 |
| 19 | pdmvserv_task_SMP-RunIIFall17wmLHEGS-00021__v1_T_180415_203914_1639 | 1194 | 259 | 935 |
The data sets are collected from the following HDFS locations:
- JobMonitoring: hdfs:///project/awg/cms/jm-data-popularity/avro-snappy
- WMArchive/FWJR: hdfs:///cms/wmarchive/avro/fwjr
Is there anything I simply may have missed about the data sets? Any pointers are very much appreciated.
If any additional information is needed, I’ll be happy to provide it.
Valentin,
thank you for the pointers. I'll still have to check about the WMArchive status info.
As I understand it, failed jobs are also included in the JobMonitoring data (which the exit code field in the data set also suggests). I compared my results with the statistics from this CMS Dashboard page. I assumed it uses the same JobMonitoring data set I should be getting from the HDFS archive, but the dashboard's job counts are much higher than those in my JobMonitoring data set.
In comparison, the WMArchive data matches the status page much better. For example, for the page linked above (GridKa, 10.08.2018), the dashboard reports 37887 completed jobs, but I am only getting 10980 entries in the JobMonitoring data set, compared to 30945 records from WMArchive (a gap that cannot be explained by failed jobs alone).
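For reference, a sketch of how the per-workflow comparison can be done with the collected outputs (the CSV paths and the WMArchive task-name column, here called 'task', are placeholder names; sql_context is the HiveContext from the script below):

```python
# Sketch: compare per-workflow job counts between the two data sets.
# 'jm_out'/'wm_out' and the WMArchive 'task' column are placeholders
# for the actual outputs of the collection scripts.
from pyspark.sql import functions as F

csv_fmt = 'com.databricks.spark.csv'
jm_df = sql_context.read.format(csv_fmt).option('header', 'true').load('jm_out')
wm_df = sql_context.read.format(csv_fmt).option('header', 'true').load('wm_out')

jm_counts = jm_df.groupBy('TaskMonitorId') \
                 .agg(F.count('*').alias('jobmonitoring_count'))
wm_counts = wm_df.groupBy('task') \
                 .agg(F.count('*').alias('wmarchive_count'))

diff = wm_counts.join(jm_counts,
                      wm_counts['task'] == jm_counts['TaskMonitorId'],
                      'outer') \
                .withColumn('diff',
                            F.col('wmarchive_count') - F.col('jobmonitoring_count')) \
                .orderBy(F.desc('diff'))
diff.show(20, truncate=False)
```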
For the JobMonitoring data, I am now using a very simple CMSSpark script while trying to find reasons why reports might be missing. It only filters the records by site and otherwise extracts all columns:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Spark script to parse JobMonitoring records on HDFS.
"""
import calendar
import datetime
import hashlib
import re
import time
from CMSSpark.conf import OptionParser
from CMSSpark.schemas import schema_jm
from CMSSpark.spark_utils import avro_rdd
from CMSSpark.spark_utils import spark_context
from CMSSpark.utils import info
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def jm_date(date):
    """Convert given date into the JobMonitoring date partition format"""
    if not date:
        # default to yesterday; 'day=' (not 'date=') matches the HDFS partition layout
        return time.strftime("year=%Y/month=%-m/day=%-d", time.gmtime(time.time() - 60 * 60 * 24))
    if len(date) != 8:
        raise Exception("Given date %s is not in YYYYMMDD format" % date)
    year = date[:4]
    month = int(date[4:6])
    day = int(date[6:])
    return 'year=%s/month=%s/day=%s' % (year, month, day)
def jm_date_unix(date):
    """Convert JobMonitoring date partition into a UNIX timestamp"""
    # NB: time.mktime interprets the tuple in local time, unlike the gmtime-based helpers above
    return time.mktime(time.strptime(date, 'year=%Y/month=%m/day=%d'))
# Global patterns
PAT_YYYYMMDD = re.compile(r'^20[0-9][0-9][0-1][0-9][0-3][0-9]$')
PAT_YYYY = re.compile(r'^20[0-9][0-9]$')
PAT_MM = re.compile(r'^(0[1-9]|1[012])$')
PAT_DD = re.compile(r'^(0[1-9]|[12][0-9]|3[01])$')
def dateformat(value):
"""Return seconds since epoch for provided YYYYMMDD or number with suffix 'd' for days"""
msg = 'Unacceptable date format, value=%s, type=%s,' \
% (value, type(value))
msg += " supported format is YYYYMMDD or number with suffix 'd' for days"
value = str(value).lower()
if PAT_YYYYMMDD.match(value): # we accept YYYYMMDD
if len(value) == 8: # YYYYMMDD
year = value[0:4]
if not PAT_YYYY.match(year):
raise Exception(msg + ', fail to parse the year part, %s' % year)
month = value[4:6]
date = value[6:8]
ddd = datetime.date(int(year), int(month), int(date))
else:
raise Exception(msg)
        return calendar.timegm(ddd.timetuple())
elif value.endswith('d'):
try:
days = int(value[:-1])
except ValueError:
raise Exception(msg)
return time.time() - days * 24 * 60 * 60
else:
raise Exception(msg)
def range_dates(trange):
"""Provides dates range in HDFS format from given list"""
out = [jm_date(str(trange[0]))]
if trange[0] == trange[1]:
return out
tst = dateformat(trange[0])
while True:
tst += 24 * 60 * 60
tdate = time.strftime("%Y%m%d", time.gmtime(tst))
out.append(jm_date(tdate))
if str(tdate) == str(trange[1]):
break
return out
def hdfs_path(hdir, dateinput):
"""Construct HDFS path for JM data"""
dates = dateinput.split('-')
if len(dates) == 2:
return ['%s/%s' % (hdir, d) for d in range_dates(dates)]
dates = dateinput.split(',')
if len(dates) > 1:
return ['%s/%s' % (hdir, jm_date(d)) for d in dates]
return ['%s/%s' % (hdir, jm_date(dateinput))]
def run(date, fout, hdir, yarn=None, verbose=None):
"""
Main function to run pyspark job. It requires a schema file, an HDFS directory
with data and optional script with mapper/reducer functions.
"""
# Define a spark context (main object which allows to communicate with Spark)
ctx = spark_context('cms', yarn, verbose)
sql_context = HiveContext(ctx)
def filter_record(row, conditions=None):
"""Filter a record based on conditions for its fields."""
if not conditions:
return True
        for column, condition in conditions.items():  # items() works on both Python 2 and 3
if row.get(column) != condition:
return False
return True
rdd = avro_rdd(ctx, sql_context, hdfs_path(hdir, date), '', verbose)
if verbose:
print("Number of entries before filtering: %s" % str(rdd.count()))
filter_conditions = {'SiteName': 'T1_DE_KIT'}
filtered_rdd = rdd.filter(lambda row: filter_record(row, filter_conditions))
if verbose:
print("Number of entries after filtering: %s" % str(filtered_rdd.count()))
print(filtered_rdd.take(1))
jdf = sql_context.createDataFrame(filtered_rdd, schema=schema_jm())
# Creates a unique identifier by hashing the provided columns with MD5
def create_identifier(job_id, start, stop):
h = hashlib.md5()
col_list = [job_id, start, stop]
for value in col_list:
h.update(str(value).encode('utf-8'))
return h.hexdigest()
# Define function for computing the unique identifier and add to data frame
identifier_udf = udf(create_identifier, StringType())
jdf = jdf.withColumn('UniqueID',
identifier_udf(jdf['JobId'],
jdf['StartedRunningTimeStamp'],
jdf['JobExecExitTimeStamp']))
if verbose:
print("Number of entries in data frame: %s" % str(jdf.count()))
if fout:
jdf.write.format("com.databricks.spark.csv").option("header", "true").save(fout)
        # optional: coalesce to a single partition before writing so the header appears only once,
        # e.g. jdf.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(fout)
ctx.stop()
@info
def main():
"""Main Function"""
# Handle script options
optmgr = OptionParser('jm')
opts = optmgr.parser.parse_args()
print("Input arguments: %s" % opts)
default_hdir = 'hdfs:///project/awg/cms/jm-data-popularity/avro-snappy'
# default_hdir = 'hdfs:///project/awg/cms/job-monitoring/avro-snappy'
hdir = opts.hdir if opts.hdir else default_hdir
# Run main script
run(opts.date, opts.fout, hdir, opts.yarn, opts.verbose)
if __name__ == '__main__':
    main()
```
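For completeness: the script is invoked like the other CMSSpark scripts, and assuming the usual options exposed by OptionParser (matching the opts attributes used in main above), a single-day run would look like `python jm_plain.py --date=20180810 --fout=<output path> --yarn --verbose` (jm_plain.py being a placeholder name for the script file).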
Best, Maximilian
Valentin, another thing I was previously unsure about is the exact difference between the two HDFS locations with JobMonitoring information:
- hdfs:///project/awg/cms/jm-data-popularity/avro-snappy
- hdfs:///project/awg/cms/job-monitoring
As far as I understand it, jm-data-popularity contains only a subset of the fields contained in job-monitoring, but should otherwise not be connected to the question concerning the potentially missing reports?
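One quick way to check the field overlap between the two locations (a sketch, reusing ctx, sql_context and avro_rdd from the script in my previous comment; the day partition is an arbitrary example and assumes the same year=/month=/day= layout for both areas):

```python
# Sketch: diff the field sets of a sample record from each location.
base = 'hdfs:///project/awg/cms'
day = 'year=2018/month=6/day=1'  # arbitrary example day
pop_rdd = avro_rdd(ctx, sql_context,
                   ['%s/jm-data-popularity/avro-snappy/%s' % (base, day)], '', False)
all_rdd = avro_rdd(ctx, sql_context,
                   ['%s/job-monitoring/avro-snappy/%s' % (base, day)], '', False)
pop_keys = set(pop_rdd.take(1)[0].keys())
all_keys = set(all_rdd.take(1)[0].keys())
print('only in job-monitoring:     %s' % sorted(all_keys - pop_keys))
print('only in jm-data-popularity: %s' % sorted(pop_keys - all_keys))
```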
Best, Maximilian
Maximilian, a couple of questions:
- do you run over the fwjr collection of WMArchive, or over both (fwjr and crab)?
- you seem to filter records for site T1_DE_KIT, but the site names used by different systems (DBS, PhEDEx, transfer) may have additional variants, e.g. T1_DE_KIT_Buffer, T1_DE_KIT_Disk and T1_DE_KIT_MSS. Can you try to match all names via a pattern?
Best, Valentin.
Maximilian, here you need to truly understand how data are fed into these areas. As far as I know these data originate from the following scripts: https://gitlab.cern.ch/awg/awg-ETL-crons/blob/master/sqoop/cms-jm.sh and https://gitlab.cern.ch/awg/awg-ETL-crons/blob/master/sqoop/jm-cms-data-pop.sh, which are run by CERN IT to extract data records from the CMSSW popularity Oracle database. Assuming that the extraction step is doing its job correctly, you need to ask who puts data into this database, and how.
I'm CC'ing Carl, who may be able to answer this question.
Best, Valentin. P.S. Carl, I don't know your GitHub user name; could you please respond to the GitHub ticket directly so that everybody stays in the loop.
Valentin,
- do you run over the fwjr collection of WMArchive, or over both (fwjr and crab)?
I am for now only running over the FWJR collection (I previously didn't know that CRAB reports were also collected).
- you seem to filter records for site T1_DE_KIT, but the site names used by different systems (DBS, PhEDEx, transfer) may have additional variants, e.g. T1_DE_KIT_Buffer, T1_DE_KIT_Disk and T1_DE_KIT_MSS. Can you try to match all names via a pattern?
I only matched the exact T1_DE_KIT identifier, as I tested for alternative names a while ago by comparing the distinct site entries in the unfiltered data set and only found these unsuffixed names. I will revisit this and try to find out whether I am missing suffixed site names after all.
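A pattern-based filter along these lines would be a small change to the script above (a sketch; the regex is meant to also catch the suffixed variants, replacing the exact-match filter_conditions):

```python
import re

# Matches T1_DE_KIT as well as suffixed variants such as
# T1_DE_KIT_Buffer, T1_DE_KIT_Disk and T1_DE_KIT_MSS.
PAT_SITE = re.compile(r'^T1_DE_KIT(_\w+)?$')

def filter_record_by_site(row, pattern=PAT_SITE):
    """Keep records whose SiteName matches the site pattern."""
    return bool(pattern.match(row.get('SiteName') or ''))

filtered_rdd = rdd.filter(filter_record_by_site)
```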
For the example above (jobs from 10.08.2018 only), these are the distinct site names present in the JobMonitoring data set (with T1_DE_KIT appearing only unsuffixed):
```
+--------------------+
| SiteName|
+--------------------+
| T2_UK_London_IC|
| T2_RU_INR|
| T2_CH_CERN_HLT|
| T2_RU_IHEP|
| unknown|
| T1_ES_PIC|
| T2_ES_CIEMAT|
| T2_IT_Pisa|
| T2_FI_HIP|
| T2_FR_GRIF_IRFU|
| T3_UK_London_RHUL|
| T1_FR_CCIN2P3|
| T2_US_Nebraska|
| T2_IT_Bari|
| T2_CN_Beijing|
| T2_BE_UCL|
| T2_BR_SPRACE|
| T3_KR_KNU|
| T3_US_UCR|
| T2_FR_IPHC|
| T2_RU_JINR|
| T2_KR_KISTI|
| T3_TW_NCU|
| T2_GR_Ioannina|
| T3_US_Baylor|
| T2_US_UCSD|
| T3_TW_NTU_HEP|
| T2_TR_METU|
| T1_IT_CNAF|
| T2_US_Caltech|
| T2_CH_CSCS|
| T2_EE_Estonia|
|T3_CH_CERN_HelixN...|
| T2_CH_CERN|
| T2_US_Purdue|
| T2_IT_Legnaro|
| T2_US_MIT|
| T2_HU_Budapest|
| T2_KR_KNU|
| T3_IT_Trieste|
| T1_RU_JINR|
| T3_US_Rutgers|
| T3_US_UMiss|
| T2_IN_TIFR|
| T2_PL_Warsaw|
| T2_PL_Swierk|
| T2_UK_SGrid_RALPP|
| T3_US_PSC|
| T3_US_TAMU|
| T2_UA_KIPT|
| T2_BE_IIHE|
| T3_IN_TIFRCloud|
| T3_US_Colorado|
| T2_PT_NCG_Lisbon|
| T2_IT_LegnaroTest|
| T2_US_Florida|
| T2_DE_RWTH|
| T2_UK_SGrid_Bristol|
| T2_RU_ITEP|
| T2_ES_IFCA|
| T1_UK_RAL|
| T2_FR_CCIN2P3|
| T2_PK_NCP|
| T2_DE_DESY|
| T3_US_OSG|
| T3_UK_SGrid_Oxford|
| T3_FR_IPNL|
| T2_UK_London_Brunel|
| T1_DE_KIT|
| T2_IT_Rome|
| T3_UK_ScotGrid_GLA|
| T1_US_FNAL|
| T2_US_Vanderbilt|
| T3_US_NERSC|
| T2_US_Wisconsin|
| T2_FR_GRIF_LLR|
| T2_CH_CSCS_HPC|
+--------------------+
```
Thank you again for your help!
Best, Maximilian
The Oracle CMSSW Data Popularity database is called CMS_CMSSW_POPULARITY@CMSR. It is written to by a DB consumer service named cmssw-collector running on vocms044. Details are here: https://twiki.cern.ch/twiki/bin/view/CMS/CMSSW-Popularity
---Carl
I'm joining this thread, so you can ask me any additional questions here.
---Carl
Do we need to keep this ticket open?