dmwm / CMSSpark

General purpose framework to run CMS experiment workflows on HDFS/Spark platform
MIT License

Missing job reports in JobMonitoring data? #13

Closed mxsg closed 5 years ago

mxsg commented 6 years ago

Hi,

I am using CMSSpark to collect job reports from the JobMonitoring and WMArchive (FWJR) data. I extract reports from jobs run at GridKa to use them to create simulation models for the site.

When matching the reports from these data sets, I noticed that some reports seem to be missing from the JobMonitoring data set I collected. I am aware that WMArchive data only include jobs submitted via WMAgent, but was previously under the impression that the JobMonitoring data should include job reports for all jobs (also including CRAB jobs).

The scripts I use for collection are based on the provided scripts for these data sets, but configured to collect unaggregated reports, limited to the GridKa site (T1_DE_KIT).

I am not sure exactly which reports are missing from my JobMonitoring data set, but the gap does not seem to be tied to specific workflows (most workflows with apparently missing data exist in both data sets). These are the workflows where the difference between the number of jobs in the two data sets is most striking, based on all job reports from June:

                                                                 TaskMonitorId  wmarchive_count  jobmonitoring_count     diff
0        pdmvserv_task_HIG-RunIISummer18wmLHEGS-00012__v1_T_180606_161154_7324          34283.0                 73.0  34210.0
1      prozober_task_SMP-PhaseIISummer17GenOnly-00008__v1_T_180607_113950_5438          13820.0                 15.0  13805.0
2        prozober_task_SMP-RunIISummer15wmLHEGS-00224__v1_T_180611_192653_6323          13133.0                  4.0  13129.0
3        prozober_task_SMP-RunIISummer15wmLHEGS-00226__v1_T_180611_192517_1020          12549.0                798.0  11751.0
4      pdmvserv_task_SMP-PhaseIISummer17GenOnly-00021__v1_T_180608_101222_6299          11517.0                 34.0  11483.0
5         pdmvserv_task_SMP-RunIISummer15wmLHEGS-00225__v1_T_180416_194002_549           9113.0               1193.0   7920.0
6        prozober_task_SMP-RunIISummer15wmLHEGS-00222__v1_T_180611_192556_7810           4285.0                142.0   4143.0
7        pdmvserv_task_SMP-RunIISummer15wmLHEGS-00227__v1_T_180416_194004_2476           7852.0               4665.0   3187.0
8      pdmvserv_task_SMP-PhaseIISummer17GenOnly-00022__v1_T_180607_163200_4994           2559.0                 35.0   2524.0
9   asikdar_RVCMSSW_10_2_0_pre6HydjetQ_B12_5020GeV_2018_PU__180629_181916_5151           2434.0                415.0   2019.0
10       pdmvserv_task_SMP-RunIISummer15wmLHEGS-00223__v1_T_180416_193956_2123           1697.0                  2.0   1695.0
11       pdmvserv_task_HIG-RunIISummer18wmLHEGS-00004__v1_T_180604_003844_1884           2612.0                965.0   1647.0
12         pdmvserv_task_HIG-RunIIFall17wmLHEGS-01913__v1_T_180608_124404_6320           2089.0                447.0   1642.0
13  vlimant_ACDC0_task_JME-RunIISummer18wmLHEGS-00002__v1_T_180619_140519_3271           1815.0                510.0   1305.0
14         pdmvserv_task_HIG-RunIIFall17wmLHEGS-00917__v1_T_180606_161025_1459           1498.0                226.0   1272.0
15            pdmvserv_task_EGM-RunIISummer18GS-00016__v1_T_180605_170606_3059           1444.0                222.0   1222.0
16       pdmvserv_task_TRK-RunIISpring18wmLHEGS-00001__v1_T_180524_151259_1146           1483.0                292.0   1191.0
17            pdmvserv_task_EGM-RunIISummer18GS-00024__v1_T_180605_170611_5611           1438.0                388.0   1050.0
18                   pdmvserv_task_HIN-HiFall15-00303__v1_T_180502_124221_8353           1802.0                834.0    968.0
19         pdmvserv_task_SMP-RunIIFall17wmLHEGS-00021__v1_T_180415_203914_1639           1194.0                259.0    935.0
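
(For context, a minimal sketch of how such a per-workflow comparison can be produced, assuming the per-job records from both data sets have been collected into pandas DataFrames wm and jm that each carry a TaskMonitorId column; the names are illustrative and this is not necessarily the exact code used for the table above.)

import pandas as pd

def compare_counts(wm, jm):
    """Count jobs per TaskMonitorId in each data set and compute the difference."""
    wm_counts = wm.groupby('TaskMonitorId').size().rename('wmarchive_count')
    jm_counts = jm.groupby('TaskMonitorId').size().rename('jobmonitoring_count')
    merged = pd.concat([wm_counts, jm_counts], axis=1).fillna(0)
    merged['diff'] = merged['wmarchive_count'] - merged['jobmonitoring_count']
    return merged.sort_values('diff', ascending=False).reset_index()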

The data sets are collected from the following HDFS locations:

  • JobMonitoring: hdfs:///project/awg/cms/jm-data-popularity/avro-snappy
  • WMArchive/FWJR: hdfs:///cms/wmarchive/avro/fwjr

Is there anything I simply may have missed about the data sets? Any pointers are very much appreciated.

If any additional information is needed, I’ll be happy to provide it.

vkuznet commented 6 years ago

Maximilian, sorry for the late reply; I was on vacation and didn't have regular internet access.

The WMArchive data covers both the WMAgent and CRAB systems. The former is located in the hdfs:///cms/wmarchive/avro/fwjr HDFS area, the latter in hdfs:///cms/wmarchive/avro/crab.

The JobMonitoring data collects data from all CMS jobs, including WMAgent, CMSSW, CRAB, etc.

It may be that you see only a portion due to different accounting algorithms. For example, in WMArchive we store failed jobs, while I don't know if JobMonitoring counts them. Since I don't know how you perform your job selection, I can't comment further. But even if you show me your CMSSpark/WMArchive code, we may not know the full truth, since I don't know exactly how the JobMonitoring documents are collected.

One thing you can check is to look at your WMArchive docs and find out their status. For that you'll need to look up and store the jobtype and jobstatus attributes of the documents; see https://github.com/dmwm/WMArchive/blob/master/src/python/WMArchive/Schemas/FWJRProduction.py#L11
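
(A minimal sketch of such a check, assuming the FWJR records have been loaded into an RDD of Python dicts, e.g. with CMSSpark's avro_rdd helper, and that they carry top-level jobtype and jobstatus fields as in the schema linked above; the names are illustrative.)

def count_job_status(fwjr_rdd):
    """Tally (jobtype, jobstatus) pairs over an RDD of FWJR documents."""
    return fwjr_rdd \
        .map(lambda doc: (doc.get('jobtype'), doc.get('jobstatus'))) \
        .countByValue()

# e.g. print(sorted(count_job_status(fwjr_rdd).items()))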

Best, Valentin.

mxsg commented 6 years ago

Valentin,

thank you for the pointers. I still have to check the WMArchive status information.

As I understand it, failed jobs are also included in the JobMonitoring data (confirmed by the exit code field in the data set). I compared the results I am getting with the statistics from this CMS Dashboard page. I assumed it uses the same JobMonitoring data that I should be getting from the HDFS archive, but the job counts there are much higher than in my JobMonitoring data set.

In comparison, the WMArchive data matches the status page much better. For example, for the page linked above (GridKa, 10.08.2018), the dashboard reports 37887 completed jobs, but I am only getting 10980 entries in the JobMonitoring data set, compared to 30945 records from WMArchive (a gap that cannot be explained by failed jobs alone).

For the JobMonitoring data, I am now using a very simple CMSSpark-based script while trying to find out why reports might be missing. It only filters the records by site and otherwise extracts all columns:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Spark script to parse JobMonitoring records on HDFS.
"""

import calendar
import datetime
import hashlib
import re
import time

from CMSSpark.conf import OptionParser
from CMSSpark.schemas import schema_jm
from CMSSpark.spark_utils import avro_rdd
from CMSSpark.spark_utils import spark_context
from CMSSpark.utils import info
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def jm_date(date):
    """Convert given date into JobMonitoring date format"""
    if not date:
        date = time.strftime("year=%Y/month=%-m/day=%d", time.gmtime(time.time() - 60 * 60 * 24))
        return date
    if len(date) != 8:
        raise Exception("Given date %s is not in YYYYMMDD format")
    year = date[:4]
    month = int(date[4:6])
    day = int(date[6:])
    return 'year=%s/month=%s/day=%s' % (year, month, day)

def jm_date_unix(date):
    """Convert JobMonitoring date into UNIX timestamp"""
    return time.mktime(time.strptime(date, 'year=%Y/month=%m/day=%d'))

# Global patterns
PAT_YYYYMMDD = re.compile(r'^20[0-9][0-9][0-1][0-9][0-3][0-9]$')
PAT_YYYY = re.compile(r'^20[0-9][0-9]$')
PAT_MM = re.compile(r'^(0[1-9]|1[012])$')
PAT_DD = re.compile(r'^(0[1-9]|[12][0-9]|3[01])$')

def dateformat(value):
    """Return seconds since epoch for provided YYYYMMDD or number with suffix 'd' for days"""
    msg = 'Unacceptable date format, value=%s, type=%s,' \
          % (value, type(value))
    msg += " supported format is YYYYMMDD or number with suffix 'd' for days"
    value = str(value).lower()
    if PAT_YYYYMMDD.match(value):  # we accept YYYYMMDD
        if len(value) == 8:  # YYYYMMDD
            year = value[0:4]
            if not PAT_YYYY.match(year):
                raise Exception(msg + ', fail to parse the year part, %s' % year)
            month = value[4:6]
            date = value[6:8]
            ddd = datetime.date(int(year), int(month), int(date))
        else:
            raise Exception(msg)
        return calendar.timegm((ddd.timetuple()))
    elif value.endswith('d'):
        try:
            days = int(value[:-1])
        except ValueError:
            raise Exception(msg)
        return time.time() - days * 24 * 60 * 60
    else:
        raise Exception(msg)

def range_dates(trange):
    """Provides dates range in HDFS format from given list"""
    out = [jm_date(str(trange[0]))]
    if trange[0] == trange[1]:
        return out
    tst = dateformat(trange[0])
    while True:
        tst += 24 * 60 * 60
        tdate = time.strftime("%Y%m%d", time.gmtime(tst))
        out.append(jm_date(tdate))
        if str(tdate) == str(trange[1]):
            break
    return out

def hdfs_path(hdir, dateinput):
    """Construct HDFS path for JM data"""
    dates = dateinput.split('-')
    if len(dates) == 2:
        return ['%s/%s' % (hdir, d) for d in range_dates(dates)]
    dates = dateinput.split(',')
    if len(dates) > 1:
        return ['%s/%s' % (hdir, jm_date(d)) for d in dates]
    return ['%s/%s' % (hdir, jm_date(dateinput))]

def run(date, fout, hdir, yarn=None, verbose=None):
    """
    Main function to run pyspark job. It requires a schema file, an HDFS directory
    with data and optional script with mapper/reducer functions.
    """

    # Define a spark context (main object which allows to communicate with Spark)
    ctx = spark_context('cms', yarn, verbose)
    sql_context = HiveContext(ctx)

    def filter_record(row, conditions=None):
        """Filter a record based on conditions for its fields."""

        if not conditions:
            return True

        for column, condition in conditions.iteritems():
            if row.get(column) != condition:
                return False

        return True

    rdd = avro_rdd(ctx, sql_context, hdfs_path(hdir, date), '', verbose)

    if verbose:
        print("Number of entries before filtering: %s" % str(rdd.count()))

    filter_conditions = {'SiteName': 'T1_DE_KIT'}
    filtered_rdd = rdd.filter(lambda row: filter_record(row, filter_conditions))

    if verbose:
        print("Number of entries after filtering: %s" % str(filtered_rdd.count()))
        print(filtered_rdd.take(1))

    jdf = sql_context.createDataFrame(filtered_rdd, schema=schema_jm())

    # Creates a unique identifier by hashing the provided columns with MD5
    def create_identifier(job_id, start, stop):
        h = hashlib.md5()
        col_list = [job_id, start, stop]
        for value in col_list:
            h.update(str(value).encode('utf-8'))

        return h.hexdigest()

    # Define function for computing the unique identifier and add to data frame
    identifier_udf = udf(create_identifier, StringType())
    jdf = jdf.withColumn('UniqueID',
                         identifier_udf(jdf['JobId'],
                                        jdf['StartedRunningTimeStamp'],
                                        jdf['JobExecExitTimeStamp']))

    if verbose:
        print("Number of entries in data frame: %s" % str(jdf.count()))

    if fout:
        jdf.write.format("com.databricks.spark.csv").option("header", "true").save(fout)

        # optional: jdf.coalesce(1).write... (only writes header once)

    ctx.stop()

@info
def main():
    """Main Function"""
    # Handle script options
    optmgr = OptionParser('jm')
    opts = optmgr.parser.parse_args()
    print("Input arguments: %s" % opts)

    default_hdir = 'hdfs:///project/awg/cms/jm-data-popularity/avro-snappy'
    # default_hdir = 'hdfs:///project/awg/cms/job-monitoring/avro-snappy'

    hdir = opts.hdir if opts.hdir else default_hdir

    # Run main script
    run(opts.date, opts.fout, hdir, opts.yarn, opts.verbose)

if __name__ == '__main__':
    main()

Best, Maximilian

mxsg commented 6 years ago

Valentin, another thing I was previously unsure about is the exact difference between the two HDFS locations with JobMonitoring information:

  • hdfs:///project/awg/cms/jm-data-popularity/avro-snappy
  • hdfs:///project/awg/cms/job-monitoring

As far as I understand it, jm-data-popularity contains only a subset of the fields contained in job-monitoring, but should otherwise be unrelated to the question of the potentially missing reports?
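
(A quick way to check this field difference directly; a minimal sketch that assumes the Databricks spark-avro reader is available, as the CSV writer in the script above suggests, and that both areas use the year=/month=/day= partition layout. The partition path below is just an example.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('jm_schema_diff').getOrCreate()

base = 'hdfs:///project/awg/cms/%s/avro-snappy/year=2018/month=8/day=10'
pop = spark.read.format('com.databricks.spark.avro').load(base % 'jm-data-popularity')
jm = spark.read.format('com.databricks.spark.avro').load(base % 'job-monitoring')

pop_cols, jm_cols = set(pop.columns), set(jm.columns)
print('fields only in job-monitoring: %s' % sorted(jm_cols - pop_cols))
print('fields only in jm-data-popularity: %s' % sorted(pop_cols - jm_cols))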

Best, Maximilian

vkuznet commented 6 years ago

Maximilian, a couple of questions:

  • do you run over the fwjr collection of WMArchive or both (the fwjr and crab ones)?
  • you seem to filter records for site T1_DE_KIT, but the site names used by different systems (DBS, PhEDEx, transfers) may have additional variants, e.g. T1_DE_KIT_Buffer, T1_DE_KIT_Disk and T1_DE_KIT_MSS. Can you try to match all names via a pattern?

vkuznet commented 6 years ago

Maximilian, here you need to truly understand how data are fed into these areas. As far as I know, these data originate from these scripts: https://gitlab.cern.ch/awg/awg-ETL-crons/blob/master/sqoop/cms-jm.sh and https://gitlab.cern.ch/awg/awg-ETL-crons/blob/master/sqoop/jm-cms-data-pop.sh, which are run by CERN IT to extract data records from the CMSSW popularity Oracle database. Assuming that the extraction step is doing its job correctly, you need to ask who puts data into this database and how.

I'm CC'ing Carl, who may be able to answer this question.

Best, Valentin. P.S. Carl, I don't know your GitHub user name; could you please respond to the GitHub ticket directly so that everybody stays in the loop.

mxsg commented 6 years ago

Valentin,

  • do you run over the fwjr collection of WMArchive or both (the fwjr and crab ones)?

I am for now only running over the FWJR collection (I previously didn't know that CRAB reports were also collected).

  • you seem to filter records for site T1_DE_KIT, but the site names used by different systems (DBS, PhEDEx, transfers) may have additional variants, e.g. T1_DE_KIT_Buffer, T1_DE_KIT_Disk and T1_DE_KIT_MSS. Can you try to match all names via a pattern?

I only matched the exact T1_DE_KIT identifier because, a while ago, I checked for alternative names by looking at the distinct site names in the unfiltered data set and only found these unsuffixed names. I will revisit this and try to find out whether I am missing suffixed site names after all; a sketch of the pattern-based check I have in mind is below.
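
(A minimal sketch of that check, assuming rdd is the unfiltered JobMonitoring RDD of dicts obtained with avro_rdd as in the script above; the regular expression is only an illustration.)

import re

KIT_PAT = re.compile(r'^T1_DE_KIT(_.*)?$')  # T1_DE_KIT plus any suffixed variant

def is_kit(row):
    """Return True if the record's SiteName is T1_DE_KIT or a suffixed variant."""
    return bool(KIT_PAT.match(row.get('SiteName') or ''))

kit_rdd = rdd.filter(is_kit)
print(kit_rdd.map(lambda row: row['SiteName']).distinct().collect())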

For the example above (jobs from 10.08.2018 only), these are the distinct site names in the JobMonitoring data set; they include only the unsuffixed T1_DE_KIT:

+--------------------+
|            SiteName|
+--------------------+
|     T2_UK_London_IC|
|           T2_RU_INR|
|      T2_CH_CERN_HLT|
|          T2_RU_IHEP|
|             unknown|
|           T1_ES_PIC|
|        T2_ES_CIEMAT|
|          T2_IT_Pisa|
|           T2_FI_HIP|
|     T2_FR_GRIF_IRFU|
|   T3_UK_London_RHUL|
|       T1_FR_CCIN2P3|
|      T2_US_Nebraska|
|          T2_IT_Bari|
|       T2_CN_Beijing|
|           T2_BE_UCL|
|        T2_BR_SPRACE|
|           T3_KR_KNU|
|           T3_US_UCR|
|          T2_FR_IPHC|
|          T2_RU_JINR|
|         T2_KR_KISTI|
|           T3_TW_NCU|
|      T2_GR_Ioannina|
|        T3_US_Baylor|
|          T2_US_UCSD|
|       T3_TW_NTU_HEP|
|          T2_TR_METU|
|          T1_IT_CNAF|
|       T2_US_Caltech|
|          T2_CH_CSCS|
|       T2_EE_Estonia|
|T3_CH_CERN_HelixN...|
|          T2_CH_CERN|
|        T2_US_Purdue|
|       T2_IT_Legnaro|
|           T2_US_MIT|
|      T2_HU_Budapest|
|           T2_KR_KNU|
|       T3_IT_Trieste|
|          T1_RU_JINR|
|       T3_US_Rutgers|
|         T3_US_UMiss|
|          T2_IN_TIFR|
|        T2_PL_Warsaw|
|        T2_PL_Swierk|
|   T2_UK_SGrid_RALPP|
|           T3_US_PSC|
|          T3_US_TAMU|
|          T2_UA_KIPT|
|          T2_BE_IIHE|
|     T3_IN_TIFRCloud|
|      T3_US_Colorado|
|    T2_PT_NCG_Lisbon|
|   T2_IT_LegnaroTest|
|       T2_US_Florida|
|          T2_DE_RWTH|
| T2_UK_SGrid_Bristol|
|          T2_RU_ITEP|
|          T2_ES_IFCA|
|           T1_UK_RAL|
|       T2_FR_CCIN2P3|
|           T2_PK_NCP|
|          T2_DE_DESY|
|           T3_US_OSG|
|  T3_UK_SGrid_Oxford|
|          T3_FR_IPNL|
| T2_UK_London_Brunel|
|           T1_DE_KIT|
|          T2_IT_Rome|
|  T3_UK_ScotGrid_GLA|
|          T1_US_FNAL|
|    T2_US_Vanderbilt|
|         T3_US_NERSC|
|     T2_US_Wisconsin|
|      T2_FR_GRIF_LLR|
|      T2_CH_CSCS_HPC|
+--------------------+

Thank you again for your help!

Best, Maximilian

vkuznet commented 6 years ago

The Oracle CMSSW Data Popularity database is called CMS_CMSSW_POPULARITY@CMSR. It is written to by a DB consumer service named cmssw-collector running on vocms044. Details are here: https://twiki.cern.ch/twiki/bin/view/CMS/CMSSW-Popularity

---Carl

cvuosalo commented 6 years ago

I'm joining this thread, so you can ask me any additional questions here.

---Carl

vkuznet commented 5 years ago

do we need to keep this ticket open?