LSSTDESC / SRV-planning

Repository to plan and coordinate some of the Science Release and Validation Working Group tasks

Site for SRV Rubin data testing #7

Closed: nsevilla closed this issue 1 year ago

nsevilla commented 2 years ago

Separate from the question of where the analysis will take place is the question of where the testing of the catalogs will happen.

NERSC: pros: same platform as analysis.

RSP: pros: offers a quicker access to data releases. Cons: resources are limited ("10% of Rubin resources").

IN2P3: is it interesting to ensure interoperability here as well?

We should discuss which site is most interesting, and the extent of the limitations of running at the RSP.

jchiang87 commented 2 years ago

The computing resources that are planned to be available via the RSP are covered in the Rubin Observatory Operations Plan (version dated November 5, 2021), in section 17.2.2 Compute Requirements, page 235,

Cores are allocated to user-driven processing in the Science Platform as a ratio of the total available compute system, following the construction Science Requirements Document (LPM-17). On this basis, we assume that:

  • US Data Facility user computing is sized as 10% of the total Data Release Processing compute, amounting to over 500 cores at the start of operations;
  • Chilean Data Access Center computing is sized as 20% of USDF user computing;
  • Project staff computing is sized as 10% of USDF user computing.

It will be possible to dynamically reallocate CPU cores between services at the US Data Facility. In particular, it will be possible to reallocate cores from Data Release Processing to the Science Platform at times of high demand, assuming that the long-term average level of Data Release compute is adequate to meet the release schedule.

and in section 18.3.1 Science Platform Resource Allocation, page 274,

Each RSP user account will be provided with a baseline level of resources, to be determined based on the initial size of the user base and the average number of concurrent users the RSP experiences, in practice. The current plan is for 500 cores for the US DAC in 2023 (increasing annually), dedicated to preliminary end-user science analyses (e.g. working on small numbers of images), and the creation of user-generated data products. Qserv has its own resources for catalog queries. For the RSP the minimum resource allocation, per concurrent user, is half a core. A good analogy is one of being given a server with a few TB of disk, a few TB of database storage that is co-located next to Rubin Observatory data, and with a chance to use tens to hundreds of cores for analysis (depending on system load). Cluster resources beyond this baseline allocation will be requested from and approved by a Rubin Observatory Resource Allocation Committee – analogous to a Telescope Allocation Committee; see Section 12.

Finally, in section 12.2 Advisory Committees, page 148, it states

It is anticipated that some individuals or groups of Rubin Observatory users will require storage and/or computational processing resources in excess of the basic quota allocated to all user accounts, and therefore, a Resource Allocation Committee (RAC) is needed to develop and implement a process for handling such requests. The RAC will be analogous to a Time Allocation Committee (TAC) for a telescope, except that here the limited resource would be compute cycles, disk space for storage, and potentially significantly large bulk data downloads. As with a TAC, the RAC would consider the scientific justification for increases to the basic quota for users and be advisory to the Operations Director. For this reason, its membership should include representatives from the science community.
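
For a rough sense of scale, the ratios quoted above can be turned into approximate numbers. The sketch below is only a back-of-the-envelope illustration based on the quoted 500-core figure and the half-core-per-user floor; the actual sizing will depend on the deployed hardware and the RAC process.

# Back-of-the-envelope numbers implied by the quoted allocation ratios
# (illustrative only, not an official sizing).
usdf_user_cores = 500                        # "over 500 cores at the start of operations"
chile_dac_cores = 0.20 * usdf_user_cores     # Chilean DAC sized as 20% of USDF user computing
staff_cores = 0.10 * usdf_user_cores         # Project staff sized as 10% of USDF user computing

min_cores_per_user = 0.5                     # minimum RSP allocation per concurrent user
max_concurrent_users = usdf_user_cores / min_cores_per_user

print(f"Chilean DAC: ~{chile_dac_cores:.0f} cores; Project staff: ~{staff_cores:.0f} cores")
print(f"At the 0.5-core floor, {usdf_user_cores} cores support ~{max_concurrent_users:.0f} concurrent users")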

nsevilla commented 2 years ago

I think the best we can do then is to try to quantify the resources needed for the core test suite (CPU, RAM) and see how that matches this general allocation.

There is still some missing information here, namely how stressed these 500 cores will be right after a data release (or, how resources are going to be shared among RSP users).
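
As a starting point, and assuming the core tests are ordinary Python callables, something like the sketch below could record wall time, CPU time, and peak memory for a representative test. The test function named in the usage comment is a hypothetical placeholder.

import resource
import time

def profile_test(test_func, *args, **kwargs):
    """Run one validation test and report wall time, CPU time, and peak RSS."""
    t0 = time.perf_counter()
    cpu0 = resource.getrusage(resource.RUSAGE_SELF)
    result = test_func(*args, **kwargs)
    cpu1 = resource.getrusage(resource.RUSAGE_SELF)
    wall = time.perf_counter() - t0
    cpu = (cpu1.ru_utime + cpu1.ru_stime) - (cpu0.ru_utime + cpu0.ru_stime)
    peak_rss_gb = cpu1.ru_maxrss / 1024**2   # ru_maxrss is reported in kB on Linux
    print(f"{test_func.__name__}: wall={wall:.1f} s, cpu={cpu:.1f} s, peak RSS={peak_rss_gb:.2f} GB")
    return result

# Hypothetical usage with one of the SRV tests:
# profile_test(check_astrometric_scatter, object_catalog)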

nsevilla commented 2 years ago

The possible alternatives are:

All the following alternatives require moving large chunks of data around.

nsevilla commented 2 years ago

I recently heard in a talk by Will O'Mullane that part of the compute resources will be cloud based during Operations; I think this is a shift from the previous paradigm.

nsevilla commented 2 years ago

@katrinheitmann, @jchiang87 please comment here if there is any progress with the Project on getting DP0.2 to NERSC, so we can study using it as a testbed for SRV.

jchiang87 commented 2 years ago

We can follow up with the project. However, a big constraint will be getting the data off of the datastore being used by the IDF (Intermediate Data Facility), which has been processing DP0.2 on Google compute and storage resources. As I understand things, moving data within the system for processing with the Rubin/LSST code, e.g., using the RSP, doesn't cost anything, but exporting the data outside of the system requires real money to be spent. If the DP0.2 data will be copied to SLAC anyway, for example, then there may be a possibility for DESC to obtain a copy. If not, then it seems unlikely.

jchiang87 commented 2 years ago

I asked Melissa Graham about the prospects for getting DP0.2 data at NERSC. She confirmed that bulk downloads of data are not generally available during DP0, but she would ask at the next Data Preview Coordination meeting on Mar 14 if something could be done for DESC.

johannct commented 2 years ago

About the "moving large chunks of data around": IN2P3 will do it anyway, and 50% of a DRP is done there anyway, if the current plan stays. What was the outcome of the Mar 14 meeting wrt @jchiang87's mention just above?

jchiang87 commented 2 years ago

What was the outcome of the Mar 14 meeting wrt @jchiang87's mention just above?

The Project agreed in principle to transfer the object catalogs to NERSC (nominally from USDF), but the details still need to be worked out, i.e., which data products will be included aside from the final multiband object catalogs.

jchiang87 commented 2 years ago

I contacted Melissa today, and she expects that we could arrange for the DP0.2 transfers to NERSC to happen in ~early July, after deployment to the RSP and on-boarding of the new DP0 delegates. We'll discuss with the Rubin Data Preview team at their bi-weekly Tuesday meetings to coordinate these transfers.

jchiang87 commented 2 years ago

Thanks to the help of Melissa Graham and others on the Rubin DP0 team and folks at USDF, we now have copies of the DP0.2 catalogs available at NERSC. There are 9 sets of catalogs in /global/cfs/cdirs/lsst/shared/rubin/DP0.2, including final object catalogs, visit-level source catalogs, forced source catalogs, and visit-level metadata:

desc@cori03:/global/cfs/cdirs/lsst/shared/rubin/DP0.2> ls -ld *
-rw-rw----+ 1 desc lsst     317 Jun 15 08:40 00README
drwxrws---+ 2 desc lsst 4194304 Jun 15 08:21 calibratedSourceTable/
drwxrws---+ 2 desc lsst    4096 Jun 15 08:21 ccdVisitTable/
drwxrws---+ 2 desc lsst   32768 Jun 15 08:21 diaObjectTable/
drwxrws---+ 2 desc lsst   32768 Jun 15 08:21 diaSourceTable/
drwxrws---+ 2 desc lsst   32768 Jun 15 08:21 forcedSourceOnDiaObjectTable/
drwxrws---+ 2 desc lsst 1048576 Jun 15 08:21 forcedSourceTable/
drwxrws---+ 2 desc lsst   32768 Jun 15 08:21 match_ref_truth_summary_objectTable/
drwxrws---+ 2 desc lsst   32768 Jun 15 08:21 objectTable/
drwxrws---+ 2 desc lsst    4096 Jun 15 08:21 visitTable/

These are the same data that have been ingested into Qserv for DP0.2.
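
For reference, here is a minimal sketch of how one might start inspecting these catalogs at NERSC, assuming the per-tract files are Parquet and readable with pandas/pyarrow; the file-name pattern and column names below are assumptions that should be checked against the 00README and the DP0.2 schema.

import glob
import pandas as pd

DP02_DIR = "/global/cfs/cdirs/lsst/shared/rubin/DP0.2"

# Grab one objectTable file (file naming assumed; confirm by listing the directory).
files = sorted(glob.glob(f"{DP02_DIR}/objectTable/*.parq*"))

# Read a few columns of interest (column names assumed from the DP0.2 schema).
df = pd.read_parquet(files[0], columns=["coord_ra", "coord_dec", "r_cModelFlux"])
print(len(df), "objects in", files[0])
print(df.describe())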

nsevilla commented 1 year ago

I am closing this issue since, de facto, we will probably be using NERSC throughout the Data Preview era, with supporting testing at the RSP. It can be reopened later if anything needs to be discussed further.