LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Issues/286 #294

Closed wmwv closed 5 years ago

wmwv commented 6 years ago

Add --name option to merge_tract_cat.py and set default to be object.
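A minimal sketch of what the proposed option might look like, assuming the script uses argparse; the argument names besides `--name` are illustrative, not taken from the actual `merge_tract_cat.py`:

```python
# Hypothetical sketch of the proposed --name option for merge_tract_cat.py.
# Only --name and its default come from this PR; the rest is illustrative.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Merge per-tract catalog files into one output catalog.")
    parser.add_argument("--name", default="object",
                        help="Catalog name used for the output (default: object).")
    return parser
```

With this default, running the script with no `--name` argument targets the object catalog, while e.g. `--name truth` would select a different one.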

yymao commented 5 years ago

This PR should probably be updated with the new features that we introduced in #301.

I also wonder if this generic script is still useful, given that for different catalogs we probably want to partition them differently when storing them, e.g., in tracts for object catalogs, in healpixels (though maybe with a small nside) for cosmoDC2.

There is a lot of repeated code between write_gcr_to_parquet.py in this PR and convert_merged_tract_to_dpdd.py, but they are not easily mergeable. Maybe what we should do is provide just some helper functions instead of a full script. Or, alternatively, write a base class and make subclasses for each type of catalog?
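The base-class idea could be sketched roughly like this; the class and method names are hypothetical and not from either existing script, and the shared logic is reduced to a grouping step for illustration:

```python
# Hypothetical sketch: shared catalog-to-parquet logic in a base class, with
# the partitioning choice (tract vs. healpixel) left to per-catalog subclasses.
from abc import ABC, abstractmethod

class CatalogWriter(ABC):
    """Shared logic for writing a catalog out in partitioned files."""

    @abstractmethod
    def partition_key(self, row):
        """Return the partition a row belongs to (e.g. tract or healpixel)."""

    def group_rows(self, rows):
        # Group rows by partition so each partition becomes one output file.
        groups = {}
        for row in rows:
            groups.setdefault(self.partition_key(row), []).append(row)
        return groups

class ObjectCatalogWriter(CatalogWriter):
    def partition_key(self, row):
        return row["tract"]  # object catalogs partitioned by tract

class CosmoDC2Writer(CatalogWriter):
    def partition_key(self, row):
        return row["healpix_pixel"]  # cosmoDC2 partitioned by healpixel
```

The subclasses only encode the partitioning decision, so a new catalog type would just add another small subclass.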

@wmwv @plaszczy thoughts? @plaszczy what's your experience when converting cosmoDC2?

wmwv commented 5 years ago

Yes, that's exactly where I got stuck on this branch. I wasn't sure what to do about partitioning.

plaszczy commented 5 years ago

I have just begun looking at it, but it seems to me it is so obvious that it does not deserve a script. This is just reading some columns with GCR and writing the parquet from pandas. In my experience there is no point in writing several files (just one huge one) and using a 'simple' scheme. I also tested compression (gzip): it slows down the writing significantly, but reading time is similar, so I think it is worth it (on 1.2p I went from 18 GB to 13 GB, so it is sizable). Making all this generic with options seems unnecessary to me. Someone should just write the file. I was planning to write just a few columns so as not to burn too much disk space, but I can do it for all. Tell me. I also noticed that GCR returns float64. Don't you think float32 is sufficient?
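The two savings mentioned (float32 instead of float64, and gzip-compressed parquet) can be sketched as below; the column names are placeholders for whatever is read from the GCR, and the parquet call is shown commented out since it needs pyarrow or fastparquet installed:

```python
# Rough sketch of the space savings discussed above: downcast the float64
# columns the GCR returns to float32, then write gzip-compressed parquet.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"ra": rng.uniform(0, 5, 1000),
                   "dec": rng.uniform(-35, -25, 1000)})  # float64, as from GCR

# float32 halves the in-memory (and roughly the on-disk) size per column.
df32 = df.astype(np.float32)

# gzip slows the write but leaves read times similar, and shrinks the file
# (18 GB -> 13 GB on 1.2p in the test described above):
# df32.to_parquet("catalog.parquet", compression="gzip")
```

Whether float32 is sufficient depends on the quantity: ~7 decimal digits is marginal for sky coordinates at LSST astrometric precision, so positions may deserve to stay float64 even if fluxes do not.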

plaszczy commented 5 years ago

I also noticed that all the magerr quantities are missing.

plaszczy commented 5 years ago

Also size and the mag_band alias.

yymao commented 5 years ago

@plaszczy not all quantities listed in SCHEMA.md are available. The idea is that if a quantity's name matches one in SCHEMA.md, it must mean the same thing. But you should use .list_all_quantities() to find out which quantities are actually available.
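In practice that means intersecting the SCHEMA.md quantities you want with what the catalog reports. The snippet below mocks the catalog side; with GCRCatalogs the `available` set would come from `catalog.list_all_quantities()`:

```python
# Illustration of the advice above: check which SCHEMA.md quantities a catalog
# actually provides before requesting them. `available` is a mocked stand-in
# for the real catalog.list_all_quantities() call.
wanted = {"ra", "dec", "mag_r", "magerr_r"}
available = {"ra", "dec", "mag_r"}  # mock of catalog.list_all_quantities()

missing = wanted - available   # quantities to skip or derive elsewhere
usable = wanted & available    # quantities safe to request
```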

wmwv commented 5 years ago

@plaszczy I agree it's simple. There's a long-term benefit to having a set of scripts that one can use to describe what was done to generate the files.

plaszczy commented 5 years ago

@wmwv Sure, I am not advocating against following these scripts. But the question is to what level they should be generic and try to match everyone's needs. In fact, by adding many optional arguments you lose the benefit of knowing what was really run (I know there are some defaults, but who knows what I used in the end?).

There is the same problem with documentation and communication: too extreme a granularity. There are so many sources to consult to get the global picture (for cosmoDC2 I made my own page with all the interesting sources). It took me a lot of time to understand how all the dpdd generation works (your script/readme was useful), but only recently did I (partly) grasp the GCR internals. And information on Slack is often lost in noise. So I think the first point is to distinguish between developers and users, and to make some pages that gather all the information about a single topic. There is also a very excessive usage of GitHub (because it is easier to say what must be done rather than to do it). Some time ago I was in HEP (BaBar). The difference was that at the very top there was a computing group that provided very strict guidelines; it was much more efficient. I know it is a different world.

But where are we? At the end of 2018, DC2 is still DC1.2 (5 deg), and my very first look immediately shows it's rubbish. Who still honestly thinks we will release science papers next year while commissioning begins? This exercise must be analyzed and conclusions drawn.

fjaviersanchez commented 5 years ago

I know there are some defaults but who knows what I used finally?

@plaszczy I agree that can be a problem. It is a bookkeeping problem, but the settings can probably be dumped into human-readable configuration files (yaml or json) and we can keep track of them in this repo (or another repo, or Confluence pages...).
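One possible form of that bookkeeping, sketched with the stdlib json module; the catalog name, keys, and file path are illustrative, not an agreed convention:

```python
# Sketch of the bookkeeping suggested above: dump the settings a script
# actually ran with (defaults included) to a human-readable JSON file that
# can be committed alongside the outputs. All names here are illustrative.
import json
import os
import tempfile

settings = {"catalog": "dc2_object_run1.2p",   # hypothetical catalog name
            "columns": ["ra", "dec", "mag_r"],
            "compression": "gzip"}

path = os.path.join(tempfile.gettempdir(), "run_settings.json")
with open(path, "w") as f:
    json.dump(settings, f, indent=2, sort_keys=True)

# Anyone can later reload the file to see exactly what was run.
with open(path) as f:
    loaded = json.load(f)
```

Because the file records resolved values rather than command-line flags, it answers "who knows what I used finally?" even when defaults change later.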

There is the same problem with documentation and communication, a too extreme granularity.

Thanks for pointing this out, I guess that these are "the perks" of being a pioneer using these data. We want to make the user experience as good as possible. Having this kind of feedback is very useful for us. If you can think of any resources that would have made your experience better/easier, we are interested in hearing about this.

But where are we? End 2018 DC2 is still DC1.2 (5deg) and my very first look shows immediately its rubbish. Who still honestly think we will release science papers next year while commissioning begins? This exercise must be analyzed and conclusions drawn

Thanks for your work checking these data. Are these the results you are referring to? I am interested in learning more about the problems that you found in the data (I'll follow up with you on this topic privately), and it would be great if you could repeat the same exercise on 1.2i (I think that 1.1p won't give much better results than 1.2p). I do not agree that the data are rubbish, though.

I think that the edges are deeper (I don't really understand why, but @egawiser and @humnaawan may have some insight, since I'd say it may be related to the dithering strategy or to cutting out the sensors from the outer regions); I'll compute the depth maps to check this (the routine that I used is here and is called depth_map_snr; please feel free to use it/improve it). 1.7M detected objects sounds low for full-depth images (checking your last plot I see ~14 galaxies/sq-arcmin), but I believe the depth was a little low (at least in old tests that I did when the processing wasn't fully finished; I have to re-check this). Also, we have to take into account that the box goes only up to z=1, though I don't remember if there was some oversampling to compensate for this (@yymao may remember).

I am also concerned about the timing, and I agree that we should analyze what we have asap (your contributions are fundamental to doing this in a timely manner).

yymao commented 5 years ago

Yes, protoDC2 (the extragalactic catalog that Runs 1.1p, 1.2p, and 1.2i are based on) contains galaxies only up to redshift z=1, and as far as I remember it is not oversampled to compensate.

katrinheitmann commented 5 years ago

@plaszczy Stephane, I am somewhat surprised by the tone of your response here and honestly do not appreciate it. A lot of people are working very hard on providing the data for the collaboration and remarks like “because it is easier to say what must be done rather than doing it” are not appropriate.

First, a couple of thoughts about the aims of DC2. We are attempting a fully end-to-end simulation of LSST data in a reasonably sized patch of the sky. I think it is fair to say that there have not been many (if any) attempts at something like this in the cosmology community before. Therefore, while parts of each step have been done in the past, putting everything together at the scale of DC2 has proven to be a challenge in itself. We are learning a lot in this process, and of course not everything has gone perfectly, but the characterization of your DESC colleagues' work as “rubbish” is unfair and unscientific. We are encouraging validation efforts in several places, e.g., https://github.com/LSSTDESC/DC2-production/issues/278, and we should make sure that the recent results you shared in the data access channel/telecon and on GitHub get looked into very carefully (some of the Run 1.2i results are very likely due to the so-far incomplete processing, since we focused only on complete visits, and as Javier points out above, some of the Run 1.2p results are most likely due to the dithering pattern). I hope you will continue sharing your findings (positive or negative) that have some bearing on whether the DC2 sims can be used to test our analysis pipelines in the way that we have planned.

The extragalactic catalog had to be created from scratch on a simulation that is more detailed than most that are out there (Euclid has a similar one), and the validation effort was a real tour de force, involving several analysis working groups. We now have an excellent catalog and the first papers about it are written. Then the instance catalog generation at scale and the processing of the resulting data are additional big challenges. Each of the steps is coming together now, and we have made a lot of progress during the year on many fronts. Your evaluation of DC2 (rubbish, 5 deg, no science) is incorrect: while the size of the test catalogs is 25 deg^2, the size of the extragalactic catalogs is several hundred square degrees; the first science papers are in preparation (you can follow along on the paper-tracking Confluence page); and a lot of progress has been made with the DM processing as well.

One major aim of DC2 is to develop good data formats and access strategies. These are still under development and need more work, so we sometimes do things that are not perfect yet. Constructive feedback is very welcome. If you want to follow along in more detail: for the extragalactic catalog the discussions are ongoing in the CS working group (there have been many discussions there about the access pattern, the schema, etc.); processing and image simulations are discussed in the DC2 DM task force as well as in the CI and SSim telecons; and general DC2 discussions take place in the DC2 telecons (I am mostly listing these for completeness, as I assume you are aware of several of them). During these telecons, we have decided as a collaboration on data formats and first data access methods.

The communication via GitHub and Slack could often be more focused, but it is important that we give everybody the opportunity to ask questions about the data in an easy fashion, so I would be very hesitant to ask people to “not say what must be done”. Keeping the documentation up to date is very important, and sometimes we fall behind on this. (Though I must say that the documentation that comes with the notebooks, the GCR, etc. is very detailed, and everybody involved has been extremely helpful in answering questions.) If you have specific suggestions for improving either, it would be great to hear them. Also, if you develop something (like a page for cosmoDC2) that you think could be helpful for others, it would be great if you could share it. One reason for the notebooks and the HackurDC2 initiative is to have people interact with the data and provide feedback so that we can improve the infrastructure.

Finally, we should keep GitHub issues and pull requests focused on the topic they are about, so I do not expect a response from you here. I’d prefer the thread to be focused again on the topic it was initiated for. I would be happy to talk more with you and/or to start new discussion threads on improving communication on Slack/Github and documentation in general.

Thanks, Katrin

plaszczy commented 5 years ago

@fjaviersanchez I already sent some results on 1.2i that are here: https://github.com/LSSTDESC/DC2-production/blob/u/plaszczy/nb_run1_2/Notebooks/footprint_Run12i_spark.ipynb

We also discussed with @katrinheitmann ways to estimate a rough number of galaxies, but what do we do next? I think it would be very important if you could help us get this number (we just need an order of magnitude).

You are right about the post on Run 1.2p. Maybe the "borders" are OK (although I thought from a discussion with Johan this was traced to a bug), but where is the "deep" field? And now also: what does it have to do with the similar Run 1.2i plot? (Extinction cannot explain it all.)

So let me know if you need more plots, I'll be glad to help.

Finally, I point out that all these points are a perfect illustration of what I meant about (over-)communication issues (not your fault, this is structural!).

katrinheitmann commented 5 years ago

The deep drilling field is in the right corner. See here: https://docs.google.com/document/d/1aQOPL9smeDlhtlwDrp39Zuu2q8DKivDaHLQX3_omwOI/edit

(The designs for Run 1.1 and Run 1.2 are the same.)

As I tried to explain above, for Run 1.2i not everything has been processed yet (at least not by Heather at the point when you made the plot); only the "full" visits have been. So you would not expect the comparison to be meaningful.

johannct commented 5 years ago

Some of the questions raised about the number of galaxies etc. are probably related to this other issue, which maybe does not have all the attention it merits: https://github.com/LSSTDESC/DC2-production/issues/235