astrolabsoftware / fink-broker

Astronomy Broker based on Apache Spark
https://fink-broker.org
Apache License 2.0

ZTF included i band in their mini experiments #838

Closed JulienPeloton closed 2 months ago

JulienPeloton commented 5 months ago

Context

Some alerts have a fid of 3 (i passband) which should not be here! It happened once in 2022 (nid=2008), but that was said to be a mistake. It never happened again until recently in 2024:

# 2022 -- when the mistake happened
+----+-----+                                                                    
| nid|count|
+----+-----+
|2008|29063|
+----+-----+

# 2023 -- no detection in i band
+----+-----+                                                                    
| nid|count|
+----+-----+
+----+-----+

# 2024 -- many detections!
+----+-----+                                                                    
| nid|count|
+----+-----+
|2682| 6777|
|2683|31831|
|2684|28712|
|2657|19565|
|2656|21655|
+----+-----+

One can see that we have two blocks of consecutive nights: [2656, 2657] and [2682, 2683, 2684]. These are correlated with mini experiments from ZTF that explicitly conducted observations in the g, r, and... i bands.
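The per-night tables above come from grouping i-band alerts by night id. A minimal plain-Python sketch of that aggregation (toy data, made-up counts; the real query runs over the Spark archive with a `groupby` on `candidate.nid`):

```python
from collections import Counter

# Toy stand-in for the archive query that produced the tables above.
# Each alert carries a night id (candidate.nid) and a filter id (candidate.fid);
# fid == 3 is the i band. Values below are illustrative only.
alerts = [
    {"nid": 2682, "fid": 3},
    {"nid": 2682, "fid": 3},
    {"nid": 2683, "fid": 3},
    {"nid": 2683, "fid": 1},  # g band -- not counted
    {"nid": 2656, "fid": 3},
]

# Equivalent of: df.filter(df['candidate.fid'] == 3).groupby('candidate.nid').count()
i_band_per_night = Counter(a["nid"] for a in alerts if a["fid"] == 3)
print(sorted(i_band_per_night.items()))
# [(2656, 1), (2682, 2), (2683, 1)]
```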

Why this has a broad impact

In 2022, I shipped a fix on the server* for the nightly processing to get rid of observations taken with the i band:

[julien.peloton@vm-75222 bin]$ git diff raw2science.py 
diff --git a/bin/raw2science.py b/bin/raw2science.py
index 8132002..d116a9e 100644
--- a/bin/raw2science.py
+++ b/bin/raw2science.py
@@ -90,7 +90,10 @@ def main():
             .filter(df['candidate.nbad'] == 0)\
             .filter(df['candidate.rb'] >= 0.55)

+        df = df.filter(df['candidate.fid'] != 3)
+        df = df.filter(~F.array_contains(df['prv_candidates.fid'], 3))
         df = apply_science_modules(df, args.noscience)
+
         timecol = 'candidate.jd'
         converter = lambda x: convert_to_datetime(x)
     elif 'diaSource' in df.columns:

*But I never pushed it properly to the main branch... this was a hotfix left here for 2 years! Very bad...

One can see it discards alerts from i band observations, but also any alert that has the i band in its history. This was necessary as all classifiers were trained on g and r data only, and assume the same at inference time. If we look at the number of alerts from the i band, this is a small percentage of the total:

import pyspark.sql.functions as F

# raw data -- before processing -- as of 2024/05/13
df = spark.read.format('parquet').load('archive/raw/year=2024')

# Number of alerts
df.count() 
# 12,592,654

# Number of objects
df.groupby('objectId').count().count()
# 5,797,825

# alerts with i band
df.filter(df['candidate.fid'] == 3).count()
# 108,540 or 0.9%

# objects with i band
df.filter(df['candidate.fid'] == 3).groupby('objectId').count().count()
# 98,917 or 1.7%

But the problem really comes from the fact that i band measurements leak into the history of alerts, so many alerts taken later in g or r are also discarded from the processing:

# alerts with i-band in the history
df.filter(F.array_contains(df['prv_candidates.fid'], 3)).count()
# 583,337 or 4.6%

We have about 5 times more alerts (i.e. each object with one i band measurement emitted on average 4 other alerts in g or r band). What puzzled me is:

# objects with i-band in the history
df.filter(F.array_contains(df['prv_candidates.fid'], 3)).groupby('objectId').count().count()
# 367,458 or 6.3%

There are 367,458 unique objects with at least one i band measurement in the history, while we counted only 98,917 unique objects with a direct i band measurement. Even adding the bad day in 2022 does not make the counts match. Let's see how many were actually detected in 2024:

df\
  .filter(F.array_contains(df['prv_candidates.fid'], 3))\
  .groupby('objectId').count()\
  .filter(F.col('objectId').contains('ZTF24'))\
  .count()
# 250,740

So even here the count is not right, still off by a factor of 2... Digging further, one should remember that ZTF includes upper limits in the history -- that is, when ZTF observed but the variation was not above the 5-sigma threshold -- and this applies to the i band during the mini survey! Take a galaxy: you observe it on the 9th of April in the i band, and you find nothing. But then on the 21st of April something happens at this location. An alert with a new objectId is emitted, but it is discarded by Fink because its history contains an upper limit in the i band... This is what happened to e.g. ZTF24aakshsg (Alerce reports only forced photometry in i and a bunch of i upper limits).
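To make the failure mode concrete, here is a toy reproduction of the hotfix logic on a simplified dict-based alert. The field names follow the ZTF packet (`candidate.fid`, `prv_candidates`), but the alert content itself is made up: a perfectly valid r-band alert is discarded only because its history contains an i-band upper limit.

```python
# Hypothetical toy alert mimicking the ZTF schema: prv_candidates holds
# both real detections and upper limits (here, upper limits have magpsf None).
alert = {
    "objectId": "ZTF24xxxxxxx",   # made-up id
    "candidate": {"fid": 2},      # the triggering detection is in the r band
    "prv_candidates": [
        {"fid": 1, "magpsf": 18.2},   # g-band detection
        {"fid": 3, "magpsf": None},   # i-band upper limit from the mini survey
    ],
}

def discarded_by_hotfix(alert):
    """Replicates the server-side hotfix: drop the alert if the current
    measurement or anything in its history is in the i band (fid == 3),
    upper limits included."""
    if alert["candidate"]["fid"] == 3:
        return True
    return any(p["fid"] == 3 for p in alert["prv_candidates"])

print(discarded_by_hotfix(alert))  # True: a valid r-band alert is lost
```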

What is next?

The first option would be to keep the current strategy, filtering out alerts with the i band (current measurement or history). This has the advantage of being fast and simple, and of leaving downstream classifiers unmodified. The drawback is obviously that we lose valid alerts.

Second option would be to modify the code of all classifiers to ignore i band measurements in the arrays when they appear. This would have the advantage of processing only the relevant part of each alert (the g and r portions). But this is not ideal: each classifier would have to mask data, which is likely to introduce overhead (not to mention that we would have to code and maintain this exception in many different pieces of code).
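The masking in the second option could look like the following sketch. Function and variable names are illustrative, not the actual Fink classifier API; the point is only that each classifier would have to pre-filter its parallel arrays before computing features.

```python
# Sketch of the per-classifier masking in option 2: before computing
# features, drop entries whose filter id is 3 (i band), keeping the
# parallel arrays (filter ids, magnitudes) aligned.
def mask_i_band(fids, mags):
    """Keep only g/r measurements (fid 1 or 2), preserving alignment."""
    kept = [(f, m) for f, m in zip(fids, mags) if f != 3]
    if not kept:
        return [], []
    f, m = zip(*kept)
    return list(f), list(m)

fids = [1, 3, 2, 3, 1]
mags = [18.1, 19.0, 18.4, 19.2, 18.0]
print(mask_i_band(fids, mags))
# ([1, 2, 1], [18.1, 18.4, 18.0])
```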

Third option would be to do cherry picking: when an alert enters Fink, we simply remove any i band information and let the rest flow through the pipeline. This has the advantage of keeping the downstream classifiers as they are. The disadvantage is that cherry picking has a cost in terms of cycles (the alert schema is nested).
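A sketch of this cherry picking on a simplified dict-based alert. The real ZTF alert is a nested Avro record, so the actual implementation would have to rewrite the nested DataFrame schema (hence the cost in cycles); field names below follow the alert packet, but the function is hypothetical.

```python
import copy

def strip_i_band(alert):
    """Return a copy of the alert with all i-band (fid == 3) entries removed
    from the history. Assumes the triggering candidate itself is g or r
    (i-band-triggered alerts would still be dropped upstream)."""
    out = copy.deepcopy(alert)
    out["prv_candidates"] = [
        p for p in out["prv_candidates"] if p["fid"] != 3
    ]
    return out

alert = {
    "candidate": {"fid": 1},
    "prv_candidates": [{"fid": 3, "magpsf": None}, {"fid": 2, "magpsf": 18.4}],
}
cleaned = strip_i_band(alert)
print([p["fid"] for p in cleaned["prv_candidates"]])  # [2]
```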

I would give the second option a chance -- but to be honest, this is a problem we will not have with LSST -- so if it turns out to be a mess, I will fall back to the first option.