MatthewJA closed this issue 7 years ago
Here's a Venn diagram of the three different data sets of ATLAS-CDFS objects used in crowdastro:
Franzen refers to the 2013 catalogue of ATLAS-CDFS objects, Norris refers to objects cross-identified by Norris et al. in the 2006 data release, and RGZ refers to objects presented to volunteers in Radio Galaxy Zoo.
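The region counts behind a three-set Venn diagram like this can be sketched with plain Python sets. This is a minimal illustration, not the code used in crowdastro: it assumes the objects in the three catalogues have already been cross-matched to a shared identifier, and the function name `venn_regions` and the identifiers below are hypothetical.

```python
def venn_regions(franzen, norris, rgz):
    """Count objects in each of the seven regions of a three-set Venn diagram.

    Each argument is an iterable of identifiers that are assumed to be
    comparable across catalogues (i.e. already cross-matched).
    """
    franzen, norris, rgz = set(franzen), set(norris), set(rgz)
    return {
        "franzen_only": len(franzen - norris - rgz),
        "norris_only": len(norris - franzen - rgz),
        "rgz_only": len(rgz - franzen - norris),
        "franzen_norris": len((franzen & norris) - rgz),
        "franzen_rgz": len((franzen & rgz) - norris),
        "norris_rgz": len((norris & rgz) - franzen),
        "all_three": len(franzen & norris & rgz),
    }


# Hypothetical identifiers, purely for illustration.
regions = venn_regions(
    franzen={"a", "b", "c", "d"},
    norris={"b", "c", "e"},
    rgz={"c", "d", "f"},
)
print(regions)
```

The seven counts always sum to the size of the union of the three sets, which is a quick sanity check on the cross-matching.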
Some more data investigation for labels:
It would be best to use the labels with the most agreement as the test set. I hope (and expect) that for about 500 of the 531 objects, a significant number of RGZ labellers agree. If so, we can use half of these as a test set and the remaining half for training. Of course, we can repeat this over 10 random splits.
For the rest of the training set, any object that has been labelled in at least one of the data sets can be used as training data.
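The splitting scheme described above could look roughly like the following. It is a sketch under assumptions, not the crowdastro implementation: `make_splits` is a hypothetical helper, the object identifiers are placeholders, and the counts (500 well-agreed objects, ~2300 objects in total) are only the rough figures mentioned in this thread.

```python
import random


def make_splits(agreed_ids, other_labelled_ids, n_splits=10, seed=0):
    """Build repeated random train/test splits.

    agreed_ids: objects whose labels have strong labeller agreement.
        Half of these form the test set in each split.
    other_labelled_ids: objects labelled by at least one data set.
        These are only ever used for training.
    """
    rng = random.Random(seed)
    agreed = sorted(agreed_ids)
    # Objects outside the agreed set never appear in a test set.
    extra = sorted(set(other_labelled_ids) - set(agreed))
    half = len(agreed) // 2

    splits = []
    for _ in range(n_splits):
        shuffled = agreed[:]
        rng.shuffle(shuffled)
        test = shuffled[:half]
        train = shuffled[half:] + extra
        splits.append((train, test))
    return splits


# Placeholder identifiers and counts, for illustration only.
agreed = [f"obj{i}" for i in range(500)]
others = [f"obj{i}" for i in range(450, 2300)]
splits = make_splits(agreed, others)
```

Keeping the test set restricted to well-agreed objects means that noisier labels only ever influence training, not evaluation.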
Here's a table with every object and its name in each of the three ATLAS-CDFS tables. The RGZ (Zooniverse) column is the Zooniverse ID and can be used on the radiotalk website to look up objects; the RGZ (source) column is the name in the `metadata.source` column in the MongoDB database.
Resolved! The problem is complicated but effectively amounts to a combination of a) using the wrong Franzen catalogue, b) misinterpreting the Norris catalogue, and c) RGZ only including the brightest component of a given component ID.
(~700 Norris classifications are now happily in the test set.)
Norris et al. (2006) look at a subset of the radio objects in ATLAS-CDFS. We're using a data release which has ~2300 objects; they use a version with ~700 (due to different flux thresholds). This means that ~1600 sources have no ground-truth label. In my honours work, this means that those ~1600 sources had an incorrect label throughout (maybe this explains why the classifier is so bad at classifying faint radio objects!).
We could
The first method seems the most obvious, but throws away a huge amount of CDFS. The latter would be pretty neat, noting that such a catalogue wouldn't need to be 100% accurate.
@jbanfield — any thoughts on this?