Clarify relationship between Norris/Middelberg cross-identifications and ATLAS catalogues

chengsoonong / crowdastro

Cross-identification of radio objects and host galaxies by applying machine learning on crowdsourced training labels.

MIT License


Clarify relationship between Norris/Middelberg cross-identifications and ATLAS catalogues #232

MatthewJA closed 7 years ago

MatthewJA commented 7 years ago

Norris et al. (2006) looks at a subset of the radio objects in ATLAS-CDFS. We're using a data release with ~2300 objects; they use a version with ~700 (due to different flux thresholds). This means that ~1600 sources have no ground-truth label (which, throughout my honours work, means that ~1600 sources have an incorrect label; maybe this explains why the classifier is so bad at classifying faint radio objects!).

We could:

1. Only look at CDFS objects labelled in the 2006 release?
2. Ignore the missing labels and use the whole CDFS field anyway?
3. Find another expert catalogue for CDFS? (I don't believe one exists)
4. Re-run the method presented in Norris et al. to develop our own expert catalogue?

The first option seems the most obvious, but it throws away a huge amount of CDFS. The last would be pretty neat, given that such a catalogue wouldn't need to be 100% accurate.

@jbanfield, any thoughts on this?

MatthewJA commented 7 years ago

Here's a Venn diagram of the three different data sets of ATLAS-CDFS objects used in crowdastro:

Franzen refers to the 2013 catalogue of ATLAS-CDFS objects, Norris refers to objects cross-identified by Norris et al. in the 2006 data release, and RGZ refers to objects presented to volunteers in Radio Galaxy Zoo.
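For anyone wanting to reproduce the counts behind a diagram like this, here is a minimal sketch of how the exclusive Venn regions could be tallied. It assumes each object already has one identifier shared across the three catalogues (in practice that cross-matching is the hard part), and the function name `venn_regions` and the toy identifiers are made up for illustration; none of this is taken from the crowdastro code.

```python
def venn_regions(catalogues):
    """Count objects in each exclusive region of a Venn diagram.

    `catalogues` maps a catalogue name to a set of object identifiers.
    Returns a dict mapping a frozenset of catalogue names to the number
    of objects that appear in exactly those catalogues and no others.
    """
    regions = {}
    all_ids = set().union(*catalogues.values())
    for obj in all_ids:
        # Which catalogues contain this object?
        members = frozenset(
            name for name, ids in catalogues.items() if obj in ids
        )
        regions[members] = regions.get(members, 0) + 1
    return regions


# Toy identifiers, invented for illustration only (not real ATLAS names).
toy = {
    "Franzen": {"a", "b", "c", "d"},
    "Norris": {"b", "c", "e"},
    "RGZ": {"c", "d", "f"},
}
regions = venn_regions(toy)
for members, count in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(" & ".join(sorted(members)), count)
```

Each key is the exact set of catalogues an object belongs to, so the counts correspond to the non-overlapping regions of the diagram rather than to the raw pairwise intersections.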

chengsoonong commented 7 years ago

Some more data investigation for labels:

- Report the RGZ labeller agreements. Does the distribution of agreement differ across the four subsets (intersection with Franzen and/or Norris)?
- Look at the 4 subsets which are intersections of two or more datasets, studying where the labels agree and disagree.

It would be best to have the most agreed labels as the test set. I hope/expect that there would be about 500 of the 531 where a significant number of RGZ labellers agree. If this is the case, then we can use half of this as a test set (and the remaining half for training). Of course, we can take 10 random splits.
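The selection and splitting described above could be sketched roughly as follows. This assumes agreement is summarised as a single fraction per object (for example, the share of labellers who picked the most popular host), and the 0.55 threshold, the function names, and the toy data are all placeholders rather than anything decided in this thread.

```python
import random


def high_agreement(agreement, threshold):
    """Return the objects whose labeller agreement meets the threshold."""
    return sorted(obj for obj, frac in agreement.items() if frac >= threshold)


def random_half_splits(objects, n_splits=10, seed=0):
    """Make n_splits random (test, train) halves of the given objects."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(n_splits):
        shuffled = list(objects)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        splits.append((shuffled[:half], shuffled[half:]))
    return splits


# Toy agreement fractions 0.0, 0.1, ..., 1.0, invented for illustration.
toy_agreement = {"obj%d" % i: i / 10 for i in range(11)}
agreed = high_agreement(toy_agreement, threshold=0.55)
splits = random_half_splits(agreed, n_splits=10)
```

Each split is an independent shuffle, so the same object can land in the test half of one split and the training half of another.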

For the rest of the training set, if any data set has labelled an object, then we can use that label as training data.
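As a rough sketch of that rule: the extra training objects are those labelled by at least one data set, minus anything held out for testing. The function name `training_pool` and the toy identifiers are hypothetical, and the sketch deliberately says nothing about which label wins when two data sets disagree.

```python
def training_pool(labelled_by, test_ids):
    """Objects labelled by at least one data set, excluding the test set.

    `labelled_by` maps a data set name to the set of objects it labels.
    """
    labelled_anywhere = set().union(*labelled_by.values())
    return labelled_anywhere - set(test_ids)


# Toy identifiers, invented for illustration only.
toy_labelled_by = {"Norris": {"a", "b"}, "RGZ": {"b", "c", "d"}}
toy_test_ids = {"b"}
pool = training_pool(toy_labelled_by, toy_test_ids)
```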

MatthewJA commented 7 years ago

Here's a table with every object and its name in each of the three ATLAS-CDFS tables. The RGZ (Zooniverse) column is the Zooniverse ID and can be used on the radiotalk website to look up objects; the RGZ (source) column is the name in the metadata.source column in the MongoDB database.

cdfs.txt

MatthewJA commented 7 years ago

Resolved! The problem is complicated but effectively amounts to a combination of a) using the wrong Franzen catalogue, b) misinterpreting the Norris catalogue, and c) RGZ only including the brightest component of a given component ID.

MatthewJA commented 7 years ago

(~700 Norris classifications are now happily in the test set.)