STScI-Citizen-Science / MTPipeline

Pipeline to produce CR rejected, astrodrizzled, png's of HST WFPC2 solar system data.
6 stars 1 forks source link

Analysis of mtdump-2014-01-19 database dump #114

Closed acviana closed 10 years ago

acviana commented 10 years ago

We have a database of most of the ephemerides for the FITS images that were processed prior to the summer. We need to do some digging into that database dump to see how much more data need to be processed and/or reprocessed.

To start off load this MySQL database dump into your localhost: /astro/mtpipeline/database_dumps/mtdump-2014-01-19.sql

Once you've done that show the command and the output to list the tables in the database.

walyssonBarbosa commented 10 years ago

Here are the command and output to list the tables:

mysql> show tables;
+----------------------+
| Tables_in_test_wally |
+----------------------+
| finders              |
| master_finders       |
| master_images        |
| sub_images           |
+----------------------+
4 rows in set (0.00 sec)
acviana commented 10 years ago

Great. Now show me the following:

walyssonBarbosa commented 10 years ago
mysql> select count(*) from master_images;
+----------+
| count(*) |
+----------+
|    12988 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from master_images where drz_mode <> "center" and drz_mode <> "wide";
+----------+
| count(*) |
+----------+
|     3096 |
+----------+
1 row in set (0.00 sec)
acviana commented 10 years ago

Good, now show:

Do these numbers look right? Do the same thing for the center images.

walyssonBarbosa commented 10 years ago
mysql> select count(*) from master_images where drz_mode = "wide";
+----------+
| count(*) |
+----------+
|     5443 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from master_images where name like "%_cr_%" and drz_mode = "wide";
+----------+
| count(*) |
+----------+
|     2932 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from master_images where drz_mode = "center";
+----------+
| count(*) |
+----------+
|     4449 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from master_images where name like "%_cr_%" and drz_mode = "center";
+----------+
| count(*) |
+----------+
|     2158 |
+----------+
1 row in set (0.01 sec)

There's no cr_mode attribute in master_images.

acviana commented 10 years ago

Really? Do you see the same thing I see?

> DESCRIBE master_images;
+------------------+--------------+------+-----+---------+----------------+
| Field            | Type         | Null | Key | Default | Extra          |
+------------------+--------------+------+-----+---------+----------------+
| id               | int(11)      | NO   | PRI | NULL    | auto_increment |
| project_id       | int(11)      | YES  |     | NULL    |                |
| name             | varchar(50)  | YES  | UNI | NULL    |                |
| fits_file        | varchar(50)  | YES  | UNI | NULL    |                |
| object_name      | varchar(50)  | YES  |     | NULL    |                |
| set_id           | int(11)      | YES  |     | NULL    |                |
| set_index        | int(11)      | YES  |     | NULL    |                |
| width            | int(11)      | YES  |     | NULL    |                |
| height           | int(11)      | YES  |     | NULL    |                |
| minimum_ra       | double       | YES  |     | NULL    |                |
| minimum_dec      | double       | YES  |     | NULL    |                |
| maximum_ra       | double       | YES  |     | NULL    |                |
| maximum_dec      | double       | YES  |     | NULL    |                |
| pixel_resolution | float        | YES  |     | NULL    |                |
| priority         | int(11)      | YES  |     | NULL    |                |
| description      | varchar(50)  | YES  |     | NULL    |                |
| file_location    | varchar(100) | YES  |     | NULL    |                |
| visit            | varchar(11)  | YES  |     | NULL    |                |
| orbit            | int(11)      | YES  |     | NULL    |                |
| drz_mode         | varchar(6)   | YES  |     | NULL    |                |
| cr_mode          | varchar(6)   | YES  |     | NULL    |                |
+------------------+--------------+------+-----+---------+----------------+
21 rows in set (0.00 sec)
mysql> SELECT DISTINCT cr_mode FROM master_images;
+---------+
| cr_mode |
+---------+
| no_cr   |
| cr      |
+---------+
2 rows in set (0.04 sec)
walyssonBarbosa commented 10 years ago

I see this:

mysql> DESCRIBE master_images;
+------------------+--------------+------+-----+---------+----------------+
| Field            | Type         | Null | Key | Default | Extra          |
+------------------+--------------+------+-----+---------+----------------+
| id               | int(11)      | NO   | PRI | NULL    | auto_increment |
| project_id       | int(11)      | YES  |     | NULL    |                |
| name             | varchar(50)  | YES  | UNI | NULL    |                |
| fits_file        | varchar(50)  | YES  | UNI | NULL    |                |
| object_name      | varchar(50)  | YES  |     | NULL    |                |
| set_id           | int(11)      | YES  |     | NULL    |                |
| set_index        | int(11)      | YES  |     | NULL    |                |
| width            | int(11)      | YES  |     | NULL    |                |
| height           | int(11)      | YES  |     | NULL    |                |
| minimum_ra       | double       | YES  |     | NULL    |                |
| minimum_dec      | double       | YES  |     | NULL    |                |
| maximum_ra       | double       | YES  |     | NULL    |                |
| maximum_dec      | double       | YES  |     | NULL    |                |
| pixel_resolution | float        | YES  |     | NULL    |                |
| priority         | int(11)      | YES  |     | NULL    |                |
| description      | varchar(50)  | YES  |     | NULL    |                |
| file_location    | varchar(100) | YES  |     | NULL    |                |
| visit            | varchar(11)  | YES  |     | NULL    |                |
| orbit            | int(11)      | YES  |     | NULL    |                |
| drz_mode         | varchar(6)   | YES  |     | NULL    |                |
+------------------+--------------+------+-----+---------+----------------+
20 rows in set (0.00 sec)
acviana commented 10 years ago

Ok, try loading /astro/mtpipeline/database_dumps/mtdump-2014-07-10.sql and running all those commands again.

walyssonBarbosa commented 10 years ago
mysql> show tables;
+------------------+
| Tables_in_mtpipe |
+------------------+
| finders          |
| master_finders   |
| master_images    |
| sub_images       |
+------------------+
4 rows in set (0.00 sec)
mysql> select count(*) from master_images;
+----------+
| count(*) |
+----------+
|    12988 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from master_images where drz_mode <> "center" and drz_mode <> "wide";
+----------+
| count(*) |
+----------+
|     3096 |
+----------+
1 row in set (0.00 sec)
walyssonBarbosa commented 10 years ago
mysql> select count(*) from master_images where drz_mode = "wide";
+----------+
| count(*) |
+----------+
|     5443 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from master_images where drz_mode = "wide" and cr_mode = "no_cr";
+----------+
| count(*) |
+----------+
|     3302 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from master_images where drz_mode = "wide" and cr_mode = "cr";
+----------+
| count(*) |
+----------+
|     2141 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from master_images where drz_mode = "center";
+----------+
| count(*) |
+----------+
|     4449 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from master_images where drz_mode = "center" and cr_mode = "no_cr";
+----------+
| count(*) |
+----------+
|     2308 |
+----------+
1 row in set (0.01 sec)
mysql> select count(*) from master_images where drz_mode = "center" and cr_mode = "cr";
+----------+
| count(*) |
+----------+
|     2141 |
+----------+
1 row in set (0.00 sec)
walyssonBarbosa commented 10 years ago

It seems to be missing 1161 wide cr files and 167 center cr files.

So I think we need to process 3096 files and reprocess 1328 files.

acviana commented 10 years ago

Yeah, things don't look great from this dump in terms of completeness.

drz_mode cr_mode records
center no_cr 2308
center cr 2141
wide no_cr 3302
wide cr 2141

What I'm trying to figure out here is if this dump is worth saving or if we should start over from scratch. most of the computation time is in the ephemerides retrieval though. From @kevinfhale's work on Issue #89 we know each request takes <1 sec with the new CGI request interface, so we can estimate this processing time.

Find the number of records in the master_finders table that are connected tomaster_images results with either a wide or center drz_mode. You can do this with a JOIN on the master_images.id and master_finders.master_images_id fields.

Also, give me the number of _single_sci.fits files in the file system, this will give an idea of the completeness of the current database dump. Between these two number we should be able to get some idea of the total processing time required to rebuild this table and thus if it's the right thing to do.

walyssonBarbosa commented 10 years ago
mysql> select count(*) from master_finders inner join master_images on master_images.id = master_finders.master_images_id where master_images.drz_mode = "wide" or master_images.drz_mode = "center";
+----------+
| count(*) |
+----------+
|   225938 |
+----------+
1 row in set (0.06 sec)

I hope the SQL query is correct.

walyssonBarbosa commented 10 years ago

Here is the script I used to get this number:

import glob
list_of_fits = glob.glob('/astro/mtpipeline/mtpipeline_outputs/wfpc2/*_*/*_single_sci.fits')
print len(list_of_fits)
acviana commented 10 years ago

OK, so let's refine that last estimate a little. Is the number of *_single_sci.fits images the same as the number of images that need ephemerides? What objects do need ephemerides for? Hint, this is an astronomy question not a software question.

Based on that can you refine that number above to give a more accurate list of the number of *_single_sci.fits images we need ephemerides for?

walyssonBarbosa commented 10 years ago

No, it isn't. The objects that need ephemerides are: c0m_wide_single_sci.fits, c0m_center_single_sci.fits, cr_c0m_wide_single_sci.fits ands cr_c0m_center_single_sci.fits.

The number of *_single_sci.fits files in the system that need ephemerides is 42496:

There are 112565 images without ephemerides in the database (ephem_x or ephem_y is NULL).

There are ephemerides for 113373 images in the database. We still need ephemerides for 184099 *_single_sci.fits images.

acviana commented 10 years ago

Yes, that's true. However, there is another criteria I was thinking of: moons. The purpose of the ephemerides, as I understand it, is mark incidental observations. Meaning, if Jupiter is the target, I know that Jupiter will be in the image. But, will any of the moons? The only way to calculate that is the find the epemerides for all the moons of each planet.

So where I'm going with this is that we need a way to determine which targets have moons. Obviously this is just the planets, but there is no standardized naming convention enforced. Meaning that if you were observing Jupiter with WFPC2 in the 90s you were free to identify your target as jupiter, JUPITER, Jupiter, Jup, Jup-Red-Spot, Jupiter-S-Hemi, etc. So we'll have to be smart about how we pick those observations out from the db.

acviana commented 10 years ago

Any thought on a query of either the file system or the database that would return the list of observations that have moons?

walyssonBarbosa commented 10 years ago

So, I need to think of a query that would return the number of observations that have moons. But do I need to search among the planets or just the moons? Do the moons need to be in a planet image?

acviana commented 10 years ago

So I think you would need to start with a list of all the target names. You could get that from the DB or from the folder names. Then you need to figure out which ones are planets or moons. I would include moons because given the field of view (FOV) of the cameras it's likely that an exposure of any one moon would include others.

Basically, you want to exclude the observations of comets, asteroids, that one planet without any moons. Those guys.

walyssonBarbosa commented 10 years ago

The moons are listed in mtpipeline/ephem/planets_and_moons.txt, right?

What we could do is get those moons, do a query in the db and compare to see how many of them the db has stored. Do you think this would work?

acviana commented 10 years ago

The names of the moons are listed in mtpipeline/ephem/planets_and_moons.txt but the problem is that those names aren't the names of the target so they won't be listed. Instead I think we need to look at the list of unique target names and decide which ones have moons by the target name and skip the rest.

Then we'll have to figure out a way to the build_master_finders_table.py to only use those records.

walyssonBarbosa commented 10 years ago

Where are the target names located?

walyssonBarbosa commented 10 years ago

This is the script I implemented using the idea I had:

#! /usr/bin/env python

from mtpipeline.get_settings import SETTINGS
from mtpipeline.database.database_interface import Base, engine, session
from mtpipeline.database.database_interface import MasterImages

def get_moons(moons):
    moons_list = []
    with open(moons, 'r') as moons_file:
        for obj in moons_file:
            pair = obj.split(' ')
            if len(pair) > 3:
                if pair[0][-2:] != '99':
                    moons_list.append(pair[1].upper())
    return moons_list

def db_moons():
    count = 0
    found = []
    moons_list = get_moons('mtpipeline/ephem/planets_and_moons.txt')
    for moon in moons_list:
        count += len(session.query(MasterImages.object_name).filter(MasterImages.object_name.like('%' + moon + '%')).all())
    return count

if __name__ == '__main__':
    print db_moons()

It returns 6460 objects.

walyssonBarbosa commented 10 years ago

I checked the filesystem and I got the same result. There are 6460 moon observations in the filesystem.

acviana commented 10 years ago

Run a query to see how many unique object_name values there are in master_images. object_name is the same as targname in the FITS header.

From that list we need to figure out which targets have moons and find a way to save that information so that when we run build_master_finders_table.py it populates those records with none or something for the appropriate fields.

walyssonBarbosa commented 10 years ago

There are 721 distinct objects in the filesystem and 113 of them are moon observations.

I have a problem with my MySQL again. I will show the numbers from the db when I solve it.

walyssonBarbosa commented 10 years ago

Jim rebuilt the MySQL package. I'll have to reprocess everything again. However, I still can't open MySQL in my machine. I sent another email to support but no one answered me yet.

acviana commented 10 years ago

That's frustrating. Not a big deal because it was only a few hours of processing but if this happens again take a database dump before letting them work on it.

walyssonBarbosa commented 10 years ago

I wouldn't be able to do it because I couldn't use MySQL, only if I had saved it when I migrated the data to master_finders.

acviana commented 10 years ago

Right, it's fine. Everyone kills a DB once and learns their lesson about backups :smile:

walyssonBarbosa commented 10 years ago

So, there are 721 distinct objects in the master_images table and 113 of them are moon observations.

walyssonBarbosa commented 10 years ago

What should I do about build_master_finders_table.py?

We could store the list of moons in a file and use it while running the script.

acviana commented 10 years ago

So how did you determine which of the objects in master_images were "moon observations"?

walyssonBarbosa commented 10 years ago

Using mtpipeline/ephem/planets_and_moons.txt (I did not include the planets present on this list). If the object_name in master_images had the name of a moon in the txt file, then I supposed it would be a moon observation.

acviana commented 10 years ago

Can you share, say, the 1st 50 object_names that your method picks as not having moons?

walyssonBarbosa commented 10 years ago

Here are the 1st 50:

DIONE
HERMIONE
IAPETUS-CASSINI-REGIO
IO
IO-018
IO-090
IO-162
IO-234
IO-306
IO-CML154-LARGE
IO-CML199-MEDIUM
IO-CML23-LARGE
IO-CML23-MEDIUM
IO-CML272-LARGE
IO-CML331-LARGE
IO-CML333-MEDIUM
IO-CML99-MEDIUM
IO-ECL-EGRESS
IO-ECLIPSE
IO-ECLIPSE-NOCML
IO-ECLIPSED
IO-LON-045
IO-LON-135
IO-LON-225
IO-LON-315
IO-LON100
IO-LON160
IO-LON215
IO-LON250
IO-LON320
IO-LON340
IO-LON340-2
IO-LON70
IO-NUV-E
IO-NUV-W
IO-TORUS-135
IO-TORUS-225
IO-TORUS-225-1
IO-TORUS-315
IO-TORUS-45
IO-TORUS-ECLIPSE
IO-TORUS-WEST
IO-TRANSIT
IO-V1-LON165
IO-V2-I24B
IO-V3-ACALA
IO-W-FUV
IO-WEST
JUP-IOFOOT-IMAGE
JUPITER-AURORA-IO
acviana commented 10 years ago

Are those the rejected ones? Because 100% of those have moons.

Maybe this is the point that I haven't been clear about, we're not looking for observations where the object_name is a moon. We're looking for observations that may contain moons. The trivial case is when the moon is the target.

The other part of the problem is finding all the observations that could contain moons. For example, any observations of Mars or either of it's moons, Phobos and Deimos, should be checked for incidental moon observations.

Both of these are inherently string parsing issues. Take a closer look at the list you just posted and you should see the issue.

walyssonBarbosa commented 10 years ago

Sorry, I got the wrong ones. Here are the ones you asked:

00CE105
00CF105
00CG105
00CH105
00CK105
00CL104
00CO105
00CP104
00CQ105
00CQ114
00EE173
00FD8
00FS53
00GE147
00GP183
00GV146
00JF81
00JG81
00KK4
00KL4
00OH67
00OK67
00OL67
00ON67
00OO67
00PD30
00PX29
00QE226
00WT169
00WV12
00WW12
00YA2
00YB2
00YH2
00YX1
01HY65
01HZ58
01KF76
01KG76
01KL76
01KN76
01OQ108
01PK47
01QB298
01QD298
01QE298
01QJ298
01QQ322
01QR297
01QR322
acviana commented 10 years ago

Hah, OK, that's better :smile:

Weird, what are these?

walyssonBarbosa commented 10 years ago

I think most of them are KBOs. 00CE105 is in /astro/mtpipeline/mtpipeline_outputs/wfpc2/09060_kbos/.

acviana commented 10 years ago

Yeah, that's probably what it is. But I'm surprised that only 113 of the 721 images are either planets or moons. I can't help but suspect that there's some string parsing issue going on. Take a look at the list and see if anything pops out at you.

walyssonBarbosa commented 10 years ago

No, 113 of them are observations that have moon. I didn't include the planets present in the planets_and_moons.txt file. And it doesn't have the list of all planets.

acviana commented 10 years ago

Include the planets. Finding all the variations in the planet names is actually the problem I want you to solve.

walyssonBarbosa commented 10 years ago

From the 42496 records in master_images, 28908 of them are observations for planets and moons.

The number of distinct observations for planets and moons is 296 (from a total of 721).

acviana commented 10 years ago

Ok so about 75% of the total records and 40% of the total targets. That seems more reasonable to me. And are the excluded distinct observations all looking like KBOs, asteroids, comets, and such?

walyssonBarbosa commented 10 years ago

Yes, they do (like ELEKTRA, VESTA, HALE-BOPP). However, there might be some that are dwarf planets observations like CERES and clusters like NGC1850-CENTER.