Currently running `build_master_images_table.py` on `mtpipeline_dev`. According to your notes, it will take at least 1 hour and 20 minutes to run all 42496 images.
I think it will take longer than the estimated time.
It seems to only be about halfway through after ~90 min:

```
$ grep -c Processing build_master_images_table_2014-07-15-16-30.log
24855
$ tail -1 build_master_images_table_2014-07-15-16-30.log
07/15/2014 18:00:25 PM MainProcess INFO: Processing /astro/mtpipeline/mtpipeline_outputs/wfpc2/07429_uranus/png/u43h0105m_cr_c0m_wide_single_sci_linear.png
```
Take a look at your memory usage. If it's filling up, then maybe I should commit the session every few thousand records.
It's almost filling up.
Are we hitting the swap?
It's using 387.8 MB of swap.
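For reference, committing the session every few thousand records, as suggested above, could look something like the sketch below, assuming the ingest loop uses a SQLAlchemy session. The names (`ingest`, the connection string) are illustrative, not the actual mtpipeline code.

```python
# Sketch: commit the session in batches so pending records don't pile up
# in memory for the whole run. Names here are illustrative.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

BATCH_SIZE = 5000  # "every few thousand records"

def ingest(records, connection_string):
    '''Add records to the database, committing every BATCH_SIZE inserts.'''
    engine = create_engine(connection_string)
    session = sessionmaker(bind=engine)()
    for count, record in enumerate(records, start=1):
        session.add(record)
        if count % BATCH_SIZE == 0:
            session.commit()  # flush pending objects; committed ones can be garbage-collected
    session.commit()  # commit the final partial batch
    session.close()
```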
Plot the frequency of the word `Processing` as a function of time in the log file. I want to see if there is a discontinuity part-way through that would indicate hitting the swap:

```
grep Processing build_master_images_table_2014-07-15-16-30.log
```
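A short script along these lines could turn the log into that plot (a sketch, not the actual code used; the timestamp format is taken from the log line shown above, and it produces elapsed time vs. record count, the form discussed below):

```python
# Sketch: parse the timestamp of every "Processing" line and plot
# elapsed time against the number of records ingested so far.
from datetime import datetime
import matplotlib.pyplot as plt

timestamps = []
with open('build_master_images_table_2014-07-15-16-30.log') as f:
    for line in f:
        if 'Processing' in line:
            # e.g. "07/15/2014 18:00:25 PM MainProcess INFO: Processing ..."
            date, time = line.split()[:2]
            timestamps.append(datetime.strptime(date + ' ' + time, '%m/%d/%Y %H:%M:%S'))

start = timestamps[0]
elapsed_min = [(t - start).total_seconds() / 60.0 for t in timestamps]

plt.plot(range(1, len(timestamps) + 1), elapsed_min)
plt.xlabel('Records ingested')
plt.ylabel('Elapsed time (minutes)')
plt.show()
```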
Here's the plot:

[plot: elapsed time vs. number of records ingested]
So, interpret the plot for me. What do you see? What, if anything, does this suggest for our code?
After about 40 minutes of running the script, something happened that slowed the process down for a while; maybe it hit the memory limit.
I'm seeing something more complex than that. It's a little hard to tell because you are plotting time as a function of the number of ingested records and not the other way around. Also, you are plotting the total number and not the rate of ingestion. But you can infer both of these things.
Try holding a pen or a ruler up to the line on the plot. You'll notice that the rate is constant, then slows down, then speeds back up, so that at the end it's ingesting at almost the same speed as at the beginning.
I would like to hear a little analysis from you as to what this means, not just a description of the plot. Do you think this is indicative of hitting the swap? Do you think this is something we need to address now before we ingest the WFC3 and ACS datasets? How bad is the problem?
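As an aside, the rate itself could be plotted directly with something like the following sketch (again assuming the log format above; the window size is an arbitrary smoothing choice):

```python
# Sketch: ingestion rate (records/minute) over a rolling window,
# instead of the cumulative count.
from datetime import datetime
import matplotlib.pyplot as plt

timestamps = []
with open('build_master_images_table_2014-07-15-16-30.log') as f:
    for line in f:
        if 'Processing' in line:
            date, time = line.split()[:2]
            timestamps.append(datetime.strptime(date + ' ' + time, '%m/%d/%Y %H:%M:%S'))

WINDOW = 500  # records per window
rates, midpoints = [], []
for i in range(0, len(timestamps) - WINDOW, WINDOW):
    span_min = (timestamps[i + WINDOW] - timestamps[i]).total_seconds() / 60.0
    rates.append(WINDOW / span_min)
    midpoints.append((timestamps[i] - timestamps[0]).total_seconds() / 60.0)

plt.plot(midpoints, rates)
plt.xlabel('Elapsed time (minutes)')
plt.ylabel('Ingestion rate (records/minute)')
plt.show()
```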
Using a ruler, I noticed that it started slowing down after 23 minutes, even before it had processed 10 thousand files. Then, as you said, after an hour or so the rate became constant again. I can't tell if that's indicative of hitting the swap; for that I would have had to track the memory usage at every stage of the process. On average it took around 0.2 seconds to process each file (24855 files in ~90 minutes), roughly the same rate as in your notes when you thought the time was wrong; I didn't calculate the standard deviation, though.
We should look into this before going through ACS and WFC3, though I don't know exactly what to do. If we start working on ACS and WFC3 now, processing all the files may take much longer than if this issue were fixed first, assuming there is a fix for it, of course.
I am running `build_master_images_table.py` again and it's taking ~0.12 s to process each file.
:thumbsup:
If this has been completely reloaded and a database dump has been created, we can close this.
Based on the information you provided in #114, I've decided it's faster to build a new database.
Once you have completed the script in #121, modify your `settings.yaml` file to connect to a new database on your localhost called something like `mtpipeline-dev`. This will be a development database that will eventually be used to replace the database dump I sent you. Use your `database_reset.py` script to create all the tables (a sketch of what that could look like is below). Finally, run `build_master_images_table.py` on the new database. This should take a few hours to run. You can estimate the run time using my notes in #103 and see how the actual run time compares.
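For context, a table-reset script in a SQLAlchemy setup like this one typically amounts to something like the following sketch; the real `database_reset.py` and its import paths may differ, and the connection string here is hypothetical.

```python
# Sketch of what a database_reset.py could look like with SQLAlchemy
# declarative models; the real script may differ.
from sqlalchemy import create_engine

from mtpipeline.database.database_interface import Base  # hypothetical import path

def reset_database(connection_string):
    '''Drop and recreate every table defined on the declarative Base.'''
    engine = create_engine(connection_string)
    Base.metadata.drop_all(engine)    # remove existing tables, if any
    Base.metadata.create_all(engine)  # recreate them from the model definitions

if __name__ == '__main__':
    # Hypothetical connection string for the new development database.
    reset_database('mysql://user:password@localhost/mtpipeline-dev')
```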